It takes 1 min 5 sec to read 19 rows from a BigQuery full table scan of a table with around 10 columns (about 15 KB in size) using Apache Beam SDK 2.50. On Apache Beam 2.44 the same read takes just 4 seconds. The behaviour is the same with the direct runner and with the Dataflow runner. Is there a workaround? I cannot avoid a full table scan.
I am using the code below.
There are several strategies you might want to consider to enhance the performance of BigQuery table reads with the Apache Beam Java SDK 2.50:
1. Use a Different BigQuery IO Method: Apache Beam supports two methods for reading data from BigQuery: direct read (Method.DIRECT_READ), which uses the BigQuery Storage Read API, and export (Method.EXPORT), which first extracts the table to files in a temporary Cloud Storage location and then reads those files. The export method might be more efficient for certain workloads. To switch to the export method, modify your read as follows:
withMethod(Method.EXPORT)
Do note that the export method can introduce latency of its own, since it requires exporting data from BigQuery to a temporary location before reading.
2. Switch Runners: If you're using the DirectRunner (which runs locally), consider switching to the DataflowRunner, especially if dealing with larger datasets. The DataflowRunner, being a distributed runner on Google Cloud, might offer better performance. The runner is typically specified during pipeline creation or execution, not within the I/O transform.
3. Optimize Your Query: Even though you mentioned that you cannot avoid a full table scan, ensure that any other parts of your query are optimized. Filtering and reducing the amount of data read can significantly improve performance.
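As a rough illustration of point 3: if even a partial reduction of the data ever becomes possible, the Storage Read API path lets you push column selection and a row filter into the read itself. This is a minimal sketch, assuming the usual BigQueryIO and java.util.Arrays imports; the table, field names, and filter are placeholders for illustration only:
PCollection<TableRow> rows = pipeline.apply(
    "Read only what is needed",
    BigQueryIO.readTableRows()
        .from("my-project:my-dataset.my-table")               // placeholder table
        .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)  // Storage Read API
        .withSelectedFields(Arrays.asList("id", "status"))    // prune columns server-side
        .withRowRestriction("status = 'ACTIVE'"));            // filter rows server-side
Note that withSelectedFields and withRowRestriction only take effect with the DIRECT_READ method.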
The following code shows a simplified example of how to read data from BigQuery using the EXPORT method and the DataflowRunner:
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class BigQueryReadWorkaround {
  public static void main(String[] args) {
    // Build options from the command-line args and select the Dataflow runner.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    Pipeline pipeline = Pipeline.create(options);

    // Read from BigQuery using the EXPORT method. Extract files are staged under
    // the pipeline's temp location in Cloud Storage before being read.
    PCollection<TableRow> tableRows = pipeline.apply(
        "Read from BigQuery",
        BigQueryIO.readTableRows()
            .from(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("my-dataset")
                .setTableId("my-table"))
            .withMethod(Method.EXPORT));

    // Further processing of table rows.
    // ...

    pipeline.run();
  }
}
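As a usage note, rather than hard-coding setRunner as above, the runner and related settings are typically passed as pipeline options on the command line; the project, region, and bucket below are placeholders:
--runner=DataflowRunner --project=my-project --region=us-central1 --tempLocation=gs://my-bucket/tmp
The temp location matters here for two reasons: Dataflow needs it for staging, and the EXPORT method writes its extract files there unless you override it with withCustomGcsTempLocation on the read.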
Thanks @ms4446. We have intermediate knowledge of BQ and Dataflow, and, sorry to say, the answers seem very basic. We are talking about a large-scale production deployment (not a PoC), with years of effort that went into designing and building the ETL architecture on GCP using BQ and Dataflow with the Apache Beam Java SDK. Anyway, let me address your answers in more detail.
1. It is pretty obvious that FILE EXPORTS are the slowest, which is why we are using DIRECT table reads. As I mentioned clearly, we cannot AVOID a full table scan (meaning there is no scope to use predicates to filter query results or to choose only the required columns). This may seem an antipattern, but it makes sense since this is our sandbox environment before deploying to production. The BQ table is a temp table we use for testing.
2. We do not have the option of migrating to a different runner. The direct runner is our local DEV environment, and Dataflow is our remote runner of choice.
3. The query cannot be optimized, for the reasons I mentioned in #1.
Hi @dheerajpanyam,
I understand your concerns and the complexity of your production environment. Given the constraints and the information provided, it might be beneficial to reach out directly to Google Cloud Support with detailed logs and metrics to get more specialized assistance.