
Need (workaround) solution to improve BQ table reads in Apache Beam SDK 2.50 Java [Urgent]

It takes 1 min 5 sec to do a full table scan of a BQ table with 19 rows and around 10 columns (about 15 KB in size) with Apache Beam SDK 2.50. On Apache Beam 2.44 the same read takes just 4 seconds. The behaviour is the same with the direct runner and with the Dataflow runner. Is there a workaround solution? I cannot avoid the full table scan.

I am using the code below:

PCollection<TableRow> tableRows =
    pipeline.apply(
        "Read from BigQuery",
        BigQueryIO.readTableRows()
            .from(
                new TableReference()
                    .setProjectId(params.getTableProjectName())
                    .setDatasetId(params.getTableDatasetName())
                    .setTableId(params.getTableName()))
            .withMethod(Method.DIRECT_READ));

3 REPLIES

There are several strategies you might want to consider to enhance the performance of BigQuery table reads in Apache Beam SDK 2.50 Java:

1. Use a Different BigQuery IO Method: Apache Beam supports two methods for reading data from BigQuery: direct read (Method.DIRECT_READ), which uses the BigQuery Storage Read API, and export (Method.EXPORT), which is the default and exports the table to temporary files before reading them. The export method might be more efficient for certain workloads, so it is worth benchmarking for this table. To switch to the export method, modify your code as follows:

.withMethod(Method.EXPORT)

Do note that the export method introduces some latency of its own, since it first exports the data from BigQuery to a temporary Cloud Storage location before reading it.

2. Switch Runners: If you're using the DirectRunner (which runs locally), consider switching to the DataflowRunner, especially if dealing with larger datasets. The DataflowRunner, being a distributed runner on Google Cloud, might offer better performance. The runner is typically specified during pipeline creation or execution, not within the I/O transform.

3. Optimize Your Query: Even though you mentioned that you cannot avoid a full table scan, ensure that any other part of the read is optimized. Filtering rows and reducing the number of columns read can significantly improve performance; see the sketch right after this list.
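If even a subset of columns or a coarse predicate is acceptable, DIRECT_READ can push both down to the Storage Read API via withSelectedFields and withRowRestriction on the read transform. Here is a minimal sketch, reusing the pipeline and params objects from your snippet; the column names and the row restriction are placeholders, not your actual schema:

import java.util.Arrays;

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;

// Inside your existing pipeline construction.
PCollection<TableRow> tableRows =
    pipeline.apply(
        "Read from BigQuery",
        BigQueryIO.readTableRows()
            .from(
                new TableReference()
                    .setProjectId(params.getTableProjectName())
                    .setDatasetId(params.getTableDatasetName())
                    .setTableId(params.getTableName()))
            .withMethod(Method.DIRECT_READ)
            // Column pruning: only the fields listed here are read (placeholder names).
            .withSelectedFields(Arrays.asList("id", "status", "updated_at"))
            // Optional filter pushed down to the Storage Read API (placeholder predicate).
            .withRowRestriction("status = 'ACTIVE'"));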

The following code shows a simplified example of how to read data from BigQuery using the EXPORT method and the DataflowRunner:

import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class BigQueryReadWorkaround {

  public static void main(String[] args) {
    // The runner is chosen via pipeline options (e.g. --runner=DataflowRunner on the
    // command line), not inside the I/O transform.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    options.setRunner(DataflowRunner.class);

    Pipeline pipeline = Pipeline.create(options);

    // Read from BigQuery using the EXPORT method: the table is exported to temporary
    // files under the pipeline's GCS temp location and then read back.
    PCollection<TableRow> tableRows = pipeline.apply(
        "Read from BigQuery",
        BigQueryIO.readTableRows()
            .from(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("my-dataset")
                .setTableId("my-table"))
            .withMethod(Method.EXPORT));

    // Further processing of table rows.
    // ...

    pipeline.run();
  }
}
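Because EXPORT stages the table as files, the pipeline's --tempLocation must point to a Cloud Storage path the job can write to; DIRECT_READ does not need this export staging step.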

Additional Tips

  • Make sure that you have the necessary permissions and that the BigQuery Storage API is enabled if you are using the DIRECT_READ method.
  • Regularly check the Apache Beam documentation or community forums for any updates, known issues, or optimizations related to the SDK version that you are using.

Thanks @ms4446. We have intermediate knowledge of BQ and Dataflow, and sorry to say the answers seem very basic. We are talking about a large-scale production deployment (not a PoC), with years of effort that went into designing and building the ETL architecture on GCP using BQ and Dataflow with the Apache Beam Java SDK. Anyway, let me address your answers in more detail.

1. It is pretty obvious that FILE EXPORTS are the slowest, which is why we are using DIRECT table reads. As I mentioned clearly, we cannot AVOID the full table scan (meaning there is no scope to use predicates to filter the query results, nor to choose only the required columns). This may seem like an antipattern, but it makes sense since this is our sandbox environment before deploying to production. The BQ table is a temp table we use for testing.

2. We do not have the option to migrate to a different runner. The direct runner is our local DEV environment and Dataflow is our remote runner of choice.

3. The query cannot be optimized, for the reasons I mentioned in #1.

Hi @dheerajpanyam,

I understand your concerns and the complexity of your production environment. Given the constraints and the information provided, it might be beneficial to reach out directly to Google Cloud support with detailed logs and metrics to get more specialized assistance.