
BQ I/O Connector - Storage Read API (Direct table access) - Performance degradation seen in Apache Beam 2.50

Hello,

We are seeing Dataflow pipelines take 2x to 3x longer to run on Apache Beam SDK 2.50 than on Apache Beam SDK 2.44. While troubleshooting, we compared the DAGs in 2.44 and 2.50. The BQ read-from-table step (a full table scan using DIRECT_TABLE_ACCESS) takes 3 sec to read 19 records (13 KB) in 2.44, while the exact same pipeline, reading the same 19 records and 13 KB, takes 1 min 5 sec in 2.50. Has this API degraded in 2.50? I also see that throughput for this DAG step is much higher in 2.44 than in 2.50. The throughput graphs (elements/sec) for both versions are below.

Throughput in ver 2.44 --> 0.15 sec (High)

Throughput in ver 2.50 --> 0.083 sec (Low)

[Throughput graphs: Apache Beam ver 2.44 and Apache Beam ver 2.50]
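As a rough cross-check of the step timings quoted above, the sketch below computes the average throughput of the read step and how much slower it is in 2.50. The function names are illustrative only, and these are simple averages over the whole step duration, so they are not expected to match the values read off the throughput graphs.

```python
# Figures quoted in the post for the BigQuery read step.
RECORDS = 19
SECONDS_2_44 = 3.0   # "3 sec" on SDK 2.44
SECONDS_2_50 = 65.0  # "1 min 5 sec" on SDK 2.50


def elements_per_second(records: int, seconds: float) -> float:
    """Average throughput of a step over its whole wall-clock time."""
    return records / seconds


def slowdown(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times slower the candidate run is than the baseline run."""
    return candidate_seconds / baseline_seconds


if __name__ == "__main__":
    print(f"2.44: {elements_per_second(RECORDS, SECONDS_2_44):.2f} elements/sec")
    print(f"2.50: {elements_per_second(RECORDS, SECONDS_2_50):.2f} elements/sec")
    print(f"read step slowdown: {slowdown(SECONDS_2_44, SECONDS_2_50):.1f}x")
```

On these numbers the read step alone is roughly 22x slower, which is far more than the 2x to 3x seen for the pipeline as a whole.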

 

 



Here are some enhanced troubleshooting steps and considerations:

Small Scale Reproduction:

  • Try running your pipeline with a smaller sample of data to see if the performance issue is reproducible. This can make it easier to isolate and report the issue.

Different Beam Runners:

  • Try running your pipeline with different Beam runners (e.g., DirectRunner, DataflowRunner) to check if the issue is specific to a particular runner.

Environment and Configuration:

  • Ensure the testing environment, machine type, and cluster configuration are consistent for different version tests. Review the pipeline code and configuration for any unintentional changes or settings impacting performance.

Monitoring and Logging:

  • Enable detailed logging and monitoring for the pipeline. Utilize Google Cloud's monitoring and logging tools to analyze logs and metrics for additional insights into the performance bottleneck.

Additional Considerations:

  • Evaluate if the BigQuery table is partitioned, located in a different region, or if there are other jobs running on the same Dataflow cluster. These factors can contribute to performance degradation.

Documentation Review:

  • Check the official documentation and known issues list for Apache Beam and Google Cloud BigQuery for any documented solutions or workarounds for similar performance issues.
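The small-scale reproduction and runner comparison above could be sketched roughly as follows, assuming the Python SDK with the `apache-beam[gcp]` extra installed. `time_runs` and `read_table_once` are hypothetical helper names, and the table and bucket values you pass in are your own; this is a minimal sketch, not a definitive benchmark. Run it once in an environment with SDK 2.44 and once with 2.50, keeping everything else identical.

```python
import time
from typing import Callable, List


def time_runs(run_once: Callable[[], None], repeats: int = 3) -> List[float]:
    """Call run_once `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def read_table_once(table: str, temp_location: str) -> None:
    """Read one table via the Storage Read API and print its row count.

    Assumes apache-beam[gcp] is installed and default credentials are configured.
    """
    # Imported lazily so the timing helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        ["--runner=DirectRunner", f"--temp_location={temp_location}"]
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadTable" >> beam.io.ReadFromBigQuery(
                table=table,
                method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,
            )
            | "CountRows" >> beam.combiners.Count.Globally()
            | "PrintCount" >> beam.Map(print)
        )
```

A call such as `time_runs(lambda: read_table_once("my-project:my_dataset.my_table", "gs://my-bucket/tmp"))` (placeholder names) returns one duration per run. Comparing the median across SDK versions helps separate a real regression from one-off startup noise.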

@ms4446 

Small Scale Reproduction:

That is exactly what I did. As you can see, the dataset is very small (19 records, 13 KB of data read).

Different Beam Runners:

I have tried this option too, and I see the same behaviour with both the DirectRunner and the DataflowRunner.

For the rest of the options I have just one comment: it is the exact same pipeline, including the input data, acting on the same dataset and table. The only thing that differs is the Apache Beam SDK version.

 

 

 

The fact that the performance issue is reproducible on a small scale and with different Beam runners suggests that the issue is most likely related to the Beam SDK itself.

I would recommend that you open a bug report with the Beam team to report the issue. Be sure to include all of the relevant information, such as the Beam SDK version that you are using, the BigQuery table that you are reading from, and the throughput graph images that you have provided.
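If you are on the Python SDK, a small script like the one below can gather version details to paste into the report. This is only a sketch: `package_version` and `environment_report` are hypothetical helper names, and the package list is an assumption about what is relevant, so adjust it to your environment.

```python
import platform
import sys
from importlib import metadata
from typing import Dict, Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect version details worth including in a bug report."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "apache-beam": package_version("apache-beam"),
        # Assumed to be relevant because the read step uses the Storage Read API.
        "google-cloud-bigquery-storage": package_version("google-cloud-bigquery-storage"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it in both the 2.44 and the 2.50 environments also confirms that the SDK version is the only thing that differs between the two tests.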

In the meantime, you may want to consider using Beam SDK version 2.44 until the performance issue in version 2.50 is resolved.

I apologize for the inconvenience that this is causing you.

Thanks very much, @ms4446. I shall open a ticket with Beam support.