Solved: BQ I/O Connector - Storage Read API (Direct table ...

dheerajpanyam · 09-26-2023 08:21 PM

Hello,

We are seeing Dataflow pipelines take 2x to 3x longer to run with Apache Beam SDK 2.50 than with Apache Beam SDK 2.44. While troubleshooting, we compared the DAGs for 2.44 and 2.50. The BigQuery read-from-table step (a full table scan using DIRECT_TABLE_ACCESS) takes 3 sec to read 19 records (13 KB) in 2.44, while the exact same pipeline, reading the same 19 records and 13 KB, takes 1 min 5 sec in 2.50. Has this API degraded in 2.50? The throughput for this DAG step is also much higher in 2.44 than in 2.50. Please find the throughput graphs (elements/sec) for both versions below.

Throughput in ver 2.44 --> 0.15 elements/sec (High)

Throughput in ver 2.50 --> 0.083 elements/sec (Low)

[Images: throughput graphs for Apache Beam ver 2.44 and Apache Beam ver 2.50]
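For reference, the figures quoted in this post can be turned into ratios. The sketch below is only arithmetic on the numbers given above (3 sec versus 1 min 5 sec for the read step, 0.15 versus 0.083 on the throughput graphs); the variable names are illustrative and nothing is measured independently.

```python
# Figures quoted in the post; nothing here is measured independently.
read_step_seconds = {"2.44": 3.0, "2.50": 65.0}  # 1 min 5 sec = 65 s
throughput = {"2.44": 0.15, "2.50": 0.083}       # as read off the graphs

# How much longer the read step takes in 2.50.
step_slowdown = read_step_seconds["2.50"] / read_step_seconds["2.44"]
# How much lower the reported throughput is in 2.50.
throughput_drop = throughput["2.44"] / throughput["2.50"]

print(f"Read step wall time: {step_slowdown:.1f}x longer in 2.50")
print(f"Reported throughput: {throughput_drop:.1f}x lower in 2.50")
```

The two ratios differ considerably, so a bug report should probably state which of the two measurements is being compared.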

ms4446

Here are some enhanced troubleshooting steps and considerations:

Small Scale Reproduction:

Try running your pipeline with a smaller sample of data to see if the performance issue is reproducible. This can make it easier to isolate and report the issue.
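A small-scale reproduction could look roughly like the sketch below. It assumes the Python SDK (the thread does not say which SDK language is in use), `apache-beam[gcp]` installed, and working Google Cloud credentials. The table name and the helper names `summarize` and `run_direct_read` are hypothetical, and extra pipeline options (for example a project) may be needed depending on the environment.

```python
import time


def summarize(sdk_version, seconds):
    """Format one result line for pasting into a bug report."""
    return f"beam={sdk_version} read_seconds={seconds:.2f}"


def run_direct_read(table, pipeline_args=None):
    """Read `table` via the Storage Read API and time the whole pipeline.

    `table` is a placeholder such as "my-project:my_dataset.my_table".
    """
    # Imported lazily so this file can be loaded without Beam installed.
    import apache_beam as beam
    from apache_beam.io import ReadFromBigQuery
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args or [])
    start = time.perf_counter()
    # Leaving the `with` block runs the pipeline and waits for it to finish.
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadTable" >> ReadFromBigQuery(
                table=table,
                method=ReadFromBigQuery.Method.DIRECT_READ,
            )
            | "Count" >> beam.combiners.Count.Globally()
            | "Print" >> beam.Map(lambda n: print(f"rows read: {n}"))
        )
    elapsed = time.perf_counter() - start
    return summarize(beam.__version__, elapsed)
```

Running `run_direct_read("my-project:my_dataset.my_table")` once in an environment with SDK 2.44 and once with 2.50 gives two lines that can be compared directly. Note that the timing covers the whole pipeline as seen from the launching process, not only the read step.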

Different Beam Runners:

Try running your pipeline with different Beam runners (e.g., DirectRunner, DataflowRunner) to check if the issue is specific to a particular runner.
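One way to switch runners without touching the pipeline code is to change only the command-line options. The helper below is a hypothetical sketch: `--runner`, `--project`, `--region`, and `--temp_location` are standard Beam pipeline options, but the function name and all values shown are placeholders.

```python
def runner_args(runner, project=None, region=None, temp_location=None):
    """Build pipeline arguments for the given runner."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        # Dataflow needs to know where to run and where to stage files.
        required = {
            "project": project,
            "region": region,
            "temp_location": temp_location,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"DataflowRunner needs: {', '.join(missing)}")
        args += [f"--{name}={value}" for name, value in required.items()]
    return args


print(runner_args("DirectRunner"))
print(runner_args(
    "DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
))
```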

Environment and Configuration:

Ensure the testing environment, machine type, and cluster configuration are consistent for different version tests. Review the pipeline code and configuration for any unintentional changes or settings impacting performance.
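To confirm that two test runs differ only in what is intended, it can help to write each run's settings down and diff them. This is a minimal sketch; the keys and values are made-up placeholders, not settings taken from this thread.

```python
def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Placeholder settings for the two runs being compared.
run_244 = {
    "beam_sdk": "2.44.0",
    "runner": "DataflowRunner",
    "machine_type": "n1-standard-2",
    "region": "us-central1",
    "table": "my-project:my_dataset.my_table",
}
run_250 = dict(run_244, beam_sdk="2.50.0")

print(diff_configs(run_244, run_250))
```

If anything other than the SDK version appears in the diff, the comparison between versions is not like for like.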

Monitoring and Logging:

Enable detailed logging and monitoring for the pipeline. Utilize Google Cloud's monitoring and logging tools to analyze logs and metrics for additional insights into the performance bottleneck.
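On the launcher side, a simple starting point is Python's standard `logging` module plus a timer around the code that runs the pipeline. The sketch below uses only the standard library; the logger name and the `log_duration` helper are hypothetical, and the `time.sleep` call stands in for the real pipeline run. It measures wall-clock time in the launching process, not the per-step metrics that Dataflow reports.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("bq_read_timing")


@contextmanager
def log_duration(step_name, sink=None):
    """Log how long the wrapped block took; optionally record it in `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", step_name, elapsed)
        if sink is not None:
            sink[step_name] = elapsed


timings = {}
with log_duration("read_table", timings):
    time.sleep(0.01)  # stand-in for running the pipeline
```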

Additional Considerations:

Evaluate if the BigQuery table is partitioned, located in a different region, or if there are other jobs running on the same Dataflow cluster. These factors can contribute to performance degradation.

Documentation Review:

Check the official documentation and known issues list for Apache Beam and Google Cloud BigQuery for any documented solutions or workarounds for similar performance issues.

dheerajpanyam

@ms4446

Small Scale Reproduction:

That is exactly what I did. As you can see, the dataset is very small (19 records and 13 KB of data read).

Different Beam Runners:

I have tried this option as well, and I see the same behaviour with both DirectRunner and DataflowRunner.

For the rest of the options I have just one comment: it is the exact same pipeline, including the input data, acting on the same dataset and table. The only difference is the Apache Beam SDK version.

ms4446

The fact that the performance issue is reproducible on a small scale and with different Beam runners suggests that the issue is most likely related to the Beam SDK itself.

I would recommend that you open a bug report with the Beam team to report the issue. Be sure to include all of the relevant information, such as the Beam SDK version that you are using, the BigQuery table that you are reading from, and the throughput graph images that you have provided.

In the meantime, you may want to consider using Beam SDK version 2.44 until the performance issue in version 2.50 is resolved.
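If you do stay on 2.44 for now, a quick check that the environment really has the pinned version can prevent accidental upgrades. This sketch assumes the Python SDK, whose package name on PyPI is `apache-beam`; the exact patch release `2.44.0` and the helper names are assumptions.

```python
from importlib import metadata

PINNED = "2.44.0"  # assumed patch release of the 2.44 line


def installed_version(package="apache-beam"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pinned(installed, pinned=PINNED):
    """True only when the installed version matches the pin exactly."""
    return installed == pinned


found = installed_version()
print(f"apache-beam installed: {found}, matches pin {PINNED}: {is_pinned(found)}")
```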

I apologize for the inconvenience that this is causing you.

dheerajpanyam

Thanks very much, @ms4446. I shall open a ticket with Beam support.

BQ I/O Connector - Storage Read API (Direct table access) - Performance degrade seen in Apache B2.50