
Apache Beam local and Dataflow pipelines run much slower after upgrade to 2.50

We are seeing increased execution times (3x and 4x) after upgrading from Apache Beam 2.44 to Apache Beam 2.50. This happens both in local development (Direct Runner) and on Dataflow on GCP. We are using Dataflow with the Java SDK.

The impact is larger with the Direct Runner, where execution time increased 3x after moving to Apache Beam 2.50.

I can only see the GKE label, not a Dataflow one (there seems to be a bug), even though this question is about Dataflow.

2 REPLIES

This is possibly due to a latency issue with your API. Also, there appears to be a known issue, currently in progress, that affects versions 2.47 and above; see here.

Mitigation

Until Beam 2.51.0 is released, consider any of the following workarounds:

  • Use apache-beam==2.46.0 or below.


Please see the link above for the other workarounds listed, and the 2.51 release milestone here to stay updated. Alternatively, you can file a support case with Google Support for further investigation: https://cloud.google.com/contact

 

@nceniza FYI:

1. The memory leak is related to the Python SDK. We are using the Java SDK.

2. No third-party API is involved in fetching data. The Dataflow job reads a BigQuery table (the input source, with 19 records) in both cases, 2.44 and 2.50. In summary, the Dataflow job is exactly the same in all respects under 2.44 and 2.50, including the input source and output destination.
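
For anyone trying to reproduce the comparison, below is a minimal sketch of how the same job could be timed under each Beam version. It uses only the JDK; `RunTimer` and its methods are hypothetical names for illustration, not part of Beam, and the workload in `main` is a placeholder. The assumption is that the real job would be wrapped in a `Runnable` that builds the pipeline and blocks until it completes (for example via `pipeline.run().waitUntilFinish()`), then run once per Beam version with everything else unchanged.

```java
import java.util.Arrays;

/** Hypothetical helper: times a repeatable job and reports the median duration. */
final class RunTimer {

    /** Median of the durations; for an even count, the lower middle value. */
    static long median(long[] durationsNanos) {
        if (durationsNanos.length == 0) {
            throw new IllegalArgumentException("need at least one measurement");
        }
        long[] sorted = durationsNanos.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    /** Runs the job the given number of times; returns each wall-clock duration in ns. */
    static long[] time(Runnable job, int runs) {
        long[] durations = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            job.run();
            durations[i] = System.nanoTime() - start;
        }
        return durations;
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption for this sketch). In a real
        // comparison this would build the pipeline and block until it finishes.
        Runnable job = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long[] durations = time(job, 5);
        System.out.printf("median: %.3f ms%n", median(durations) / 1_000_000.0);
    }
}
```

Reporting the median over several runs rather than a single run should make the 2.44 vs 2.50 numbers less sensitive to JVM warm-up and one-off startup costs, which matter for a job with only 19 input records.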

 
