
Dataflow: Latency in 'Kafka to table in CloudSQL db' pipeline

Hi.

I have a Dataflow pipeline (in Java) that reads messages from a Kafka topic and writes row updates to Cloud SQL (Postgres).

The pipeline doesn't perform any aggregations, and the transformation steps take 4-5 milliseconds. According to the metrics, reading a record from the Kafka topic takes 400-500 ms, but Dataflow reports a processing latency of 8-10 seconds. I can't work out where the time is spent. How can I achieve sub-5-second latency?

For the JDBC write I use batching and sharding (JdbcIO.write().withBatchSize(...).withAutoSharding()). Connection pooling is also enabled, and the Cloud SQL instance has enough resources.
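For reference, the write step is wired up roughly like this (the record type MyRecord, the table, the SQL, and the connection URL are simplified placeholders, not the actual production code):

import org.apache.beam.sdk.io.jdbc.JdbcIO;

// 'rows' is a PCollection<MyRecord>; MyRecord is a hypothetical record type.
rows.apply("WriteToCloudSql",
    JdbcIO.<MyRecord>write()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver",
                "jdbc:postgresql://<host>:5432/<db>"))
        .withStatement(
            "INSERT INTO my_table (id, value) VALUES (?, ?) "
                + "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value")
        .withPreparedStatementSetter((MyRecord r, java.sql.PreparedStatement ps) -> {
            ps.setLong(1, r.getId());
            ps.setString(2, r.getValue());
        })
        .withBatchSize(1000L)   // rows buffered per executeBatch() call
        .withAutoSharding());   // let the runner choose the number of writers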


Hi @iistomin,

Welcome to Google Cloud Community!

A significant contributor to latency appears to be Kafka message consumption (400-500 ms per record). This suggests the bottleneck lies in the communication between your Dataflow pipeline and the Kafka brokers. Note that, by design, Dataflow prioritizes data-processing latency over CPU utilization and tries to keep data freshness below 10 seconds, so the 8-10 seconds you observe is within its default target.
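If the fetch itself is the slow part, it may be worth reviewing the consumer settings KafkaIO passes to the Kafka client. One plausible knob is fetch.max.wait.ms, whose Kafka default of 500 ms caps how long the broker holds a fetch open when little data is buffered, which would line up with the 400-500 ms reads you're seeing on a quiet topic. A rough sketch (broker address, topic, and deserializers are placeholders):

import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

pipeline.apply("ReadFromKafka",
    KafkaIO.<String, String>read()
        .withBootstrapServers("<KafkaBrokerHost>:<KafkaBrokerPort>")
        .withTopic("<YourTopic>")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(Map.<String, Object>of(
            // Kafka default is 500 ms; lowering it trades broker load
            // for lower fetch latency on lightly loaded topics.
            ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 100,
            ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1)));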

If your Kafka topic contains a large volume of data or the messages are particularly large, consider strategies like compaction or reducing the message size to minimize the time spent fetching data. You can also check partition sizes with the kafka-log-dirs.sh script:

/bin/kafka-log-dirs.sh --describe --bootstrap-server <KafkaBrokerHost>:<KafkaBrokerPort> --topic-list <YourTopic>

If you really need sub-5-second latency, consider using Cloud Pub/Sub as the message broker instead of Kafka. Pub/Sub integrates tightly with Dataflow and can potentially provide lower latency.

This page describes best practices for reading from Pub/Sub in Dataflow.
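If you do try Pub/Sub, swapping the source in a Java pipeline is a small change. A minimal sketch, with the subscription path as a placeholder:

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;

pipeline.apply("ReadFromPubSub",
    PubsubIO.readStrings()
        .fromSubscription("projects/<project>/subscriptions/<subscription>"));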

I hope the above information is helpful.

Thank you for the answer. I will check the options you suggested.