
Resolving Datastream small file issue

Good day

I have a Datastream stream ingesting tables into Cloud Storage (4,000+ tables at a 60-second file rotation interval).

So, as you can imagine, I am generating a very large number of small gzipped JSON files, millions per day.

I want to perform data lake maintenance, similar to compaction when using Iceberg tables.

Is there a simple way to do this? Meaning, to reduce many small JSON files into a few large ones on a regular basis?


Has anyone dealt with this issue in the past?

Hi @Stev0198,

If you're looking to manage and compact your gzipped JSON files, Google Cloud has some options that can help:

  • Google Cloud Dataflow: A flexible, fully managed platform for building data processing pipelines, handling both streaming and batch workloads. You can set up a Dataflow pipeline using Apache Beam that reads those small JSON files, merges them, and writes them back out as larger files (see the sketch after this list). Check out the Apache Beam documentation for more info on building your pipelines.
  • Google Cloud Storage Transfer Service: Storage Transfer Service automates the transfer of data to, from, and between different storage systems. It’s designed to move large amounts of data quickly and reliably, and the best part is that you don’t need to write any code to use it.
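To make the Dataflow option more concrete, here is a minimal Python sketch using the Apache Beam SDK. The project ID, bucket paths, table name, file suffix, and shard count are placeholders, not taken from your setup, so adjust them to match how Datastream lays out files in your bucket. The pipeline simply reads the small gzipped JSON files for one table and rewrites the same records as a handful of larger gzipped shards.

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Placeholder project, region, and bucket names; replace with your own.
    options = PipelineOptions(
        runner="DataflowRunner",      # use "DirectRunner" to test locally
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # ReadFromText auto-detects the .gz extension and decompresses,
            # yielding one line (one JSON record) per element.
            | "ReadSmallFiles" >> beam.io.ReadFromText(
                "gs://my-bucket/datastream/table_a/*.jsonl.gz")
            # Rewrite the same records as a fixed, small number of larger
            # gzipped shards instead of one tiny file per rotation interval.
            | "WriteCompactedFiles" >> beam.io.WriteToText(
                "gs://my-bucket/compacted/table_a/part",
                file_name_suffix=".jsonl.gz",
                num_shards=10,
                compression_type=CompressionTypes.GZIP,
            )
        )


if __name__ == "__main__":
    run()

In practice you would parameterize the input path per table and trigger the batch job on a regular schedule (for example with Cloud Scheduler or Cloud Composer) so the compaction keeps up with the files Datastream produces.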

I hope the above information is helpful.

