Good day
I have a Datastream stream ingesting tables into Cloud Storage (4,000+ tables at a 60-second file rotation interval).
So, as you can imagine, I am generating a very large number of small gzipped JSON files. Millions per day.
I want to perform data lake maintenance, similar to compaction when using Iceberg tables.
Is there a simple way to do this, i.e. to reduce many small JSON files into a few large ones on a regular basis?
Hi @Stev0198,
A high ingestion rate of small .gz JSON files into Cloud Storage leads to slow queries, heavier metadata management, and higher storage costs. This is typical for data lakes with high-frequency intake, and regular file compaction is necessary to keep them operating efficiently.
To handle millions of small .gz JSON files in Cloud Storage, you can compact them into larger files using Dataflow (Apache Beam). You may try the following approach:
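As a minimal sketch (not a production pipeline), here is a Beam batch job in Python that reads one table's small gzipped JSON-lines files for a closed time partition and rewrites them as a small, fixed number of large gzipped shards. The project, region, bucket, paths, table name, and shard count are placeholders you would need to replace, and the job would be scheduled per table/partition (e.g. via Cloud Scheduler or Composer).

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Placeholder pipeline options -- adjust project, region, and temp bucket
    # to your environment.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/dataflow-temp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            # ReadFromText decompresses .gz input automatically based on the
            # file extension, so each element is one JSON line.
            | "ReadSmallFiles" >> beam.io.ReadFromText(
                "gs://my-bucket/datastream/orders/2024/05/01/*.jsonl.gz"
            )
            # Rewriting with a fixed, small shard count turns millions of tiny
            # files into a handful of large gzipped ones.
            | "WriteCompacted" >> beam.io.WriteToText(
                "gs://my-bucket/compacted/orders/2024-05-01/part",
                file_name_suffix=".jsonl.gz",
                compression_type=CompressionTypes.GZIP,
                num_shards=10,  # tune to hit your target output file size
            )
        )


if __name__ == "__main__":
    run()
```

Run it only over partitions that Datastream has finished writing to, and once a compaction run succeeds you can delete or archive the original small files (for example with an Object Lifecycle Management rule on the raw prefix).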
This will improve query efficiency and reduce storage overhead. Hope this helps!