Good day
I have a Datastream stream ingesting tables into Cloud Storage (4,000+ tables at a 60-second file rotation interval).
So, as you can imagine, I am generating a very large number of small gzipped JSON files. Millions per day.
I want to perform data lake maintenance, similar to compaction when using Iceberg tables.
Is there a simple way to do this, i.e. to reduce many small JSON files into a few large ones on a regular basis?
Hi @Stev0198,
A high ingestion rate of small .gz JSON files into Cloud Storage leads to slow queries, heavier metadata management, and higher storage costs. This is typical for data lakes with high-frequency intake, and regular file compaction is necessary to keep them operating efficiently.
To handle millions of small .gz JSON files in Cloud Storage, you can compact them into larger files using Dataflow (Apache Beam). You may try the following approach:
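As a minimal sketch (not a production pipeline), here is a Beam batch job in Python that reads one table's small gzipped JSON-lines files for a closed time partition and rewrites them as a small, fixed number of large gzipped shards. The project, region, bucket, paths, table name, and shard count are placeholders you would need to replace, and the job would be scheduled per table/partition (e.g. via Cloud Scheduler or Composer).

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Placeholder pipeline options -- adjust project, region, and temp bucket
    # to your environment.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/dataflow-temp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            # ReadFromText decompresses .gz input automatically based on the
            # file extension, so each element is one JSON line.
            | "ReadSmallFiles" >> beam.io.ReadFromText(
                "gs://my-bucket/datastream/orders/2024/05/01/*.jsonl.gz"
            )
            # Rewriting with a fixed, small shard count turns millions of tiny
            # files into a handful of large gzipped ones.
            | "WriteCompacted" >> beam.io.WriteToText(
                "gs://my-bucket/compacted/orders/2024-05-01/part",
                file_name_suffix=".jsonl.gz",
                compression_type=CompressionTypes.GZIP,
                num_shards=10,  # tune to hit your target output file size
            )
        )


if __name__ == "__main__":
    run()
```

Run it only over partitions that Datastream has finished writing to, and once a compaction run succeeds you can delete or archive the original small files (for example with an Object Lifecycle Management rule on the raw prefix).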
This will improve query efficiency and reduce storage overhead. Hope this helps!