Good day
I have a Datastream stream ingesting tables into Cloud Storage (4000+ tables at a 60-second file rotation interval).
So, as you can imagine, I am generating a very large number of small gzipped JSON files: millions per day.
I want to perform data lake maintenance, similar to compaction when using Iceberg tables.
Is there a simple way to do this, i.e. to consolidate many small JSON files into a few large ones on a regular basis?
Anyone dealt with this issue in the past?
Hi @Stev0198,
If you're looking to manage and compact your gzipped JSON files, Cloud Storage and the tooling around it give you a few options; one lightweight approach is sketched below.
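One option is a scheduled job that calls the Cloud Storage compose API to merge batches of small objects server-side. Compose accepts at most 32 source objects per call, and a gzip file made of concatenated gzip members is still a valid gzip stream, so composed newline-delimited JSON usually stays readable (do verify that your downstream readers handle multi-member gzip). The Python sketch below uses the google-cloud-storage client; the bucket name, prefixes, the `.jsonl.gz` filter, and the decision to delete the source files afterwards are assumptions you would adapt to your Datastream layout.

```python
from google.cloud import storage

CHUNK = 32  # compose() accepts at most 32 source objects per call


def compact_prefix(bucket_name: str, src_prefix: str, dst_prefix: str) -> None:
    """Merge small gzipped NDJSON objects under src_prefix into larger ones.

    Uses server-side compose, so nothing is downloaded or re-uploaded.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Hypothetical layout: finished Datastream files under src_prefix,
    # named *.jsonl.gz. Adjust the filter to match your actual naming.
    blobs = sorted(
        (b for b in client.list_blobs(bucket_name, prefix=src_prefix)
         if b.name.endswith(".jsonl.gz")),
        key=lambda b: b.name,
    )

    for i in range(0, len(blobs), CHUNK):
        batch = blobs[i:i + CHUNK]
        if len(batch) < 2:
            break  # nothing left worth merging
        dst_name = f"{dst_prefix}compacted-{batch[0].name.rsplit('/', 1)[-1]}"
        dst = bucket.blob(dst_name)
        dst.content_type = "application/gzip"
        dst.compose(batch)      # merge the batch into one larger object
        for b in batch:         # optionally remove the originals
            b.delete()


if __name__ == "__main__":
    # Placeholder bucket and prefixes.
    compact_prefix("my-datastream-bucket", "raw/orders/", "compacted/orders/")
```

With 4000+ tables you would run something like this per table prefix on a schedule (for example Cloud Scheduler triggering Cloud Run or Cloud Functions), or use a Dataflow/Spark batch job instead if you would rather rewrite the data into larger Parquet or Iceberg files rather than just concatenating it.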
I hope the above information is helpful.