I have a persistent volume that contains a lot of small files (>10 million). I've noticed that this has a detrimental effect on mount time, and my pods will occasionally hang until they time out.
I have tried using premium-rwo drives, but this doesn't seem to have helped at all.
My use case is an ETL workflow, so I can think of a few options. I could split the workload across many smaller drives and process each one separately. Or I could tar all of the files and hope the filesystem handles that better.
Wondering if anyone has done anything similar or has any other suggestions? Thanks.
Instead of relying solely on block storage (RWO disks), it's worth considering alternatives like Google Filestore or parallel file systems such as Lustre or BeeGFS. Google Filestore, particularly its High Scale tier, is optimized for metadata-heavy workloads, making it a good fit for scenarios involving millions of files. Unlike an RWO disk, it is an NFS share that supports ReadWriteMany access, so multiple pods can mount and read it in parallel, which can significantly improve throughput for certain use cases. One downside is that Filestore tends to be more expensive than local SSDs, so weigh the cost against your specific needs.
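If you do stay on a PV, this is roughly what a Filestore-backed RWX volume looks like on GKE with the Filestore CSI driver enabled. A minimal sketch only: the object names (`filestore-rwx`, `etl-data`), the `tier` value, and the capacity are placeholders you would adjust for your cluster.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: filestore-rwx
provisioner: filestore.csi.storage.gke.io
parameters:
  tier: standard       # tier name is an assumption; pick the Filestore tier available in your project/region
  network: default     # assumes the default VPC network
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etl-data
spec:
  accessModes:
    - ReadWriteMany    # many pods can mount the same share concurrently
  storageClassName: filestore-rwx
  resources:
    requests:
      storage: 1Ti     # standard-tier Filestore instances start at 1 TiB
```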
For large-scale ETL workflows or other data-intensive operations, parallel file systems like Lustre or BeeGFS are also worth a look. These distributed, high-performance systems are designed to handle millions of files efficiently and are well suited to environments that need massive scalability and throughput. Google Cloud offers managed options for this class of file system as well. Ultimately, the right storage solution depends on your workload requirements, performance goals, and budget.
In summary, to buttress what @jayeshmahajan wrote: the best approach is to move to GCS for better scalability. If a PV is required, use Filestore.
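For the GCS route on GKE specifically, here is a hedged sketch of a pod that mounts a bucket through the Cloud Storage FUSE CSI driver instead of a PV. `etl-worker`, `etl-ksa`, `my-etl-image`, and `my-etl-bucket` are placeholders, and the Kubernetes service account is assumed to be bound (via Workload Identity) to a Google service account with access to the bucket.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etl-worker
  annotations:
    gke-gcsfuse/volumes: "true"      # tells GKE to inject the Cloud Storage FUSE sidecar
spec:
  serviceAccountName: etl-ksa        # assumed to have IAM access to the bucket via Workload Identity
  containers:
    - name: worker
      image: my-etl-image:latest     # placeholder ETL image
      volumeMounts:
        - name: gcs-data
          mountPath: /data           # the bucket's objects appear under /data
  volumes:
    - name: gcs-data
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName: my-etl-bucket  # placeholder bucket name
          mountOptions: implicit-dirs
```

With this setup there is no volume to attach or fsck at pod start, so the many-small-files mount delay goes away; the trade-off is that each file read becomes an object GET, so batch or tar your smallest files if per-file latency starts to dominate.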