
Scaling up writing CSV files on GCS

Hi everybody,
I am running some tests on a GCS bucket to see whether increasing the number of processes writing to the bucket in parallel reduces the total write time.
For example, with a trivial write process (Python + the google.cloud.storage library) that I can replicate on OpenShift hundreds of times, I noticed the following:

1) 300 pods, each writing 50 files of 10 MB = 150 GB (15,000 files): about 25 minutes
2) 600 pods, each writing 25 files of 10 MB = 150 GB (15,000 files): about 25 minutes
3) 900 pods, each writing 17 files of 10 MB = 153 GB (15,300 files): about 25 minutes

Moreover, I followed the naming best practices and generated the file names with hashes, so every object name is different.
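For reference, each pod runs roughly the following (a simplified sketch; the bucket name, file count, and file size are placeholders):

```python
# Simplified per-pod writer (names and sizes are placeholders).
import os
import uuid

from google.cloud import storage

BUCKET_NAME = "my-test-bucket"    # placeholder bucket name
FILES_PER_POD = 50                # e.g. 50 files in the 300-pod case
FILE_SIZE = 10 * 1024 * 1024      # 10 MB per file

client = storage.Client()         # one client per pod
bucket = client.bucket(BUCKET_NAME)
payload = os.urandom(FILE_SIZE)   # dummy 10 MB payload

for _ in range(FILES_PER_POD):
    # Hash-like random prefix so object names are spread over the keyspace,
    # following the GCS object-naming best practices.
    blob_name = f"{uuid.uuid4().hex}.csv"
    bucket.blob(blob_name).upload_from_string(payload)
```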

Anyway, I did not notice any improvement in the total times; it does not scale up as I expected.

Do you have any suggestions?

 

1 ACCEPTED SOLUTION

Your single-pod throughput seems to be ~0.3 MB/s in your first test, and your aggregate bandwidth across the 300 pods is about 100 MB/s. I would expect about that much from a single process writing 10 MB files, so maybe there is something else going on. Can you report what you get when you do this on a single pod? Can you try much larger files (1 GB) to test for any startup or overhead effects? Also, any chance you can log timestamps around the actual upload calls to isolate startup issues?
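For example, something along these lines would separate client setup from the actual upload time (a rough sketch; the bucket name and logger setup are illustrative):

```python
# Sketch: time the upload call itself, separately from client creation.
import logging
import time

from google.cloud import storage

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("upload-timing")

t0 = time.monotonic()
client = storage.Client()                    # auth + client setup
bucket = client.bucket("my-test-bucket")     # placeholder bucket name
log.info("client setup took %.3fs", time.monotonic() - t0)

def timed_upload(blob_name: str, data: bytes) -> None:
    start = time.monotonic()
    bucket.blob(blob_name).upload_from_string(data)
    log.info("upload of %s took %.3fs", blob_name, time.monotonic() - start)
```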

One more thought: if you spin up the process from scratch for every file, you are re-authenticating every time. That would cost you 0.1 to 1 s per invocation. So if you want to do this fast, make sure to reuse the same client instance (or credentials and connection, really) in every pod.
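In other words, something like this (a sketch with a placeholder bucket name): create the client once per pod and reuse it for every upload, instead of constructing a new client per file.

```python
# Sketch: reuse one client (and its credentials/connection pool) per pod.
# Anti-pattern: calling storage.Client() inside the per-file loop, which
# re-authenticates on every invocation.
from google.cloud import storage

client = storage.Client()                    # authenticate once per pod
bucket = client.bucket("my-test-bucket")     # placeholder bucket name

def upload_files(files: dict) -> None:
    """Upload {object_name: bytes} pairs, reusing the same client."""
    for name, data in files.items():
        bucket.blob(name).upload_from_string(data)
```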


