We are looking for a storage solution with high IOPS, like SSD disks. Our workload runs inside Cloud Run and is read/write intensive. The options we have tried so far, a GCS bucket and Firestore, have high latency and slow read/write performance, and we don't want to use Cloud SQL. Is there any way to integrate high-performance storage with a Cloud Run workload? Any suggestions and help are highly appreciated.
The workload does data processing, moving data through a data lake on GCP.
As part of the data lake we process data inside a container on Cloud Run. Raw files arrive in a GCS bucket, and pulling a file from GCS into the Cloud Run container takes a long time, almost an hour. Because Cloud Run is bound by a request timeout, the job gets killed after one hour, and pulls of larger files (30 GB+) into the container fail. We are looking for storage with better download speed inside the Cloud Run container so this file-size limit issue can be sorted out.
PS: Max file size is approximately 60 GB.
Let me know if you have any questions.
@KirTitievsky, can you please help here?
Hm... 60 GB is a lot, but not enormous. Say you have a 0.5 GB/s download speed. That means you spend 2 of the 5 minutes of the default Run request timeout downloading it. Assuming your output is the same size, you might spend the same time uploading. Uploading 60 GB quickly may take some careful parallelization, especially if it's not going back to GCS. That leaves you with up to a minute to process, so timing may be a problem.
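For the download side, recent versions of the google-cloud-storage library have transfer_manager helpers that fetch ranged chunks of a single object in parallel, which usually gets much closer to the available network throughput than a single-stream download. A minimal sketch, with hypothetical bucket/object names and a local path under /tmp:

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket("my-raw-bucket")          # hypothetical bucket name
blob = bucket.get_blob("incoming/big-file.dat")  # get_blob() also fetches the object size

# Download the object in 64 MiB slices with 8 parallel workers.
transfer_manager.download_chunks_concurrently(
    blob,
    "/tmp/big-file.dat",
    chunk_size=64 * 1024 * 1024,
    max_workers=8,
)
```

There is an upload_chunks_concurrently counterpart in the same module (built on the XML multipart upload API), so the same idea applies to writing the result back out.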
You also need to store this file somewhere. 60 GB probably goes to disk. So we need to make sure you are not running out of disk space on your Run instance.
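A small check you can run before the download starts, so the instance fails fast instead of dying mid-transfer; keep in mind that on Cloud Run the writable filesystem is backed by instance memory, so free space under /tmp is bounded by the memory you provision (the path and threshold below are just placeholders):

```python
import logging
import shutil

log = logging.getLogger("pipeline")

def check_scratch_space(path: str = "/tmp", needed_gb: float = 60.0) -> None:
    """Raise before downloading if the scratch location cannot hold the file."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    log.info("free space at %s: %.1f GB", path, free_gb)
    if free_gb < needed_gb:
        raise RuntimeError(f"only {free_gb:.1f} GB free at {path}, need {needed_gb:.0f} GB")
```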
All in all, you are pushing Run relatively hard here, so the detailed design of your code matters.
How about we start with:
- sharing specific errors and issues you are seeing.
- measuring and sharing the timing of the processing steps (download, process, upload) by adding timers and log statements to the code (see the sketch after this list).
- explaining how you manage RAM and disk to make sure 60 GB does not overflow either one.
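For the timing point, something as simple as a context manager around each step is enough; the log lines will show up in Cloud Logging (step and function names below are placeholders for whatever your code actually does):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

@contextmanager
def timed(step: str):
    """Log how long a named step takes."""
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("%s took %.1f s", step, time.monotonic() - start)

# Hypothetical usage around your existing steps:
# with timed("download"):
#     pull_file_from_gcs(...)
# with timed("process"):
#     transform(...)
# with timed("upload"):
#     push_results(...)
```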
Another thought: if the data is structured (e.g. JSON, CSV, Avro), you may be better off just loading it into BigQuery, even if only to process and export it. You could also define an external table over this data and have BigQuery process the incoming files without moving them.
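A sketch of the external-table idea (project, dataset, format, and URIs below are all assumptions; swap in whatever matches your files):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Point an external table at the raw files sitting in GCS.
table = bigquery.Table("my-project.my_dataset.raw_files_ext")
external_config = bigquery.ExternalConfig("CSV")  # or "NEWLINE_DELIMITED_JSON", "AVRO", ...
external_config.source_uris = ["gs://my-raw-bucket/incoming/*.csv"]
external_config.autodetect = True
table.external_data_configuration = external_config

client.create_table(table, exists_ok=True)

# BigQuery then reads the files in place; process and export with plain SQL, e.g.
#   CREATE TABLE my_dataset.cleaned AS SELECT ... FROM my_dataset.raw_files_ext
```

That way nothing 60 GB in size ever has to pass through the Cloud Run container at all.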