
Import multiple large image files into cloud storage

I am looking for a way to import multiple 200 MB+ image files into a folder of a storage bucket. My problem is that I don't control the source and have to deal with their exports, which I will get via an API call.

My go-to approach would be a Cloud Function that requests the images and then stores them in a bucket. Once the upload of all files has finished, I would like to run a Cloud Run service to analyze said images for objects. The code is in Python, and I don't want to deal with Dataflow or Vertex AI unless there is no way around it.

My concern is the maximum-runtime limit of Cloud Functions, given that the upload will have to happen over the internet.

To sum it up:

I'm looking for either a way to parallelize the upload of multiple files from a single API call, or any other ideas that would help me stay under the one-hour limit of Cloud Functions. Buckets are used because multiple teams will be working with the same raw data, and they are most familiar with Python.
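
For context, this is roughly the shape of what I have in mind inside the function. It is only a rough sketch; the provider URLs, bucket name, object names, and worker count are placeholders:

```python
# Rough sketch only: request each export from the provider and upload it to
# the bucket, several files at a time. All names and URLs are placeholders.
import tempfile
from concurrent.futures import ThreadPoolExecutor

import requests
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")  # placeholder bucket name

def copy_one(url: str, object_name: str) -> None:
    """Download one export to a temporary file, then upload it to Cloud Storage."""
    with tempfile.NamedTemporaryFile() as tmp:
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            for part in resp.iter_content(chunk_size=1024 * 1024):
                tmp.write(part)
        tmp.flush()
        bucket.blob(object_name).upload_from_filename(tmp.name)

def import_exports(exports: dict[str, str]) -> None:
    """exports maps target object names to provider download URLs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(copy_one, url, name) for name, url in exports.items()]
        for future in futures:
            future.result()  # re-raise any per-file failure
```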

Thank you


Hi @Aftermath5428,

Welcome to Google Cloud Community!

I would suggest the following approaches, along with what each one offers:

  • Resumable uploads 
    • Resumable uploads are the recommended method for uploading large files, because you don't have to restart them from the beginning if there is a network failure while the upload is underway.
  • Streaming uploads 
    • This is useful when you want to upload data but don't know the final size at the start of the upload, such as when generating the upload data from a process, or when compressing an object on-the-fly.

For resumable uploads, please take note of the following (a short code sketch follows the list):

  • The Python client library uses a buffer size that's equal to the chunk size. 100 MiB is the default buffer size used for a resumable upload, and you can change the buffer size by setting the blob.chunk_size property.
  • To always perform a resumable upload regardless of object size, use the class storage.BlobWriter or the method storage.Blob.open(mode='w'). For these methods, the default buffer size is 40 MiB. You can also use Resumable Media to manage resumable uploads.
  • The chunk size must be a multiple of 256 KiB (256 x 1024 bytes). Larger chunk sizes typically make uploads faster, but note that there's a tradeoff between speed and memory usage.
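
Here is a minimal sketch of a chunked resumable upload with the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
from google.cloud import storage

# Chunk size must be a multiple of 256 KiB; 10 MiB keeps memory usage modest
# compared to the 100 MiB default buffer.
CHUNK_SIZE = 10 * 1024 * 1024

client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")   # placeholder bucket name
blob = bucket.blob("imports/scene_001.tif")    # placeholder object name

# Setting chunk_size makes the client send the file in chunks over a
# resumable session, so an interrupted chunk can be retried without
# restarting the whole upload from the beginning.
blob.chunk_size = CHUNK_SIZE
blob.upload_from_filename("/tmp/scene_001.tif")
```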

For streaming uploads, please check the Cloud Storage documentation on streaming uploads and the client library sample code.
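
As an illustration only, a minimal streaming sketch with the Python client could look like this; the URL, bucket name, and object name are placeholders:

```python
import requests
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")   # placeholder bucket name
blob = bucket.blob("imports/scene_001.tif")    # placeholder object name

# Stream the provider's HTTP response straight into the bucket without
# knowing the final size up front.
with requests.get("https://provider.example.com/export/scene_001", stream=True) as resp:
    resp.raise_for_status()
    # blob.open("wb") returns a BlobWriter; it buffers 40 MiB by default
    # (chunk_size= overrides this) and uploads each buffer as it fills.
    with blob.open("wb", chunk_size=10 * 1024 * 1024) as writer:
        for part in resp.iter_content(chunk_size=1024 * 1024):
            writer.write(part)
```

With this pattern, the memory held at any moment is only the writer's buffer plus the small HTTP chunks, so memory usage stays bounded regardless of file size.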

Hope this helps.

Thank you for the reply. The thing with resumable uploads is that I don't know how many files there will be, and I don't even know their sizes. I will be downloading satellite data at least once a month. I define the area in my request and the provider does the rest in its response.

So I would lean towards streaming rather than resumable uploads. The biggest unknown is still the size of the total download and the size of each chunk of that download. I still believe that streaming the data is the way to go.

The other thing I am concerned about is the time it will take to download all the images, i.e. the whole catalog.

Cloud Functions v2 should have a max timeout of 1 hour. I hope that this is enough. Is there anything memory-related that I would have to take care of when streaming? 


thank you