Hi there,
I have a CSV file of around 10GB stored in a GCP bucket. I perform basic operations on this file, such as:
These are relatively lightweight formatting tasks and not intensive data processing.
Currently, I am using a Cloud Function with the following configuration:
Is this setup sufficient for such tasks? Or should I consider using Dataflow or another service?
What would you recommend as a best practice for handling such operations?
Looking forward to your suggestions!
Hi Samvardhan,
Welcome to the Google Cloud Community!
Based on your use case, Cloud Functions might not be ideal for handling a 10GB CSV file due to their memory limits (8GB) and execution time limits (60 minutes). These constraints could cause performance issues or timeouts when processing large files. Additionally, Cloud Functions are designed for smaller, event-driven tasks and may struggle with large file processing.
For processing large files like your 10GB CSV, I recommend using Dataflow. Dataflow is designed for scalable, parallel processing and can efficiently handle large datasets by splitting the file processing into smaller tasks. It allows you to process large files without running into memory or time limits. You can create a Dataflow pipeline to read, transform, and write the CSV file. Here’s a guide on how to get started with a Dataflow Pipeline.
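To give you an idea of what that pipeline could look like, here is a minimal Apache Beam (Python SDK) sketch. The project, region, bucket paths, and the format_line function are placeholders for your own values and formatting logic, and the simple comma split assumes your CSV has no quoted fields:

# Minimal Apache Beam sketch for a line-by-line CSV formatting job on Dataflow.
# Project, region, bucket paths, and format_line() are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def format_line(line: str) -> str:
    # Example lightweight formatting: trim whitespace around each field.
    # Note: a plain comma split does not handle quoted fields.
    return ",".join(field.strip() for field in line.split(","))

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    region="us-central1",
    temp_location="gs://your-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://your-bucket/input.csv", skip_header_lines=1)
        | "Format" >> beam.Map(format_line)
        | "Write" >> beam.io.WriteToText("gs://your-bucket/output/part", file_name_suffix=".csv")
    )

Dataflow will split the read across workers automatically, so each worker only ever holds a small slice of the 10GB file in memory.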
Alternatively, if you prefer a serverless solution with more control over resources, Cloud Run is a good choice. Cloud Run allows you to containerize your CSV processing code and run it with up to 32GB of memory and longer execution times. It also scales automatically based on demand while giving you finer control over memory and timeouts.
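As a rough sketch, a Cloud Run service could stream the CSV from Cloud Storage line by line, so the full 10GB file never has to fit in memory. The bucket and object names and format_line() below are placeholders, not your actual setup:

# Rough Cloud Run sketch: stream the object instead of downloading it whole.
# Bucket/object names and format_line() are placeholders.
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

def format_line(line: str) -> str:
    # Placeholder formatting step.
    return ",".join(field.strip() for field in line.split(","))

@app.route("/", methods=["POST"])
def process_csv():
    client = storage.Client()
    bucket = client.bucket("your-bucket")
    source = bucket.blob("input.csv")
    target = bucket.blob("output.csv")

    # blob.open() streams the object, so memory stays roughly constant.
    with source.open("r") as reader, target.open("w") as writer:
        for line in reader:
            writer.write(format_line(line.rstrip("\n")) + "\n")

    return "done", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)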
If you still want to use Cloud Functions, you'll need to optimize your approach to avoid hitting memory or time limits. One way is to break the file into smaller chunks and process each chunk in a separate function call. You can also trigger Cloud Functions from file uploads to Cloud Storage to process the file incrementally. However, Cloud Functions are better suited for smaller files, and when working with larger files you may face performance bottlenecks or timeouts.
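If you go that route, here is a sketch of the upload-triggered approach, assuming a 2nd gen Cloud Function (Python) with a Cloud Storage trigger; the output prefix and format_line() are placeholders:

# Sketch of a Cloud Storage-triggered function that streams the uploaded
# object line by line. Output path and format_line() are placeholders.
import functions_framework
from google.cloud import storage

def format_line(line: str) -> str:
    # Placeholder formatting step.
    return ",".join(field.strip() for field in line.split(","))

@functions_framework.cloud_event
def process_upload(cloud_event):
    data = cloud_event.data
    bucket_name = data["bucket"]
    object_name = data["name"]

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    source = bucket.blob(object_name)
    target = bucket.blob(f"formatted/{object_name}")

    with source.open("r") as reader, target.open("w") as writer:
        for line in reader:
            writer.write(format_line(line.rstrip("\n")) + "\n")

Even with streaming, the function still has to finish within its timeout, which is why this pattern works better for smaller files.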
I hope the above information is helpful.
Hey @nikacalupas
Thanks for this information. It is really helpful.
For processing smaller CSV files (around 800 MB to 2 GB), would it be better to use Cloud Functions or Cloud Run, considering both cost and performance factors?
Currently, we are using Cloud Functions with Eventarc triggers, configured with the following resources:
available_memory = "1Gi"
available_cpu = "1"
timeout_seconds = 300
Would there be a better solution for this scenario? Please advise.
Thank you
Hi @Samvardhan ,
Given your current setup and the size of the CSV files you are processing, Cloud Run is likely to offer better performance and cost efficiency. You can continue to use Eventarc triggers with Cloud Run to handle event-driven workloads.
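As a rough illustration, an Eventarc trigger delivers the Cloud Storage event to your Cloud Run service as an HTTP POST in CloudEvents format, so the service only needs to read the bucket and object name from the request body. This is a minimal sketch, not your exact configuration:

# Minimal Cloud Run handler for a Cloud Storage event delivered by Eventarc.
# The actual formatting work would go where the comment indicates.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_event():
    event = request.get_json(silent=True) or {}
    bucket = event.get("bucket")
    name = event.get("name")
    print(f"Received event for gs://{bucket}/{name}")
    # ...stream and format the object here, as in the earlier examples...
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)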
Hey @nikacalupas
Thanks for this information. This is really helpful
Hello,
I have the same issue.