
Best Practices for Handling Large CSV Files in GCP using Cloud Function

Hi there,

I have a CSV file of around 10GB stored in a GCP bucket. I perform basic operations on this file, such as:

  • Adding a row_count column
  • Renaming columns
  • Formatting the CSV into a proper structure

These are relatively lightweight formatting tasks and not intensive data processing.
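
For reference, the transformation I have in mind is roughly the following (bucket paths and column names are placeholders, and this naive pandas version loads the whole file into memory, which is exactly what worries me at 10GB):

import pandas as pd

# Placeholder paths and columns; reading gs:// URIs directly needs gcsfs installed.
df = pd.read_csv("gs://my-bucket/input.csv")

# Add a simple 1-based row_count column.
df["row_count"] = range(1, len(df) + 1)

# Rename columns to the target schema (example mapping only).
df = df.rename(columns={"old_name": "new_name"})

df.to_csv("gs://my-bucket/output.csv", index=False)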

Currently, I am using a Cloud Function with the following configuration:

  • Memory: 8GB
  • CPU: 2

Is this setup sufficient for such tasks? Or should I consider using Dataflow or another service?
What would you recommend as a best practice for handling such operations?

Looking forward to your suggestions!


Hi Samvardhan,

Welcome to the Google Cloud Community!

Based on your use case, Cloud Functions might not be ideal for handling a 10GB CSV file due to their memory limits (8GB) and execution time limits (60 minutes). These constraints could cause performance issues or timeouts when processing large files. Additionally, Cloud Functions are designed for smaller, event-driven tasks and may struggle with large file processing.

For processing large files like your 10GB CSV, I recommend using Dataflow. Dataflow is designed for scalable, parallel processing and can efficiently handle large datasets by splitting the file processing into smaller tasks. It allows you to process large files without running into memory or time limits. You can create a Dataflow pipeline to read, transform, and write the CSV file. Here’s a guide on how to get started with a Dataflow Pipeline.
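
As a rough sketch only (project ID, region, buckets, and the output header below are placeholders, and the per-line clean-up is illustrative), a minimal Beam pipeline for this kind of reformatting could look like:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def reformat(line: str) -> str:
    # Placeholder per-line transform: split the CSV line, tidy the fields, re-join.
    # Note: a strictly sequential row_count column is awkward in a parallel pipeline,
    # since Beam gives no ordering guarantees; a UUID or a post-processing step is a
    # more natural fit for that particular column.
    fields = [f.strip() for f in line.split(",")]
    return ",".join(fields)

def run():
    options = PipelineOptions(
        runner="DataflowRunner",             # "DirectRunner" for local testing
        project="my-project",                # placeholder project ID
        region="us-central1",                # placeholder region
        temp_location="gs://my-bucket/tmp",  # placeholder bucket
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv",
                                             skip_header_lines=1)
            | "Clean" >> beam.Map(reformat)
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output",
                                             file_name_suffix=".csv",
                                             header="new_col_1,new_col_2")
        )

if __name__ == "__main__":
    run()

Dataflow shards the input and output automatically, so you don't have to manage memory for the 10GB file yourself.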

Alternatively, if you prefer a serverless solution with more control over resources, Cloud Run is a good choice. Cloud Run allows you to containerize your CSV processing code and run it with up to 32GB of memory and longer execution times. It also scales automatically based on demand while giving you finer control over memory and execution time.

If you still want to use Cloud Functions, you'll need to optimize your approach to avoid hitting memory or time limits. One way is to break the file into smaller chunks and process each chunk in a separate function call. You can also trigger Cloud Functions from file uploads to Cloud Storage to process the file incrementally. However, Cloud Functions are better suited for smaller files, and when working with larger files you may face performance bottlenecks or timeouts.
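
To make that concrete, here is a hedged sketch of the streaming/chunked approach, assuming a 2nd gen function triggered by a Cloud Storage finalize event and using the google-cloud-storage and pandas libraries (the output bucket, paths, column mapping, and chunk size are placeholders you would tune to your data):

import functions_framework
import pandas as pd
from google.cloud import storage

CHUNK_ROWS = 100_000  # tune so each chunk stays well under the memory limit

@functions_framework.cloud_event
def process_csv(cloud_event):
    # Triggered by a Cloud Storage object-finalized event.
    data = cloud_event.data
    client = storage.Client()
    src = client.bucket(data["bucket"]).blob(data["name"])
    dst = client.bucket("my-output-bucket").blob(f"clean/{data['name']}")  # placeholder

    # Stream both objects instead of loading the whole file into memory.
    with src.open("r") as reader, dst.open("w") as writer:
        row_offset = 0
        for i, chunk in enumerate(pd.read_csv(reader, chunksize=CHUNK_ROWS)):
            chunk = chunk.rename(columns={"old_name": "new_name"})  # example mapping
            chunk["row_count"] = range(row_offset + 1, row_offset + len(chunk) + 1)
            row_offset += len(chunk)
            chunk.to_csv(writer, index=False, header=(i == 0))

Even with streaming, the whole file still passes through a single function invocation, so the execution time limit remains the hard constraint for very large files.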

I hope the above information is helpful.

Hey @nikacalupas,

Thanks for this information. It is really helpful. 

For processing smaller CSV files (around 800 MB to 2 GB), would it be better to use Cloud Functions or Cloud Run, considering both cost and performance factors?

Currently, we are using Cloud Functions with Eventarc triggers, configured with the following resources:

available_memory = "1Gi"
available_cpu = "1"
timeout_seconds = 300

Would there be a better solution for this scenario? Please advise.

Thank you 

 

Hi @Samvardhan ,

Given your current setup and the size of the CSV files you are processing, Cloud Run is likely to offer better performance and cost efficiency. You can continue to use Eventarc triggers with Cloud Run to handle event-driven workloads.
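
As a minimal sketch of the Cloud Run side, assuming Eventarc delivers the Cloud Storage object-finalized event as an HTTP POST to the service (the route, processing function, and port handling below are placeholders, not a complete service):

import os
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_gcs_event():
    # Eventarc delivers the Cloud Storage event payload as JSON in the request body.
    event = request.get_json(silent=True) or {}
    bucket = event.get("bucket")
    name = event.get("name")
    if not bucket or not name:
        return ("Bad Request: missing bucket/name", 400)

    process_csv(bucket, name)  # reuse the same chunked pandas logic as above
    return ("", 204)

def process_csv(bucket: str, name: str) -> None:
    # Placeholder for the actual CSV transformation.
    pass

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

The processing code itself can stay largely the same as in your Cloud Function; the main differences are the container packaging and the higher memory and timeout ceilings you can configure on the service.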

Hey @nikacalupas  

Thanks for this information. This is really helpful

Hello, I have the same issue.