Need help with optimizing GCS backup using Dataflow (10TB+ bucket, tar + gzip approach)

Hi guys, I'm a beginner to cloud in general, and I'm trying to back up a very large GCS bucket (over 10TB) using Dataflow. My goal is to optimize storage by first tarring the whole bucket, then gzipping the tar file, and finally uploading the resulting tar.gz to a destination GCS bucket.

However, GCS doesn't have actual folders or directories, which makes the tar approach difficult: I need to stream the objects on the fly into a temporary tar file and then upload that file to the destination.

The challenge is dealing with disk space and memory limitations on each VM instance. Obviously, I can't store the entire 10TB on a single VM, so I've been exploring parallel VMs for this task, but I'm a bit confused about how to implement that approach and worried about race conditions between parallel VMs. So I'm now considering vertical scaling on a single VM instead; does that sound like a good solution? (Update: to simplify things, I benchmarked one VM with 8 vCPUs, 32GB of memory, and a 1TB SSD: creating a .tar of a 2.5GB folder took 47s, and tar.gz compression shrank a similar 2.5GB folder to about 100MB.)
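For reference, here is a rough sketch of the streaming approach I'm considering, assuming the Python google-cloud-storage client (bucket and archive names are placeholders, not a tested script):

```python
# Rough sketch: stream every object in a source bucket into one .tar.gz
# in a destination bucket without staging 10TB on local disk.
# Bucket names and the archive name are placeholders.
import tarfile

from google.cloud import storage

client = storage.Client()
dst_bucket = client.bucket("my-backup-bucket")  # placeholder
archive = dst_bucket.blob("backup/full-backup.tar.gz")

# blob.open("wb") gives a file-like resumable-upload stream, and
# tarfile mode "w|gz" writes a gzip-compressed tar to a non-seekable
# stream, so nothing larger than a chunk is buffered in memory.
with archive.open("wb") as out, tarfile.open(fileobj=out, mode="w|gz") as tar:
    for blob in client.list_blobs("my-source-bucket"):  # placeholder
        info = tarfile.TarInfo(name=blob.name)
        info.size = blob.size            # size is known from the listing
        with blob.open("rb") as reader:  # streams the object's bytes
            tar.addfile(info, fileobj=reader)
```

My worry is that this is single-threaded, so throughput is bounded by one VM's network and gzip speed, which is why I was looking at parallel VMs in the first place.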

Has anyone implemented something similar, or can provide insights on how to tackle this challenge efficiently?

Any tips or advice would be greatly appreciated! Thanks in advance.

Solved
1 ACCEPTED SOLUTION

Hi @chocopiesogood,

Welcome to the Google Cloud community!

There are a few key aspects and best practices from the Google Cloud documentation that you should consider.

Using Dataflow for Parallel Processing: Dataflow is a fully managed service for scalable, parallelized processing. You can use Apache Beam (the SDK behind Dataflow) to process files in parallel across multiple worker VMs. To parallelize, group objects logically by their names (i.e., the "path" part of the URL, such as a common prefix) and let Beam distribute those groups across workers.
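For illustration, here's a minimal Beam sketch of that idea in Python, assuming you shard the archive by prefix so each worker writes its own .tar.gz (the bucket names and prefix list are placeholders, not a tested pipeline):

```python
# Minimal sketch: shard the backup by object prefix so each Dataflow
# worker archives one prefix into its own .tar.gz object.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def archive_prefix(prefix):
    """Tar + gzip every object under `prefix` into one archive object."""
    # Imports live inside the function so they resolve on remote workers.
    import tarfile
    from google.cloud import storage

    client = storage.Client()
    dst = client.bucket("my-backup-bucket")  # placeholder
    name = "backup/" + prefix.strip("/").replace("/", "_") + ".tar.gz"
    archive = dst.blob(name)
    # "w|gz" streams a gzip-compressed tar, so no local staging is needed.
    with archive.open("wb") as out, tarfile.open(fileobj=out, mode="w|gz") as tar:
        for blob in client.list_blobs("my-source-bucket", prefix=prefix):
            info = tarfile.TarInfo(name=blob.name)
            info.size = blob.size  # known from the listing
            with blob.open("rb") as reader:
                tar.addfile(info, fileobj=reader)
    return name


prefixes = ["logs/", "images/", "exports/"]  # placeholder shard keys

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | "Prefixes" >> beam.Create(prefixes)
     | "Spread" >> beam.Reshuffle()  # distribute prefixes across workers
     | "Archive" >> beam.Map(archive_prefix)
     | "Report" >> beam.Map(print))
```

Because each prefix produces its own archive object, no two workers ever write to the same file, which sidesteps the race-condition concern from your question.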

Best practices when writing to Cloud Storage:
- Avoid setting a specific number of shards; this lets the system automatically choose the best value for your scale.
- Cloud Storage can handle a very large number of requests per second, but writes are more efficient when each write is larger (1KB or more).
- Use non-sequential file names to distribute the load more evenly; for details, see the guidelines on using naming conventions to spread the load (a small sketch of this follows below).
- Avoid the "@" symbol followed by a number, or an asterisk ("*"), in file names, as these are reserved for sharding purposes.
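As a concrete illustration of the naming tip, one common pattern is to prepend a short hash so object names are non-sequential (bucket and paths here are placeholders, just a sketch):

```python
# Sketch: add a short hash prefix so object names are non-sequential,
# which spreads write load across Cloud Storage's index ranges.
import hashlib

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-backup-bucket")  # placeholder


def upload(path, data):
    # Sequential names like "2024/01/01/file-0001" create a hot spot;
    # a hash prefix like "a1b2c3-2024/01/01/file-0001" spreads the load.
    prefix = hashlib.md5(path.encode()).hexdigest()[:6]
    bucket.blob(f"{prefix}-{path}").upload_from_string(data)


upload("2024/01/01/file-0001", b"payload")
```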

These and other recommendations are collected in the Cloud Storage best practices documentation, which you can use as a quick reference when building applications that rely on Cloud Storage.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
