I have a huge table in my Postgres Cloud SQL instance. It has an `insertion_timestamp` column along with ~400 other columns, and the table is ~1 TB. I want to use Datastream to migrate the data from Postgres into GCS in Avro format, but I also need to partition it on an hourly basis in the GCS location: date-wise partitions, and hourly partitions inside each date. How can we achieve this?
Step 1: Set up Datastream for PostgreSQL to GCS Migration
Datastream provides an efficient way to capture changes from PostgreSQL and stream them to Cloud Storage.
Enable Datastream API:
Go to the Google Cloud Console and enable the Datastream API for your project (or run `gcloud services enable datastream.googleapis.com`).
Create Datastream Connection Profiles:
Create a source profile for PostgreSQL that connects to your Cloud SQL for PostgreSQL instance.
Create a destination profile for GCS, specifying the bucket where the Avro files will be stored. Both profiles can also be created programmatically, as in the sketch below.
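If you prefer to script these steps rather than use the console, here is a minimal sketch using the `google-cloud-datastream` Python client. Everything here is a placeholder to adapt: `PROJECT`, `REGION`, the profile IDs, hostname, and credentials; the IP-allowlist connectivity shown is just one of the supported networking options.

```python
# Minimal sketch using the google-cloud-datastream client
# (pip install google-cloud-datastream). PROJECT, REGION, hostnames,
# and credentials are placeholders, not real values.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/PROJECT/locations/REGION"

# Source profile: the Cloud SQL for PostgreSQL instance.
pg_profile = datastream_v1.ConnectionProfile(
    display_name="pg-source",
    postgresql_profile=datastream_v1.PostgresqlProfile(
        hostname="10.0.0.3",  # Cloud SQL instance IP (placeholder)
        port=5432,
        username="datastream_user",
        password="CHANGE_ME",
        database="mydb",
    ),
    # IP-allowlist connectivity; swap for private connectivity if needed.
    static_service_ip_connectivity=datastream_v1.StaticServiceIpConnectivity(),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="pg-source",
    connection_profile=pg_profile,
).result()

# Destination profile: the GCS bucket that will hold the Avro files.
gcs_profile = datastream_v1.ConnectionProfile(
    display_name="gcs-dest",
    gcs_profile=datastream_v1.GcsProfile(bucket="my-bucket", root_path="/datastream"),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="gcs-dest",
    connection_profile=gcs_profile,
).result()
```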
Create a Datastream Stream:
Create a stream to migrate the data from your PostgreSQL database to GCS.
Ensure that you choose the Avro format for the data output in GCS (see the sketch below).
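Continuing the same sketch, creating the stream with Avro output might look like the following. The publication and replication slot are assumptions that must already exist on the PostgreSQL side, and the schema/table names are placeholders:

```python
# Continuation of the sketch above; assumes the connection profiles
# "pg-source" and "gcs-dest" already exist, and that the publication
# and replication slot were created on PostgreSQL beforehand.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/PROJECT/locations/REGION"

stream = datastream_v1.Stream(
    display_name="pg-to-gcs",
    source_config=datastream_v1.SourceConfig(
        source_connection_profile=f"{parent}/connectionProfiles/pg-source",
        postgresql_source_config=datastream_v1.PostgresqlSourceConfig(
            publication="datastream_pub",        # placeholder
            replication_slot="datastream_slot",  # placeholder
            # Only pull the one big table (placeholder names).
            include_objects=datastream_v1.PostgresqlRdbms(
                postgresql_schemas=[
                    datastream_v1.PostgresqlSchema(
                        schema="public",
                        postgresql_tables=[
                            datastream_v1.PostgresqlTable(table="my_big_table")
                        ],
                    )
                ]
            ),
        ),
    ),
    destination_config=datastream_v1.DestinationConfig(
        destination_connection_profile=f"{parent}/connectionProfiles/gcs-dest",
        gcs_destination_config=datastream_v1.GcsDestinationConfig(
            avro_file_format=datastream_v1.AvroFileFormat(),  # Avro, not JSON
        ),
    ),
    backfill_all=datastream_v1.Stream.BackfillAllStrategy(),  # backfill existing rows
)
client.create_stream(parent=parent, stream_id="pg-to-gcs", stream=stream).result()
```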
Step 2: Partition Data in GCS (by Date and Hour)
Datastream does write its GCS output into datetime-based folders, but those reflect when Datastream processed the events, not the row's insertion_timestamp. Since Datastream does not provide event-time partitioning out of the box, you will need to reprocess the data to partition it by insertion_timestamp (date and hour) after it has been streamed to GCS.
Option 1: Using Google Cloud Functions
You can trigger a Cloud Function whenever a new Avro file lands in your GCS bucket. The Cloud Function can read the Avro file, extract the insertion_timestamp from its records, and write them to the appropriate partitions in GCS. (Grouping records is safer than moving the whole file, because a single file can contain rows from more than one hour.)
Cloud Function Trigger:
Set up a trigger on the GCS bucket where the Avro files are being written (use a separate bucket or prefix for the partitioned output so the function does not retrigger on its own writes).
The trigger can activate a Cloud Function every time a new Avro file is finalized in GCS, e.g. by deploying with `gcloud functions deploy ... --trigger-bucket=BUCKET`.
Cloud Function Logic:
Inside the function, read the Avro file.
Extract the insertion_timestamp from each record in the Avro file.
Use the insertion_timestamp to create the appropriate folder structure (e.g., YYYY/MM/DD/HH/).
Write the records to the corresponding partition paths in the bucket, as in the sketch after this list.
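Here is a minimal sketch of that function, assuming a 2nd-gen Python Cloud Function with a CloudEvent storage trigger and the fastavro and google-cloud-storage packages. `DEST_BUCKET` is a placeholder (it must differ from the source bucket to avoid retriggering), and the assumption that Datastream nests the row data under a "payload" field should be verified against your actual Avro output:

```python
# Sketch of the repartitioning Cloud Function (Python, 2nd gen).
# Assumes: pip install functions-framework fastavro google-cloud-storage.
# DEST_BUCKET and the "payload" field layout are assumptions to verify.
import io

import fastavro
import functions_framework
from google.cloud import storage

DEST_BUCKET = "my-partitioned-bucket"  # placeholder; not the source bucket


@functions_framework.cloud_event
def repartition_avro(cloud_event):
    data = cloud_event.data
    src_bucket, name = data["bucket"], data["name"]
    if not name.endswith(".avro"):  # skip non-Avro objects
        return

    client = storage.Client()
    raw = client.bucket(src_bucket).blob(name).download_as_bytes()
    reader = fastavro.reader(io.BytesIO(raw))
    records = list(reader)
    if not records:
        return

    # Group records by the date/hour of insertion_timestamp. Datastream
    # nests the row under "payload" (assumption; adjust to your schema),
    # and fastavro decodes Avro logical timestamps to datetime objects.
    by_partition = {}
    for rec in records:
        ts = rec["payload"]["insertion_timestamp"]
        by_partition.setdefault(ts.strftime("%Y/%m/%d/%H"), []).append(rec)

    # Write one Avro file per YYYY/MM/DD/HH partition, reusing the
    # writer schema of the source file.
    schema = reader.writer_schema
    basename = name.rsplit("/", 1)[-1]
    for partition, recs in by_partition.items():
        buf = io.BytesIO()
        fastavro.writer(buf, schema, recs)
        client.bucket(DEST_BUCKET).blob(f"{partition}/{basename}").upload_from_string(
            buf.getvalue()
        )
```

Since the function loads each file fully into memory, keep Datastream's file rotation size comfortably below the function's memory limit.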