I have a huge table in my Postgres Cloud SQL instance. It has an `insertion_timestamp` column along with ~400 other columns, and the table is ~1 TB. I want to use Datastream to migrate the data from Postgres into GCS in Avro format, but I also need to partition it on an hourly basis in the GCS location: date-wise partitions, and hourly partitions inside each date. How can we achieve this?
Step 1: Set up Datastream for PostgreSQL to GCS Migration
Datastream provides an efficient way to capture changes from PostgreSQL and stream them to Cloud Storage.
Enable Datastream API:
Go to the Google Cloud Console and enable the Datastream API for your project (or run `gcloud services enable datastream.googleapis.com`).
Create Datastream Connection Profiles:
Create a source profile for PostgreSQL that connects to your Cloud SQL for PostgreSQL instance.
Create a destination profile for GCS, specifying the bucket where the Avro files will be stored. Both profiles can also be created programmatically, as in the sketch below.
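If you prefer to script these steps rather than use the console, here is a minimal sketch using the `google-cloud-datastream` Python client. Everything here is a placeholder to adapt: `PROJECT`, `REGION`, the profile IDs, hostname, and credentials; the IP-allowlist connectivity shown is just one of the supported networking options.

```python
# Minimal sketch using the google-cloud-datastream client
# (pip install google-cloud-datastream). PROJECT, REGION, hostnames,
# and credentials are placeholders, not real values.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/PROJECT/locations/REGION"

# Source profile: the Cloud SQL for PostgreSQL instance.
pg_profile = datastream_v1.ConnectionProfile(
    display_name="pg-source",
    postgresql_profile=datastream_v1.PostgresqlProfile(
        hostname="10.0.0.3",  # Cloud SQL instance IP (placeholder)
        port=5432,
        username="datastream_user",
        password="CHANGE_ME",
        database="mydb",
    ),
    # IP-allowlist connectivity; swap for private connectivity if needed.
    static_service_ip_connectivity=datastream_v1.StaticServiceIpConnectivity(),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="pg-source",
    connection_profile=pg_profile,
).result()

# Destination profile: the GCS bucket that will hold the Avro files.
gcs_profile = datastream_v1.ConnectionProfile(
    display_name="gcs-dest",
    gcs_profile=datastream_v1.GcsProfile(bucket="my-bucket", root_path="/datastream"),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="gcs-dest",
    connection_profile=gcs_profile,
).result()
```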
Create a Datastream Stream:
Create a stream to migrate the data from your PostgreSQL database to GCS.
Ensure that you choose the Avro format for the data output in GCS (see the sketch below).
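Continuing the same sketch, creating the stream with Avro output might look like the following. The publication and replication slot are assumptions that must already exist on the PostgreSQL side, and the schema/table names are placeholders:

```python
# Continuation of the sketch above; assumes the connection profiles
# "pg-source" and "gcs-dest" already exist, and that the publication
# and replication slot were created on PostgreSQL beforehand.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/PROJECT/locations/REGION"

stream = datastream_v1.Stream(
    display_name="pg-to-gcs",
    source_config=datastream_v1.SourceConfig(
        source_connection_profile=f"{parent}/connectionProfiles/pg-source",
        postgresql_source_config=datastream_v1.PostgresqlSourceConfig(
            publication="datastream_pub",        # placeholder
            replication_slot="datastream_slot",  # placeholder
            # Only pull the one big table (placeholder names).
            include_objects=datastream_v1.PostgresqlRdbms(
                postgresql_schemas=[
                    datastream_v1.PostgresqlSchema(
                        schema="public",
                        postgresql_tables=[
                            datastream_v1.PostgresqlTable(table="my_big_table")
                        ],
                    )
                ]
            ),
        ),
    ),
    destination_config=datastream_v1.DestinationConfig(
        destination_connection_profile=f"{parent}/connectionProfiles/gcs-dest",
        gcs_destination_config=datastream_v1.GcsDestinationConfig(
            avro_file_format=datastream_v1.AvroFileFormat(),  # Avro, not JSON
        ),
    ),
    backfill_all=datastream_v1.Stream.BackfillAllStrategy(),  # backfill existing rows
)
client.create_stream(parent=parent, stream_id="pg-to-gcs", stream=stream).result()
```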
Step 2: Partition Data in GCS (by Date and Hour)
Datastream does write its GCS output into datetime-based folders, but those reflect when Datastream processed the events, not the row's insertion_timestamp. Since Datastream does not provide event-time partitioning out of the box, you will need to reprocess the data to partition it by insertion_timestamp (date and hour) after it has been streamed to GCS.
Option 1: Using Google Cloud Functions
You can trigger a Cloud Function whenever a new Avro file lands in your GCS bucket. The Cloud Function can read the Avro file, extract the insertion_timestamp from its records, and write them to the appropriate partitions in GCS. (Grouping records is safer than moving the whole file, because a single file can contain rows from more than one hour.)
Cloud Function Trigger:
Set up a trigger on the GCS bucket where the Avro files are being written (use a separate bucket or prefix for the partitioned output so the function does not retrigger on its own writes).
The trigger can activate a Cloud Function every time a new Avro file is finalized in GCS, e.g. by deploying with `gcloud functions deploy ... --trigger-bucket=BUCKET`.
Cloud Function Logic:
Inside the function, read the Avro file.
Extract the insertion_timestamp from each record in the Avro file.
Use the insertion_timestamp to create the appropriate folder structure (e.g., YYYY/MM/DD/HH/).
Write the records to the corresponding partition paths in the bucket, as in the sketch after this list.
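Here is a minimal sketch of that function, assuming a 2nd-gen Python Cloud Function with a CloudEvent storage trigger and the fastavro and google-cloud-storage packages. `DEST_BUCKET` is a placeholder (it must differ from the source bucket to avoid retriggering), and the assumption that Datastream nests the row data under a "payload" field should be verified against your actual Avro output:

```python
# Sketch of the repartitioning Cloud Function (Python, 2nd gen).
# Assumes: pip install functions-framework fastavro google-cloud-storage.
# DEST_BUCKET and the "payload" field layout are assumptions to verify.
import io

import fastavro
import functions_framework
from google.cloud import storage

DEST_BUCKET = "my-partitioned-bucket"  # placeholder; not the source bucket


@functions_framework.cloud_event
def repartition_avro(cloud_event):
    data = cloud_event.data
    src_bucket, name = data["bucket"], data["name"]
    if not name.endswith(".avro"):  # skip non-Avro objects
        return

    client = storage.Client()
    raw = client.bucket(src_bucket).blob(name).download_as_bytes()
    reader = fastavro.reader(io.BytesIO(raw))
    records = list(reader)
    if not records:
        return

    # Group records by the date/hour of insertion_timestamp. Datastream
    # nests the row under "payload" (assumption; adjust to your schema),
    # and fastavro decodes Avro logical timestamps to datetime objects.
    by_partition = {}
    for rec in records:
        ts = rec["payload"]["insertion_timestamp"]
        by_partition.setdefault(ts.strftime("%Y/%m/%d/%H"), []).append(rec)

    # Write one Avro file per YYYY/MM/DD/HH partition, reusing the
    # writer schema of the source file.
    schema = reader.writer_schema
    basename = name.rsplit("/", 1)[-1]
    for partition, recs in by_partition.items():
        buf = io.BytesIO()
        fastavro.writer(buf, schema, recs)
        client.bucket(DEST_BUCKET).blob(f"{partition}/{basename}").upload_from_string(
            buf.getvalue()
        )
```

Since the function loads each file fully into memory, keep Datastream's file rotation size comfortably below the function's memory limit.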