
Pub/Sub direct to BigQuery problem

Greetings, my name is Farizi and I'm a data engineer. I've been trying the new Pub/Sub feature that writes directly to BigQuery, but with no success. My current flow is:

  1. Create a Datastream stream with a GCS destination profile and a Cloud SQL for MySQL source profile
  2. Stream MySQL CDC data (Avro) and store it in GCS
  3. Create a Pub/Sub topic with binary encoding
  4. Create a Pub/Sub notification on the GCS bucket
  5. Create a BigQuery write subscription on the topic
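For context, steps 3 and 5 above can be sketched with the Pub/Sub client library. This is a minimal sketch, not my exact setup: it assumes the google-cloud-pubsub package, Application Default Credentials, and a schema that was already created; every project, topic, table, and schema name below is a placeholder.

```python
# Sketch: create a schema-validated topic and a BigQuery write
# subscription. Requires `pip install google-cloud-pubsub` plus
# Application Default Credentials; all resource names are placeholders.
from google.cloud import pubsub_v1

PROJECT = "my-project"
TOPIC = f"projects/{PROJECT}/topics/datastream-topic"
SUB = f"projects/{PROJECT}/subscriptions/datastream-to-bq"
TABLE = "my-project:my_dataset.my_table"  # placeholder BigQuery table

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# Topic whose messages are validated against a previously created
# Avro schema, using binary encoding.
publisher.create_topic(request={
    "name": TOPIC,
    "schema_settings": {
        "schema": f"projects/{PROJECT}/schemas/my-avro-schema",
        "encoding": "BINARY",
    },
})

# BigQuery subscription that writes each message into the table,
# mapping message fields according to the topic schema.
subscriber.create_subscription(request={
    "name": SUB,
    "topic": TOPIC,
    "bigquery_config": {
        "table": TABLE,
        "use_topic_schema": True,
    },
})
```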

So far, the Datastream Avro files are already stored in GCS, I have defined the schema manually for the Pub/Sub topic, and I have created the BigQuery table too. Still, the data is not arriving in the table, even though I have chosen to use the topic schema in the subscription.

Is there any documentation for using this new feature together with Datastream? Please answer ASAP, because this feature is excellent and could replace Dataflow's ability to stream into BigQuery.

Solved
1 ACCEPTED SOLUTION

Hi Farizi,

I might be missing something from your description, but Pub/Sub notifications from GCS only carry the path to the file in GCS, not the actual payload. To get the data into BQ, you need a Dataflow job to process the data from GCS and load it into BQ. There's a detailed guide to accomplish this, available here: https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics
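To illustrate the point: a GCS notification message carries only object metadata (bucket, object name, size, and so on) as a JSON payload, plus attributes such as eventType and objectId. The minimal sketch below shows what a subscriber would actually receive from such a notification; the sample payload is illustrative, not a real message.

```python
import json

# Illustrative GCS OBJECT_FINALIZE notification payload: object
# *metadata* only -- the Avro file contents are not included.
sample_data = json.dumps({
    "kind": "storage#object",
    "bucket": "my-datastream-bucket",
    "name": "mysql-cdc/2023/01/output-00000.avro",
    "size": "10485760",
    "contentType": "application/octet-stream",
})
sample_attributes = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "my-datastream-bucket",
    "objectId": "mysql-cdc/2023/01/output-00000.avro",
}

def gcs_uri_from_notification(data: str) -> str:
    """Return the gs:// path of the object the notification refers to."""
    meta = json.loads(data)
    return f"gs://{meta['bucket']}/{meta['name']}"

# A downstream job (e.g. Dataflow) must read this path to get the rows.
print(gcs_uri_from_notification(sample_data))
# → gs://my-datastream-bucket/mysql-cdc/2023/01/output-00000.avro
```

Writing these notifications straight to BigQuery would only give you a table of file paths, which is why a Dataflow job is needed in between.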

Etai


4 REPLIES


Ah, that's a shame. I thought this feature could replace Dataflow, but thanks for the quick reply!

Hi Etai, I've followed the tutorial you gave me, and here is the new flow:

  1. Create a GCS bucket
  2. Create a Pub/Sub topic, a notification on the GCS bucket, and a subscription
  3. Create a Datastream stream with a GCS destination profile and a Cloud SQL for MySQL source profile
  4. Stream MySQL CDC data (Avro) and store it in GCS
  5. Create a Dataflow template job (Datastream to BigQuery)
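For anyone following along, step 5 is launched with the Datastream-to-BigQuery flex template. The sketch below only assembles the equivalent gcloud command rather than running it; the parameter names reflect my reading of the template guide linked above and should be verified against the current template, and all bucket, project, and dataset names are placeholders.

```python
# Assemble (not execute) a `gcloud dataflow flex-template run` command
# for the Datastream-to-BigQuery template. All resource names are
# placeholders; check the template docs for the authoritative
# parameter list before running this for real.
params = {
    "inputFilePattern": "gs://my-datastream-bucket/mysql-cdc/",
    "gcsPubSubSubscription": "projects/my-project/subscriptions/gcs-notifications",
    "inputFileFormat": "avro",
    "outputStagingDatasetTemplate": "staging_dataset",
    "outputDatasetTemplate": "replica_dataset",
    "deadLetterQueueDirectory": "gs://my-datastream-bucket/dlq/",
}

command = " ".join([
    "gcloud dataflow flex-template run datastream-to-bq",
    "--region=us-central1",
    "--template-file-gcs-location="
    "gs://dataflow-templates/latest/flex/Cloud_Datastream_to_BigQuery",
    "--parameters=" + ",".join(f"{k}={v}" for k, v in params.items()),
])
print(command)
```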

The data is successfully loaded into the BigQuery dataset, but all of the data types have been converted to STRING in the BigQuery columns. Note that the Datastream output is Avro files in GCS. Is there any step I have missed to keep the BigQuery schema the same as the source?

Hi Farizi, it doesn't sound like you missed any step, but maybe something wasn't done exactly right? Off the top of my head, I can't think of a reason why this would happen. If you can, I suggest opening a support ticket so that we can properly investigate.

Etai
