I am using the Dataflow MongoDB CDC to BigQuery template to create a pipeline that listens to MongoDB change streams and loads the new data into BigQuery. I am using Debezium Server to read the change stream and push those events to Pub/Sub, which Dataflow then consumes to insert or update the records in BigQuery.
The pipeline connects and runs properly, and I am able to push the change stream events (Mongo records) to Pub/Sub, but the process fails at the Dataflow stage.
The error states: "the destination table has no schema".
I created the table in BigQuery with no schema because the MongoDB collection has several columns and is dynamic in nature; the column count keeps increasing every other month.
The MongoDB to BigQuery Dataflow batch template works fine when the destination table has no schema.
I assumed that the Dataflow pipeline would detect the schema for BigQuery by default. Attaching the Dataflow error snippet below:
The main issue with your Dataflow pipeline is a mismatch between the dynamic nature of your MongoDB collection's schema and the Dataflow MongoDB CDC to BigQuery template's expectations. In MongoDB, the schema can evolve over time, with new fields being added to documents as needed. However, the Dataflow template assumes a fixed schema when writing to BigQuery, which becomes problematic when the destination BigQuery table is created without a predefined schema.
When you attempt to insert data into a BigQuery table without a schema, the Dataflow pipeline fails because it requires schema information to properly validate and process the incoming data. This issue is compounded by the differences in flexibility between batch and streaming processing. While batch jobs may handle schema changes with more leniency, streaming jobs, particularly those using CDC, are typically more rigid, necessitating a consistent schema.
To address this issue, several strategies can be employed:
Define an Initial Schema in BigQuery: Creating the BigQuery table with a basic schema that covers the essential fields can resolve the immediate error (a minimal example follows this list). However, this approach requires manual updates to the schema as new fields are added to the MongoDB collection, making it less scalable over time.
Implement Schema Detection: A more dynamic solution involves implementing a custom schema detection step within your Dataflow pipeline. Tools like Apache Beam's Schema class can be used to analyze the incoming MongoDB documents, infer the current schema, and update the BigQuery table accordingly (a simplified pipeline sketch follows this list). While this approach is more complex to implement, it provides greater flexibility in handling evolving schemas.
Leverage BigQuery's Schema Auto-Detection: Although BigQuery offers schema auto-detection capabilities, this feature is generally more applicable to batch jobs and might not be reliable for streaming pipelines dealing with rapidly changing schemas (a batch-load example follows this list). Caution is advised if using this approach, as it can lead to unexpected results.
Explore Flexible Schema Formats: Using data formats like Avro or JSON, which naturally support schema evolution, can also be beneficial. These formats allow for more flexibility in handling dynamic data, though they may require additional configuration within the pipeline.
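For the first option, here is a minimal sketch of creating the destination table with a small core schema using the google-cloud-bigquery client. The project, dataset, table, and field names are placeholders, and the `payload` JSON column is just one way to park fields you have not modelled as columns yet:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Core fields only; anything not modelled yet can be parked in a JSON column.
schema = [
    bigquery.SchemaField("_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("updated_at", "TIMESTAMP"),
    bigquery.SchemaField("payload", "JSON"),
]

table = bigquery.Table("my-project.my_dataset.my_collection", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists
```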
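For the schema-detection option, a simplified Beam sketch is shown below, with hypothetical Pub/Sub and BigQuery names. It infers a schema dict from a representative sample document at pipeline-construction time; truly per-element schema evolution would still need extra work (for example, patching the table via the BigQuery API before writing):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def infer_bq_schema(document):
    """Build a BigQuery schema dict from a flat sample document."""
    type_map = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT", str: "STRING"}
    return {
        "fields": [
            {"name": name, "type": type_map.get(type(value), "STRING"), "mode": "NULLABLE"}
            for name, value in document.items()
        ]
    }


def run():
    # A representative sample document; in practice you might pull this from
    # the collection itself before launching the pipeline.
    sample_doc = {"_id": "", "updated_at": "", "amount": 0.0}

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/mongo-cdc-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:my_dataset.my_collection",
                schema=infer_bq_schema(sample_doc),
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```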
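For completeness, auto-detection applies to batch load jobs rather than the streaming template. A minimal sketch with hypothetical bucket and table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # let BigQuery infer column names and types
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/mongo-export/*.json",  # hypothetical export location
    "my-project.my_dataset.my_collection",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```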
Thanks for the input.
For the above situation, even assuming there is no change in the schema of the MongoDB collection, the pipeline still does not output the data when we create the BigQuery table with the required columns. The Debezium connector listens to the MongoDB change stream and pushes those events to Pub/Sub, but the provided Dataflow template for MongoDB CDC to BigQuery is unable to interpret the messages passed through Pub/Sub and flatten them into the required schema of the provided BigQuery table.
Also, since I am using the template, I feel there is less flexibility in the optional parameters we can add, such as schema auto-detection, which could be included if I wrote a custom Dataflow script.
I would generally prefer to use the template, as there are several collections that need to be pushed into BigQuery and flattened.
In your situation, even if the MongoDB schema remains unchanged, the Dataflow MongoDB CDC to BigQuery template might still have trouble processing the data. This is because the template doesn’t automatically flatten or transform the Pub/Sub messages to fit the schema of your BigQuery table.
The Dataflow template is designed to make the process easier but doesn’t offer much flexibility for complex tasks like flattening nested data structures or adjusting schemas. So, even if you create a BigQuery table with all the required columns, the template might not correctly insert the data because it doesn't reshape the data from Pub/Sub to match your BigQuery schema.
Since you’re working with multiple collections that need to be flattened and prefer using the template for its ease of use, one solution is to add a preprocessing step. This step could involve using a Cloud Function or a lightweight Dataflow job to listen to the Pub/Sub topic, flatten and format the data to match your BigQuery schema, and then send it to another Pub/Sub topic. The Dataflow template could then process this correctly formatted data.
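As a rough illustration of that preprocessing step, here is a sketch of a Pub/Sub-triggered Cloud Function (1st gen signature) with hypothetical project and topic names. It assumes the Debezium MongoDB connector delivers the changed document as a JSON string in the event's "after" field; adjust the parsing to match your connector configuration:

```python
import base64
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical output topic that the Dataflow template would subscribe to.
OUTPUT_TOPIC = publisher.topic_path("my-project", "flattened-mongo-cdc")


def flatten(prefix, value, out):
    """Recursively flatten nested dicts into underscore-joined top-level keys."""
    if isinstance(value, dict):
        for key, nested in value.items():
            flatten(f"{prefix}_{key}" if prefix else key, nested, out)
    else:
        out[prefix] = value


def preprocess(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st gen signature)."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # The Debezium MongoDB connector serialises the changed document as a
    # JSON string in "after"; adjust if your connector settings differ.
    document = json.loads(message.get("after") or "{}")

    flat = {}
    flatten("", document, flat)

    # Republish the flattened record for the Dataflow template to consume.
    publisher.publish(OUTPUT_TOPIC, json.dumps(flat).encode("utf-8")).result()
```

The same flattening logic can be reused across all your collections, with one output topic per destination table, so the template still handles the actual BigQuery writes.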