
Dataflow MongoDB to BigQuery template: Missing foreign keys in BigQuery table

Hello, I'm new to Dataflow. I've created a pipeline to transfer data from a MongoDB collection to a BigQuery table using the MongoDB to BigQuery template. However, I've run into an issue: fields holding a foreign key that references another collection in the MongoDB database (of type ObjectId) are missing from the BigQuery table. Can you please explain why this is happening and suggest a solution?

Solved
1 ACCEPTED SOLUTION

UPD: The Dataflow MongoDB to BigQuery template doesn't parse columns of type ObjectId. I fixed it by adding a UDF that parses the data coming from MongoDB. Here is the function:

```
function extractCompanyId(obj) {
    // Return the hex id from an extended-JSON ObjectId reference,
    // e.g. {"$oid": "65e73e48019b0576d763c021"}, or null if absent.
    if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
        return obj.company.$oid;
    } else {
        return null;
    }
}

function processVehicles(inJson) {
    // Parse the input JSON string
    var obj = JSON.parse(inJson);

    // Extract company ID if present
    obj.company_id = extractCompanyId(obj);

    // Convert the modified JSON object back to string
    return JSON.stringify(obj);
}
```
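A quick way to sanity-check the UDF locally (the $oid shape of the sample is an assumption based on what extractCompanyId expects to receive):

```
// Hypothetical sample input in extended-JSON form, matching the shape
// that extractCompanyId() checks for.
var sample = JSON.stringify({
    _id: { $oid: "661405f6d08f928ee5c1923c" },
    company: { $oid: "65e73e48019b0576d763c021" },
    color: "black"
});

// processVehicles(sample) returns the same document plus
// "company_id": "65e73e48019b0576d763c021".
```

To wire it up, I uploaded the file with these functions to Cloud Storage and passed its path plus the function name (processVehicles) as the template's UDF parameters; the exact parameter names depend on the template version, so check the template's parameter list.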


3 REPLIES

By the way, I don't define a schema on the BigQuery table, but I don't think that's the cause.

I have the data in MongoDB in the following format:
```
{
  "_id": "661405f6d08f928ee5c1923c",
  "company": "65e73e48019b0576d763c021",
  "color": "black",
  "deleted": false,
  "createdAt": "2024-04-08T14:57:58.965Z",
  "updatedAt": "2024-04-09T08:29:44.743Z"
}
```
where `company` is a foreign key.

In BigQuery it arrives like this:
```
{
  "id": "661405f6d08f928ee5c1923c",
  "source_data": "{\"_id\":{\"timestamp\":1712588278,\"counter\":12685884,\"randomValue1\":13668242,\"randomValue2\":-28955},\"company\":{\"timestamp\":1709653576,\"counter\":6537249,\"randomValue1\":105221,\"randomValue2\":30423},\"color\":\"black\",\"createdAt\":\"Apr 8, 2024, 2:57:58 PM\",\"updatedAt\":\"Apr 9, 2024, 8:29:44 AM\"}",
  "timestamp": "2024-04-09 10:54:50.151000 UTC"
}
```
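Those timestamp/counter/randomValue entries look like the internal fields of the BSON ObjectId as the Java driver serializes it, so the original 24-character hex id is still recoverable. A minimal sketch, assuming the driver's byte layout (4-byte timestamp, 3-byte randomValue1, 2-byte randomValue2, 3-byte counter):

```
// Rebuild the 24-char hex ObjectId from its serialized fields.
function objectIdToHex(o) {
    function toHex(value, numBytes) {
        // Treat the value as unsigned and keep only the low numBytes.
        var masked = (value >>> 0) % Math.pow(2, numBytes * 8);
        var s = masked.toString(16);
        while (s.length < numBytes * 2) {
            s = '0' + s;
        }
        return s;
    }
    return toHex(o.timestamp, 4) + toHex(o.randomValue1, 3) +
           toHex(o.randomValue2, 2) + toHex(o.counter, 3);
}

// objectIdToHex({ timestamp: 1712588278, counter: 12685884,
//                 randomValue1: 13668242, randomValue2: -28955 })
// -> "661405f6d08f928ee5c1923c", which matches the _id above.
```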


Hi @happyhooter, the issue you're describing is likely related to how the MongoDB to BigQuery template handles data: ObjectId fields in MongoDB don't always convert automatically into a readable format in BigQuery, which can cause those values not to transfer correctly.

Possible Solutions:

🔹 Convert ObjectId to String before transferring – In MongoDB, you can add an extra field that stores the ObjectId as a string to ensure proper conversion (see the sketch after this list).

🔹 Preprocess data in Dataflow – If you're using Apache Beam in Dataflow, you can apply a transformation to convert ObjectId fields before writing to BigQuery.

🔹 Check your BigQuery schema – Ensure that the corresponding column in BigQuery is set to the correct data type to handle the incoming values properly.
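For the first option, a minimal mongosh sketch (the vehicles collection and company field names are illustrative; pipeline-style updates need MongoDB 4.2+):

```
// Hypothetical example: materialize the ObjectId reference as a plain
// string field so it survives the transfer as-is.
db.vehicles.updateMany(
    { company: { $type: "objectId" } },
    [ { $set: { company_id: { $toString: "$company" } } } ]
);
```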

A Simpler Alternative

If you want a more flexible solution without dealing with these manual adjustments, tools like Windsor.ai can help seamlessly integrate MongoDB with BigQuery, handling compatibility issues and transformations automatically.

Hope this helps!
