
Dataflow MongoDB to BigQuery template: Missing foreign keys in BigQuery table

Hello, I'm new to Dataflow and have created a pipeline to transfer data from a MongoDB collection to a BigQuery table using the MongoDB to BigQuery template. However, I've run into an issue: fields holding a foreign key that references another collection in the MongoDB database (of type ObjectId) are missing from the BigQuery table. Can you please explain why this happens and suggest a fix?

1 ACCEPTED SOLUTION

UPD: the Dataflow MongoDB to BigQuery template doesn't parse fields of type ObjectId. I fixed it by adding a UDF that parses the data coming from MongoDB. Here is the function:

```
function extractCompanyId(obj) {
    if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
        return obj.company.$oid;
    } else {
        return null;
    }
}

function processVehicles(inJson) {
    // Parse the input JSON string
    var obj = JSON.parse(inJson);

    // Extract company ID if present
    obj.company_id = extractCompanyId(obj);

    // Convert the modified JSON object back to string
    return JSON.stringify(obj);
}
```
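
To sanity-check the UDF locally (outside Dataflow, e.g. with Node.js), you can feed it a document in MongoDB's canonical extended JSON form, where an ObjectId is serialized as an `{"$oid": "<hex>"}` wrapper — which is the shape the `obj.company.$oid` access in the UDF assumes. The sample values below are only an illustration, and depending on the template version the serialized shape of an ObjectId may differ:

```javascript
function extractCompanyId(obj) {
    if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
        return obj.company.$oid;
    }
    return null;
}

function processVehicles(inJson) {
    var obj = JSON.parse(inJson);
    obj.company_id = extractCompanyId(obj);
    return JSON.stringify(obj);
}

// Canonical extended JSON input: ObjectId fields arrive as {"$oid": "<hex>"}.
var sample = JSON.stringify({
    _id: { $oid: "661405f6d08f928ee5c1923c" },
    company: { $oid: "65e73e48019b0576d763c021" },
    color: "black"
});

console.log(JSON.parse(processVehicles(sample)).company_id);
// "65e73e48019b0576d763c021"
```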


3 REPLIES

By the way, I don't define a schema on the BigQuery table, but I don't think that's what's causing this...

The data in MongoDB has the following format:
```
{
  "_id": "661405f6d08f928ee5c1923c",
  "company": "65e73e48019b0576d763c021",
  "color": "black",
  "deleted": false,
  "createdAt": "2024-04-08T14:57:58.965Z",
  "updatedAt": "2024-04-09T08:29:44.743Z"
}
```
Here `company` is a foreign key.

In BigQuery I get it like this:
```
{
  "id": "661405f6d08f928ee5c1923c",
  "source_data": "{\"_id\":{\"timestamp\":1712588278,\"counter\":12685884,\"randomValue1\":13668242,\"randomValue2\":-28955},\"company\":{\"timestamp\":1709653576,\"counter\":6537249,\"randomValue1\":105221,\"randomValue2\":30423},\"color\":\"black\",\"createdAt\":\"Apr 8, 2024, 2:57:58 PM\",\"updatedAt\":\"Apr 9, 2024, 8:29:44 AM\"}",
  "timestamp": "2024-04-09 10:54:50.151000 UTC"
}
```


Hi @happyhooter, the issue you're describing is likely related to how the MongoDB to BigQuery template handles data. ObjectId fields in MongoDB don't always convert automatically into a readable format in BigQuery, which can cause certain values not to transfer correctly.

Possible Solutions:

🔹 Convert ObjectId to String before transferring – In MongoDB, you can add an extra field that stores the ObjectId as a string to ensure proper conversion.

🔹 Preprocess data in Dataflow – If you're using Apache Beam in Dataflow, you can apply a transformation to convert ObjectId fields before writing to BigQuery.

🔹 Check your BigQuery schema – Ensure that the corresponding column in BigQuery is set to the correct data type to handle the incoming values properly.
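
For the Dataflow preprocessing option, if the collection has several ObjectId references, a more general variant of the UDF approach in the accepted solution is to walk the whole document and unwrap every `{"$oid": ...}` wrapper instead of naming each field. This is only a sketch, under the same assumption that the UDF receives documents as canonical extended JSON; the function names are illustrative:

```javascript
// Recursively replace every {"$oid": "<hex>"} wrapper with its plain
// hex string, so all ObjectId references land in BigQuery as strings.
function unwrapObjectIds(value) {
    if (value === null || typeof value !== 'object') {
        return value;
    }
    if (typeof value.$oid === 'string') {
        return value.$oid;
    }
    if (Array.isArray(value)) {
        return value.map(unwrapObjectIds);
    }
    var out = {};
    for (var key in value) {
        if (Object.prototype.hasOwnProperty.call(value, key)) {
            out[key] = unwrapObjectIds(value[key]);
        }
    }
    return out;
}

function transformDocument(inJson) {
    return JSON.stringify(unwrapObjectIds(JSON.parse(inJson)));
}
```

Unlike the field-specific UDF, this keeps the original field names (`_id`, `company`) and also handles ObjectIds nested inside arrays or subdocuments.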

A Simpler Alternative

If you want a more flexible solution without dealing with these manual adjustments, tools like Windsor.ai can help seamlessly integrate MongoDB with BigQuery, handling compatibility issues and transformations automatically.

Hope this helps!