Dataflow MongoDB to BigQuery template: Missing foreign keys in BigQuery table

Hello, I'm new to Dataflow and have created a pipeline to transfer data from a MongoDB collection to a BigQuery table using MongoDB to BigQuery template. However, I've encountered an issue where data with a foreign key referencing another table in the MongoDB database (of type ObjectId) is missing in the BigQuery table. Can you please explain why this is happening and suggest a solution to fix it?

Solved Solved
2 2 108
1 ACCEPTED SOLUTION

UPD: Dataflow MongoDB to BigQuery doesn't parse columns with type ObjectId. I fixed it by adding a UDF function, which parses data from MongoDB. Here is the function:

function extractCompanyId(obj) {
    if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
        return obj.company.$oid;
    } else {
        return null;
    }
}

function processVehicles(inJson) {
    // Parse the input JSON string
    var obj = JSON.parse(inJson);
    
    // Extract company ID if present
    obj.company_id = extractCompanyId(obj);
  
    // Convert the modified JSON object back to string
    return JSON.stringify(obj);
}

View solution in original post

2 REPLIES 2

By the way, I don't create a schema in the BigQuery table. But I think it still doesn't really help...

I have the data in the MongoDB in the following format:
```
{
"_id": "661405f6d08f928ee5c1923c",
"company": "65e73e48019b0576d763c021",
"color": "black",
"deleted": false,
"createdAt": "2024-04-08T14:57:58.965Z",
"updatedAt": "2024-04-09T08:29:44.743Z",
}
```
Where company is a foreign key.

In the BigQuery I get it like this:
```

{
"id": "661405f6d08f928ee5c1923c",
"source_data": "{\"_id\":{\"timestamp\":1712588278,\"counter\":12685884,\"randomValue1\":13668242,\"randomValue2\":-28955},\"company\":{\"timestamp\":1709653576,\"counter\":6537249,\"randomValue1\":105221,\"randomValue2\":30423},\"color\":\"black\","createdAt\":\"Apr 8, 2024, 2:57:58 PM\",\"updatedAt\":\"Apr 9, 2024, 8:29:44 AM\"}",
"timestamp": "2024-04-09 10:54:50.151000 UTC",
}

```

UPD: Dataflow MongoDB to BigQuery doesn't parse columns with type ObjectId. I fixed it by adding a UDF function, which parses data from MongoDB. Here is the function:

function extractCompanyId(obj) {
    if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
        return obj.company.$oid;
    } else {
        return null;
    }
}

function processVehicles(inJson) {
    // Parse the input JSON string
    var obj = JSON.parse(inJson);
    
    // Extract company ID if present
    obj.company_id = extractCompanyId(obj);
  
    // Convert the modified JSON object back to string
    return JSON.stringify(obj);
}