Hello, I'm new to Dataflow and have created a pipeline to transfer data from a MongoDB collection to a BigQuery table using MongoDB to BigQuery template. However, I've encountered an issue where data with a foreign key referencing another table in the MongoDB database (of type ObjectId) is missing in the BigQuery table. Can you please explain why this is happening and suggest a solution to fix it?
By the way, I don't create a schema for the BigQuery table, but I don't think that's what's causing this...
I have the data in MongoDB in the following format:
```
{
  "_id": "661405f6d08f928ee5c1923c",
  "company": "65e73e48019b0576d763c021",
  "color": "black",
  "deleted": false,
  "createdAt": "2024-04-08T14:57:58.965Z",
  "updatedAt": "2024-04-09T08:29:44.743Z"
}
```
Here, company is a foreign key (an ObjectId referencing another collection).
In BigQuery, the same document arrives with the company value missing.
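A likely explanation, inferred from the UDF fix below rather than from the template's internals: the template serializes each document as MongoDB Extended JSON before handing it to the pipeline, so ObjectId fields arrive wrapped in objects rather than as plain strings, and a wrapped value like that doesn't slot into a flat column. With the sample values above, the serialized document looks roughly like this:

```
// Approximately what the pipeline sees for the sample document above:
// ObjectId fields are wrapped in {"$oid": ...} objects (Extended JSON).
var serializedDoc = {
  "_id": { "$oid": "661405f6d08f928ee5c1923c" },
  "company": { "$oid": "65e73e48019b0576d763c021" },
  "color": "black",
  "deleted": false,
  "createdAt": "2024-04-08T14:57:58.965Z",
  "updatedAt": "2024-04-09T08:29:44.743Z"
};
```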
UPD: The Dataflow MongoDB to BigQuery template doesn't parse columns of type ObjectId. I fixed it by adding a UDF that parses the data coming from MongoDB. Here it is:
```
function extractCompanyId(obj) {
  if (obj && obj.company && typeof obj.company.$oid !== 'undefined') {
    return obj.company.$oid;
  }
  return null;
}

function processVehicles(inJson) {
  // Parse the input JSON string
  var obj = JSON.parse(inJson);
  // Extract company ID if present
  obj.company_id = extractCompanyId(obj);
  // Convert the modified JSON object back to string
  return JSON.stringify(obj);
}
```
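A quick way to sanity-check the UDF locally, as a sketch runnable with Node.js if you append it to the same file as the functions above (the sample values are taken from the document earlier in the thread):

```
// Simulate the template's input: an Extended JSON string per document.
var input = JSON.stringify({
  _id: { $oid: "661405f6d08f928ee5c1923c" },
  company: { $oid: "65e73e48019b0576d763c021" },
  color: "black"
});

// The UDF should add a flat company_id string field.
var output = JSON.parse(processVehicles(input));
console.log(output.company_id); // -> "65e73e48019b0576d763c021"
```

To deploy it, upload the .js file to Cloud Storage and point the template's UDF parameters at it; in the current MongoDB to BigQuery template docs those are javascriptDocumentTransformGcsPath and javascriptDocumentTransformFunctionName (set to processVehicles), but verify the names against your template version.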
Hi @happyhooter, the issue you're describing is likely related to how the MongoDB to BigQuery template handles data. ObjectId fields in MongoDB don't always convert automatically into a readable format in BigQuery, which can cause certain values not to transfer correctly.
Possible Solutions:
🔹 Convert ObjectId to String before transferring – In MongoDB, you can add an extra field that stores the ObjectId as a string to ensure proper conversion (see the mongosh sketch after this list).
🔹 Preprocess data in Dataflow – If you're using Apache Beam in Dataflow, you can apply a transformation that converts ObjectId fields before writing to BigQuery; the JavaScript UDF in the update above is exactly this kind of preprocessing.
🔹 Check your BigQuery schema – Ensure the corresponding column in BigQuery has the right data type for the incoming values (STRING, if you convert the ObjectId to a string).
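For the first option, a minimal sketch in mongosh (assuming the collection is called vehicles, going by the processVehicles UDF above; $toString needs MongoDB 4.0+ and pipeline-style updates need 4.2+):

```
// Materialize a plain-string copy of the ObjectId reference so the
// template can export it without any UDF. Only touches documents
// where company is actually an ObjectId.
db.vehicles.updateMany(
  { company: { $type: "objectId" } },
  [{ $set: { company_id: { $toString: "$company" } } }]
);
```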
A Simpler Alternative
If you want a more flexible solution without dealing with these manual adjustments, tools like Windsor.ai can help seamlessly integrate MongoDB with BigQuery, handling compatibility issues and transformations automatically.
Hope this helps!