I have an hourly dbt job running. Last night, between 3am-4am, the job started failing and 9 files/models started producing errors.
My dbt workflow ingests JSON data from MongoDB and saves the output to Google BigQuery tables. Last night, suddenly, the code in these 9 files started producing the error: Column of type JSON cannot be used in SELECT DISTINCT.
My code in these 9 files takes on the form:
SELECT DISTINCT col1, col2, json_col3 …
There are many JSON type columns in my data and I only unnest some of them. This worked previously without issue in 9 files, for months. Now, there is a problem. Was there a change to Google BigQuery that no longer allows SELECT DISTINCT with JSON columns? I don’t see any mention of such a change in the release notes, nor any release notes for today.
It's always tricky when things that worked suddenly start failing! BigQuery indeed doesn't support using JSON columns directly within a SELECT DISTINCT statement. However, it seems like this was working for you previously and only started failing recently.
The issue might be related to changes in the data or its structure. It's possible that alterations in the data within those JSON columns occurred between 3 am and 4 am last night, leading to complications when the SELECT DISTINCT operation was applied.
One possible reason could be changes in the data itself within the JSON columns that might have introduced duplicate values when the SELECT DISTINCT operation was applied. If previously the JSON structures within these columns didn't have duplicate content in the selected fields, it might have seemed to work fine. But a recent change could have introduced duplicate values or complex JSON structures that aren't compatible with the SELECT DISTINCT operation.
My suggestion would be to inspect the data within these JSON columns during the time when the job started failing. Look for any alterations or duplicates in the data that might be causing the issue. Additionally, review any recent changes made to the JSON data ingestion or the dbt models to see if they might have affected the way these JSON columns are handled within the SELECT DISTINCT operation.