
Hive Partition Load - Unable to set a field mode to REQUIRED during load from GCS

I am attempting to load externally partitioned data into a BigQuery table using Python. The BQ table in question has one REQUIRED field, which corresponds to the custom partition key schema in the GCS uri.

For example, my custom uri which I use as the `source_uri_prefix` looks like this:

gs://my_bucket/my_table/{dt:DATE}

And my table schema is like this:

[
  bigquery.SchemaField("field_a", "STRING"),
  bigquery.SchemaField("field_b", "STRING"),
  bigquery.SchemaField("dt", "DATE", mode="REQUIRED"),
]

However, whenever I attempt to load, I get this error:

ERROR - 400 Provided Schema does not match Table project:dataset.table$YYYYMMDD. Field dt has changed mode from REQUIRED to NULLABLE

Is there a way to set the mode of a custom key (in this case `dt`) in the source_uri_prefix to REQUIRED? Or is any custom key in the source_uri_prefix implicitly required, because it is automatically added during the load?

2 REPLIES

When using source_uri_prefix in BigQuery to load externally partitioned data, BigQuery automatically extracts the partition key from the specified path and adds it to the table. The partition key (in this case, dt) must be present in the URI for the load operation to succeed, but the column BigQuery infers for it is NULLABLE, which is what the mode mismatch in your error message refers to.

Beyond that, the error might stem from mismatched schema definitions, data type issues, or the way the source_uri_prefix is structured.

Here are a few things to check and ensure (a minimal load sketch follows the list):

  1. URI and Table Schema Alignment: Make sure that the date format in the source_uri_prefix exactly matches the expected format in the BigQuery table schema. If the date format or the naming convention differs, it could cause the load operation to fail.

  2. Data in GCS: Ensure that the data files in GCS have the required fields (in this case, the dt field) and that the values align with the format expected by BigQuery.

  3. Table Configuration: Ensure that the table is correctly configured to accept externally partitioned data, and that the partition key (dt) is defined correctly in the schema.

  4. Loading Method: Make sure you are using the appropriate method for loading externally partitioned data. For example, using bigquery.LoadJobConfig and setting the source_format appropriately (e.g., PARQUET, CSV).
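
If it helps, here is a minimal sketch of such a load using the Python client. The project, dataset, table, and bucket names are placeholders, and I am assuming CSV files with a single header row; adjust to your setup:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination; the $YYYYMMDD decorator targets a single partition,
# mirroring the table$YYYYMMDD in your error message.
table_id = "my_project.my_dataset.my_table$20240101"

# CUSTOM mode declares the key name and type in the prefix itself; the data
# files still live under hive-style paths such as .../dt=2024-01-01/.
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "CUSTOM"
hive_opts.source_uri_prefix = "gs://my_bucket/my_table/{dt:DATE}"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assuming a header row in each file
    hive_partitioning=hive_opts,
    schema=[
        bigquery.SchemaField("field_a", "STRING"),
        bigquery.SchemaField("field_b", "STRING"),
        # dt comes from the URI, not from the files. BigQuery infers partition
        # keys as NULLABLE, so declaring it REQUIRED here produces the
        # "changed mode from REQUIRED to NULLABLE" error you are seeing.
        bigquery.SchemaField("dt", "DATE"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my_bucket/my_table/*",  # must fall under the source_uri_prefix
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load to finish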

Hi, thanks for the response!

Regarding the points you mentioned to check and ensure during load:

  1. The date format is `DATE` in the source_uri_prefix and also `DATE` in the schema, so I don't think there is an issue there.
  2. The data files in GCS cannot have `dt` in the source files, otherwise I get an error like the one below:
    ERROR - 400 Error while reading data, error message: Failed to add partition key dt (type: TYPE_DATE) to schema, because another column with the same name was already present. This is not allowed.
    The files, for example, have `field_a` and `field_b`, while `dt` is specified at the source_uri_prefix level, not in the source files themselves (see the layout example after this list). Please let me know if I misunderstood your point here.
  3. The partition key is included in the schema already
  4. I am loading CSV, and it is set in the LoadJobConfig as `bigquery.SourceFormat.CSV`, so I don't think there is an issue there.
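
For concreteness, here is the layout I am describing (bucket, path, and values are just examples); the partition value appears only in the object path:

gs://my_bucket/my_table/dt=2024-01-01/data.csv

where data.csv contains only the non-partition columns:

field_a,field_b
foo,bar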

Is there something that I'm missing or should I set `dt` to have a NULLABLE mode in the schema file?