
BQ table partition data got deleted fully even though 'spark.sql.sources.partitionOverwriteMode' is set to 'DYNAMIC'

We have a date (DD/MM/YYYY) partitioned BQ table. We want to update the data of a specific partition in 'overwrite' mode using PySpark. To do this, I set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC' at the session level, as per the Spark BigQuery connector documentation. But it still deleted the data in the other partitions, which should not happen.

df_with_partition.write.format("bigquery") \
                .option("table", f"{bq_table_full}") \
                .option("partitionField", f"{partition_date}") \
                .option("partitionType", f"{bq_partition_type}") \
                .option("temporaryGcsBucket", f"{temp_gcs_bucket}") \
                .option("spark.sql.sources.partitionOverwriteMode", "DYNAMIC") \
                .option("writeMethod", "indirect") \
                .mode("overwrite") \
                .save()

Can anyone please suggest what I am doing wrong, or how to implement this dynamic partitionOverwriteMode correctly? Many thanks.


Hi @soumiknow,

Welcome to the Google Cloud Community!

You’re on the right track with setting ‘partitionOverwriteMode’ to ‘DYNAMIC’, but the problem is where you’re setting it: it needs to be configured at the Spark session level, not passed as one of the write options.

To address your question, here are potential ways that might help with your use case:

  • Use the Spark session configuration: Ensure ‘spark.sql.sources.partitionOverwriteMode’ is set at the Spark session level, before performing the write, rather than inside the writer’s ‘.option()’ calls (see the first sketch after this list).
  • Write operation: With that session setting in place, ‘.mode("overwrite")’ combined with ‘DYNAMIC’ partition overwrite replaces only the partitions present in the DataFrame you are writing; the other partitions are left untouched.
  • Review your BigQuery logs: If the write fails, check the BigQuery job logs in the Google Cloud Console for detailed error messages.
  • Review your dataset: Before executing the write, check which partitions exist in the BigQuery table, and afterwards verify which partition's data was overwritten (see the second sketch after this list).
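
For reference, here is a minimal sketch of the session-level approach. It assumes your existing ‘df_with_partition’ DataFrame, and the table, partition column, and bucket names below are placeholders rather than your actual values:

from pyspark.sql import SparkSession

# Reuse (or create) the session and set dynamic partition overwrite
# BEFORE calling .write(); this is the key difference from your snippet.
spark = SparkSession.builder.appName("bq-dynamic-partition-overwrite").getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

# Placeholder values for illustration only
bq_table_full = "my_project.my_dataset.my_table"
partition_date = "event_date"      # the DATE column the table is partitioned on
bq_partition_type = "DAY"
temp_gcs_bucket = "my-temp-bucket"

df_with_partition.write.format("bigquery") \
    .option("table", bq_table_full) \
    .option("partitionField", partition_date) \
    .option("partitionType", bq_partition_type) \
    .option("temporaryGcsBucket", temp_gcs_bucket) \
    .option("writeMethod", "indirect") \
    .mode("overwrite") \
    .save()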

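To check which partitions exist and when they were last modified (before and after the write), here is a quick sketch using the google-cloud-bigquery Python client and the INFORMATION_SCHEMA.PARTITIONS view; the project, dataset, and table names are again placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names for illustration only
project, dataset, table = "my_project", "my_dataset", "my_table"

# List each partition with its row count and last modification time
query = f"""
    SELECT partition_id, total_rows, last_modified_time
    FROM `{project}.{dataset}.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = '{table}'
    ORDER BY partition_id
"""

for row in client.query(query).result():
    print(row.partition_id, row.total_rows, row.last_modified_time)
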
You may refer to this documentation, which might help you understand how to use the BigQuery connector together with the Spark session configuration.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.