
Recover from failure or partial data load from Spark DataFrame to GCP BigQuery table

Hi,
 
How can I recover from a failure partway through loading a Spark DataFrame into a GCP BigQuery table?
 
For example: suppose I have 1 million records in a Spark DataFrame, 50% of the records have been loaded into the BigQuery table, and then the job fails.
 
Before I restart the job, I want to remove the partially loaded data from the table, or restore the table to the state it was in before the load started (can I use time travel to do that?).
 
I need to handle this case automatically in PySpark.
 
Basically, either load all the records from the DataFrame or load nothing in case of failure; in other words, commit records only when all of them have been inserted successfully.
 
Thanks, and your help is much appreciated.
 
 
1 ACCEPTED SOLUTION

If the Spark job is the ONLY job appending to the BigQuery table, then you can indeed use time travel to return the table to a given moment in time (i.e., before the Spark job started). If you then re-ran the Spark job after "rolling back" the changes, you would have idempotency. Another thought would be to have your Spark job write the DataFrame out as a Google Cloud Storage object in a bucket and then perform a BigQuery LOAD of that object into the table. A load job is transactional (it's all or nothing).
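A minimal sketch of the time-travel rollback, assuming the google-cloud-bigquery client library is available; the project, dataset, table, and restore timestamp below are hypothetical placeholders:

```python
# Sketch: roll the table back to a point in time using BigQuery time travel.
# Assumes google-cloud-bigquery is installed; all names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")      # hypothetical project
table = "my-project.my_dataset.my_table"            # hypothetical table

# Timestamp captured just BEFORE the Spark job started writing.
restore_point = "2024-01-01 00:00:00+00"            # placeholder timestamp

# Overwrite the table with its own contents as of the restore point.
# FOR SYSTEM_TIME AS OF is BigQuery's time-travel clause (7-day window
# by default). Note that CREATE OR REPLACE rewrites the table, so verify
# that partitioning/clustering and other table options come through the
# way you need.
query = f"""
CREATE OR REPLACE TABLE `{table}` AS
SELECT * FROM `{table}`
  FOR SYSTEM_TIME AS OF TIMESTAMP '{restore_point}'
"""
client.query(query).result()  # blocks until the rollback completes
```

You would record the restore timestamp just before kicking off the Spark write, then run this rollback in a failure handler before retrying the job.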

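And a sketch of the stage-to-GCS-then-LOAD pattern; `df` is the Spark DataFrame from the question, and the bucket, paths, and table name are hypothetical:

```python
# Sketch: stage the DataFrame in GCS, then load it atomically into BigQuery.
# A failed Spark write leaves only harmless files in the bucket; a BigQuery
# load job commits all rows or none.
from google.cloud import bigquery

staging_uri = "gs://my-staging-bucket/exports/run-001"   # hypothetical path

# 1) Write the DataFrame to GCS as Parquet. If this step fails partway,
#    the target table is untouched.
df.write.mode("overwrite").parquet(staging_uri)

# 2) Load the staged files into the table in a single, all-or-nothing job.
client = bigquery.Client(project="my-project")           # hypothetical project
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    f"{staging_uri}/*.parquet",
    "my-project.my_dataset.my_table",                    # hypothetical table
    job_config=job_config,
)
load_job.result()  # raises on failure; no partial rows are committed
```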

2 REPLIES


That's good 🙂, but can I also do that from my Android phone? I hope it works.