
Failed LOAD job: how to get the source JSON from the original request

I have some LOAD jobs in Google Cloud Platform failing with an error like:

"Error while reading data, error message: JSON parsing error in row starting at position 0: No such field..."

So... since I have many failed jobs, I'd like to get the JSON that failed to parse, so I can fix it and re-run the job.

Is it possible to get that "source" JSON with all the data that was uploaded to the job?

Because I'm only able to get the job info, but not the original JSON with the errors.

Thanks

I tried to get the job's original failing JSON, but I only got the job's JSON description/result. I need the source, so I can fix the JSON and re-upload the job.

Context:
The user enters data on the website, and then it is inserted into BigQuery using a LOAD job.
Since the job failed, the user's entered data is lost. So I need to recover (if possible) the JSON with the user's data so I can fix it and upload it again, saving the user from manually re-entering all the data.


It's critical to start by examining the error message within the BigQuery job details. The error message you mentioned, "JSON parsing error in row starting at position 0: No such field...", provides key insights:

  • Position 0: This indicates that the issue likely resides in the first line of your JSON input, suggesting an immediate discrepancy at the beginning of your data file.
  • No such field: This error points to a field in your JSON data that does not exist in your target BigQuery table schema, indicating a schema mismatch or an unexpected field in your JSON.
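As a rough illustration of that second point, a quick local check can surface which JSON keys are absent from the table schema before BigQuery rejects the row. The schema and field names below are made-up examples, not taken from the original job:

```python
import json

# Hypothetical target table schema: the field names the table actually has.
TABLE_FIELDS = {"user_id", "name", "created_at"}

def unknown_fields(json_line: str) -> set:
    """Return the JSON keys that do not exist in the table schema."""
    row = json.loads(json_line)
    return set(row) - TABLE_FIELDS

# A row whose "email" key would trigger "No such field: email".
bad_row = '{"user_id": 1, "name": "Ana", "email": "ana@example.com"}'
print(unknown_fields(bad_row))  # {'email'}
```

Running this over each line of a newline-delimited JSON file pinpoints the mismatched rows without re-submitting the LOAD job.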

Depending on your LOAD job setup and the origin of the JSON, different strategies apply:

  • Scenario A: JSON from Cloud Storage

    • Error Configuration: If your LOAD job was configured to generate error files, BigQuery might have written the problematic data into a separate file in Cloud Storage. Review your job configuration for any settings related to error handling.
    • Recent Input: If the JSON file in Cloud Storage is your most recent upload, consider downloading it or extracting a portion for inspection. Always make a backup before making any changes, especially if subsequent jobs might overwrite the original file.
  • Scenario B: JSON Data Inlined or from Other Sources

    • Web Input: If users enter data on your website, and it's converted into JSON for BigQuery, check if this user-generated JSON is temporarily stored anywhere (e.g., logs, a temporary database) before the LOAD job is triggered. This could be a valuable source for retrieval.
    • Other Sources: Consider the entire data pipeline. If the JSON is generated on-the-fly by a script or application, ensure to log the generated JSON before sending it to BigQuery for future reference.

Troubleshooting and Prevention

  • Schema Inspection: Carefully compare the field names and data types in your JSON against your BigQuery table schema. Any discrepancies can lead to errors.
  • JSON Validation: Prior to future LOAD jobs, validate your JSON against the BigQuery schema to identify formatting and structural issues beforehand. Tools and libraries are available for this purpose.
  • Data Type Casting: Implement necessary conversions if the JSON field types do not match the BigQuery field types directly, paying close attention to data types that are often represented differently (e.g., timestamps).
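The validation and casting steps above can be sketched locally. The schema mapping and field names here are illustrative assumptions, not the original table's schema:

```python
# Assumed BigQuery-style schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "name": str, "created_at": str}

def validate_and_cast(row: dict) -> dict:
    """Drop fields unknown to the schema and cast values to the expected types."""
    cleaned = {}
    for field, expected_type in SCHEMA.items():
        if field in row:
            cleaned[field] = expected_type(row[field])
    return cleaned

raw = {"user_id": "42", "name": "Ana", "extra": "dropped"}
print(validate_and_cast(raw))  # {'user_id': 42, 'name': 'Ana'}
```

Dropping unknown fields mirrors BigQuery's `ignoreUnknownValues` load option; doing it client-side lets you log what was dropped instead of silently losing it.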

Caveats

  • BigQuery does not offer a straightforward way to "return" the original failing JSON from a LOAD job. The strategies mentioned rely on having intermediate storage or logging as part of your data loading process, underscoring the importance of such mechanisms for data integrity and troubleshooting.

Additional Consideration

  • Incremental Approach: For handling large datasets, consider processing and loading the data in smaller batches. This strategy can simplify the process of identifying and isolating errors, allowing for quicker resolution.
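A minimal sketch of the batching idea (the batch size is an arbitrary choice here): loading rows in small chunks means a single bad row only invalidates its own batch, not the whole upload.

```python
def batches(rows, size=500):
    """Yield successive fixed-size chunks of rows, one per LOAD job."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = list(range(1200))
print([len(b) for b in batches(rows)])  # [500, 500, 200]
```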

Hi, 

Thank you for your detailed response.

This is what I'm trying to figure out:

"Web Input: If users enter data on your website, and it's converted into JSON for BigQuery, check if this user-generated JSON is temporarily stored anywhere (e.g., logs, a temporary database) before the LOAD job is triggered. This could be a valuable source for retrieval."

 

I guess those JSONs must be somewhere... but I'm not able to find them.

That's mainly what I was asking: where/how to get those files.

Thanks !

Unfortunately, BigQuery doesn't automatically store the complete source JSON of failed LOAD jobs. The error message "JSON parsing error in row starting at position 0: No such field..." suggests an immediate mismatch between your JSON's structure and your target BigQuery table's schema. We need to try recovering that JSON if possible.

Potential Recovery Strategies

  1. Cloud Storage:

    • Source File: If your LOAD job directly targets a JSON file within Cloud Storage, carefully examine that bucket. Could the original file have been overwritten by later jobs? Is there file versioning enabled? Look for recently updated or overwritten files aligned with the time of your failed jobs.
    • Error Files: Did you configure your LOAD jobs to write problematic rows to error files in Cloud Storage? If so, those files might hold the specific JSON causing the failures. Investigate your LOAD job settings.
  2. Web Input & Application Logic:

    • Web Server Logs: While unlikely to contain full request data by default, examine your web server's access and error logs. Some custom setups might capture request bodies. Look for patterns relating to failed jobs around their timestamps.
    • Temporary Storage: Is there any chance your application briefly stores the user-generated JSON before initiating the BigQuery process? Search your codebase for any form of temporary storage, even as simple as in-memory variables.

Essential Tip: Examine the exact JSON parsing error message carefully. It often provides clues about the field that's mismatched or missing, narrowing down your search.
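As a small example of mining the error text, a regex can pull out the offending field name from the "No such field" message. The exact message format is assumed to match the one quoted above:

```python
import re

def missing_field(error_message: str):
    """Extract the field name from a BigQuery 'No such field' error, if present."""
    match = re.search(r"No such field: (\w+)", error_message)
    return match.group(1) if match else None

msg = "JSON parsing error in row starting at position 0: No such field: email."
print(missing_field(msg))  # email
```

Knowing the exact field name lets you grep your codebase or logs for where that key is produced.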

Prevention for the Future

To avoid this painful issue again, implement safeguards:

  • Intentional Logging: Before sending data to BigQuery, explicitly log generated JSON structures with timestamps and user IDs. This provides you with an independent source to recover from errors.
  • JSON Validation: Before running the LOAD job, have your application validate the JSON against your known BigQuery schema, and fix any mismatches beforehand.
  • Robust Monitoring: If feasible, put in place monitoring on your BigQuery LOAD jobs to alert you quickly when they fail. This will prevent potential overwrites and help identify problem patterns sooner.
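The "Intentional Logging" safeguard above might look like the sketch below. The file path, record layout, and field names are illustrative choices, not an established convention:

```python
import json
import time

def log_payload(payload: dict, user_id: str, path: str = "bq_payloads.jsonl") -> str:
    """Append the outgoing JSON payload, stamped with time and user ID,
    to a local JSONL file before the LOAD job is submitted."""
    record = {"ts": time.time(), "user_id": user_id, "payload": payload}
    line = json.dumps(record)
    with open(path, "a") as f:
        f.write(line + "\n")
    return line

# Call this right before triggering the LOAD job.
log_payload({"name": "Ana"}, user_id="u123")
```

With this in place, a failed job can be replayed by reading the matching record back out of the log, fixing it, and re-submitting, instead of asking the user to re-enter their data.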