
How to find which URIs are causing the load job failure

Hi,

I am loading data into a BigQuery table as follows:

async loadDataToTable(project_id, dataset_id, table_id, uris, source_format, schema) {
  const result = new CustomResponse().result;
  try {
    const options = {
      sourceUris: uris,
      sourceFormat: source_format,
      schema: schema,
      location: this.bigquery_location,
      ignoreUnknownValues: true,
      destinationTable: {
        datasetId: dataset_id,
        projectId: project_id,
        tableId: table_id,
      },
    };

    const [job] = await this.connection.createJob({ configuration: { load: options } });
    logInfo(
      `Job ${job.id} started ${job.metadata.status.state} for dataset ${dataset_id} for table ${table_id}`,
      this.pipe_id,
      Events.MONGODB_QUERY,
    );

    // Wait for the job to complete; this rejects with an ApiError if the job fails
    const [metadata] = await job.promise();

    const status = metadata.status;
    if (status.state === "DONE") {
      logInfo(
        `Job ${job.id} done for dataset ${dataset_id} for table ${table_id}`,
        this.pipe_id,
        Events.MONGODB_QUERY,
      );
      result.data = {
        rows_processed:
          metadata.statistics && metadata.statistics.load && metadata.statistics.load.outputRows
            ? Number(metadata.statistics.load.outputRows)
            : 0,
      };
    } else {
      logInfo(
        `Job current status: ${status.state} for dataset ${dataset_id} for table ${table_id}`,
        this.pipe_id,
        Events.MONGODB_QUERY,
      );
      if (status.errors) {
        result.error_occurred = true;
        result.data = { rows_processed: 0 };
        result.error_info = {
          error: JSON.stringify(status.errors),
          error_source: "BigqueryService.loadDataToTable",
          error_description: `Error occurred while doing load job at BigQuery table ${table_id} in dataset ${dataset_id} for pipe ID: ${this.pipe_id}`,
        };
        return result;
      }
    }
  } catch (e) {
    result.error_occurred = true;
    result.data = { rows_processed: 0 };
    result.error_info = {
      error: e.errors ? JSON.stringify(e.errors) : `${e.name}: ${e.message}`,
      error_stack: e.stack,
      error_source: "BigqueryService.loadDataToTable",
      error_description: `Error occurred while loading data into BigQuery table ${table_id} in dataset ${dataset_id} for pipe ID: ${this.pipe_id}`,
    };
  }
  return result;
}

The problem I am facing: suppose I have 10 URIs to load, and 2 or 3 of them have problems that cause the whole load job to fail. I want to find out which URIs are causing the issue.

I have tried getting the metadata of the job, but it does not identify the problematic URIs.

I would really appreciate it if someone could tell me how to do this.

To pinpoint and resolve the cause of a BigQuery load job failure, it helps to combine several techniques: detailed logging, better error capture, and pre-validation of the input files. Below is a strategy that integrates these practices:

Utilize Detailed Audit Logs

  • Enable Cloud Audit Logging: Captures in-depth information about your load jobs.
  • Isolate Relevant Logs: Use filters in Cloud Logging like:
    protoPayload.resourceName="projects/PROJECT_ID/jobs/JOB_ID" 
    protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" 
    
  • Locate Errors: The logs may not always pinpoint the exact problematic URI, although load error messages often name the offending file and line; the sketch below shows how to pull these entries programmatically.
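
For pulling those audit entries programmatically, here is a minimal sketch using the @google-cloud/logging Node.js client (assumes the library is installed and that projectId/jobId hold your real values):

  const { Logging } = require('@google-cloud/logging');

  async function getJobAuditEntries(projectId, jobId) {
    const logging = new Logging({ projectId });
    const filter = [
      `protoPayload.resourceName="projects/${projectId}/jobs/${jobId}"`,
      `protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob"`,
    ].join(' AND ');
    // Fetch the most recent matching audit entries for this job
    const [entries] = await logging.getEntries({ filter, pageSize: 10 });
    entries.forEach((entry) => console.log(JSON.stringify(entry.data, null, 2)));
  }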

Enhance Error Handling in Your Code

  • Improve Error Capture: Log detailed error information within your loadDataToTable function. This context is crucial, even if BigQuery error objects don't include the problematic URI directly.
  • Consider a Mapping Strategy: A temporary mapping between input URIs and unique job identifiers can streamline troubleshooting; see the per-URI sketch after this list.
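
One way to build that mapping with the loadDataToTable method you posted: submit a separate load job per URI, so every failure maps one-to-one to a file. This is a sketch, not a drop-in fix; it runs n jobs instead of one, so mind BigQuery's per-table load-job quotas, and point it at a scratch table if you do not want partial data landing in the destination:

  // Hypothetical helper: one load job per URI, so failures map 1:1 to files.
  async function findFailingUris(service, project_id, dataset_id, table_id, uris, source_format, schema) {
    const failing = [];
    for (const uri of uris) {
      const res = await service.loadDataToTable(project_id, dataset_id, table_id, [uri], source_format, schema);
      if (res.error_occurred) {
        failing.push({ uri, error: res.error_info.error });
      }
    }
    return failing; // e.g. [{ uri: 'gs://...', error: '[...]' }]
  }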

Pre-Validation as Prevention

  • Proactive Data Checks: Validate file format, delimiters, and encoding before loading.
  • Tools for Validation: Use tools or scripts to check format and schema compatibility before submitting the job; a minimal sketch follows this list.
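
As an illustration, a minimal format check for newline-delimited JSON files in GCS (assumes @google-cloud/storage is installed; CSV or Avro would need a different parser, and the gs:// parsing here is simplified):

  const { Storage } = require('@google-cloud/storage');

  async function validateNdjsonUri(uri) {
    const storage = new Storage();
    const [bucketName, ...parts] = uri.replace('gs://', '').split('/');
    const [contents] = await storage.bucket(bucketName).file(parts.join('/')).download();
    const badLines = [];
    contents.toString('utf8').split('\n').forEach((line, i) => {
      if (!line.trim()) return; // skip empty lines
      try {
        JSON.parse(line);
      } catch (e) {
        badLines.push(i + 1); // 1-based line numbers that fail to parse
      }
    });
    return badLines;
  }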

Address Common Errors Strategically

  • Invalid URIs: Verify existence, accessibility, and correct formatting (a quick existence check is sketched after this list).
  • File Format and Schema Mismatch: Ensure the source_format and provided schema accurately represent your data.
  • Data Quality: Address issues like incorrect encoding, inconsistent delimiters, or values incompatible with the target schema.
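
For the first point, a quick existence/access check with the @google-cloud/storage client (a sketch; gs:// parsing is simplified):

  const { Storage } = require('@google-cloud/storage');

  async function checkUrisExist(uris) {
    const storage = new Storage();
    for (const uri of uris) {
      const [bucketName, ...parts] = uri.replace('gs://', '').split('/');
      const [exists] = await storage.bucket(bucketName).file(parts.join('/')).exists();
      if (!exists) console.warn(`Missing or inaccessible: ${uri}`);
    }
  }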

Iterative Debugging and Resources

  • Isolation: Troubleshoot smaller batches of URIs to pinpoint issues; bisecting the URI list, as sketched after this list, needs far fewer jobs than loading one file at a time.
  • Data Transformation: Use tools like Dataflow or Dataprep for complex cleaning and reformatting.
  • Leverage BigQuery Community and Documentation: Find specific guidance and solutions.
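
A sketch of the batch-isolation idea: tryLoad is any caller-supplied function that loads a list of URIs into a scratch table and resolves to true on success. Bisecting finds each bad file in roughly log2(n) jobs instead of n:

  async function bisectFailures(uris, tryLoad) {
    if (uris.length === 0 || (await tryLoad(uris))) return []; // batch loads cleanly
    if (uris.length === 1) return uris;                        // this URI fails on its own
    const mid = Math.floor(uris.length / 2);
    const left = await bisectFailures(uris.slice(0, mid), tryLoad);
    const right = await bisectFailures(uris.slice(mid), tryLoad);
    return [...left, ...right];
  }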

Example Error Messages

  • "Not Found: URI gs://my-bucket/nonexistent-file.csv"
  • "Unable to parse JSON data"
  • "CSV parse error: Wrong number of columns at line 23"
  • "Could not parse 'abc123' as INT64 for field 'customer_id'"

Hi @ms4446, thank you for your quick response.

The error I am getting is this:

ApiError: Invalid timestamp value: -62164498757000000
    at new ApiError (/Users/john/work/t1/load/node_modules/@google-cloud/common/build/src/util.js:75:15)
    at /Users/john/work/t1/load/node_modules/@google-cloud/bigquery/build/src/job.js:417:23
    at /Users/john/work/t1/load/node_modules/@google-cloud/common/build/src/service-object.js:204:13
    at /Users/john/work/t1/load/node_modules/@google-cloud/common/build/src/util.js:412:25
    at Util.handleResp (/Users/john/work/t1/load/node_modules/@google-cloud/common/build/src/util.js:161:9)
    at /Users/john/work/t1/load/node_modules/@google-cloud/common/build/src/util.js:534:22
    at onResponse (/Users/john/work/t1/load/node_modules/retry-request/index.js:248:7)
    at /Users/john/work/t1/load/node_modules/teeny-request/build/src/index.js:226:13
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  code: undefined,
  errors: [
    {
      reason: 'invalidQuery',
      location: 'query',
      message: 'Invalid timestamp value: -62164498757000000'
    }
  ],
  response: undefined
}
Where can I find the problematic URI from this error?

The error message "ApiError: Invalid timestamp value: -62164498757000000" indicates a problem with a timestamp value in your BigQuery job. It sheds light on two key areas:

  • Timestamp Value: The value -62164498757000000 is outside the range BigQuery accepts for timestamps, which spans years 0001 to 9999. Interpreted as microseconds since the Unix epoch, it lands around year 0000 (see the quick check below); values like this typically signify an underflow or an uninitialized date variable, representing a date far in the past.

  • Error Location: The "location: 'query'" field is how BigQuery labels many job-level validation errors; since this is a load job rather than a SQL query, the likeliest sources are the way timestamps are generated in your code or the data in the source files themselves.
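
A quick sanity check of the raw value, assuming it is microseconds since the Unix epoch (the encoding BigQuery's TIMESTAMP_MICROS and many exporters use):

  const micros = -62164498757000000;
  const date = new Date(micros / 1000); // the JS Date constructor takes milliseconds
  console.log(date.toISOString());      // roughly year 0000, below BigQuery's 0001-01-01 minimum

That points at an uninitialized or underflowed date in the source data rather than anything in BigQuery itself.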

Where to Investigate

  • Query Analysis: Examine your BigQuery SQL query for potential issues:

    • Calculations: Look for date or timestamp calculations that might produce out-of-range values.

    • Hardcoded Timestamps: Check for any hardcoded timestamp values that may be incorrect.

    • Function Usage: Verify the correct usage of BigQuery timestamp functions, referring to BigQuery documentation on timestamp functions for guidance.

  • Data Preparation: If loading data into BigQuery, review how timestamp columns are generated or formatted:

    • Timestamp Generation: Ensure any code-generated timestamps are within the valid range and logic for calculation is correct.

    • Data Source Integrity: For data imported from other systems, confirm the format and integrity of timestamp data in the source; a sketch that scans source files for out-of-range values follows this list.
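
A minimal sketch of such a scan, assuming newline-delimited JSON source files in GCS and a microsecond-epoch field named created_at (both are assumptions; adjust to your format). Reporting the file and line of every out-of-range value also answers your original question of which URIs are bad:

  const { Storage } = require('@google-cloud/storage');

  // BigQuery's TIMESTAMP range, in microseconds since the Unix epoch.
  // (Number precision near these bounds is approximate, which is fine for a sanity check.)
  const MIN_MICROS = -62135596800000000; // 0001-01-01 00:00:00 UTC
  const MAX_MICROS = 253402300799999999; // 9999-12-31 23:59:59.999999 UTC

  async function scanUriForBadTimestamps(uri, field) {
    const storage = new Storage();
    const [bucketName, ...parts] = uri.replace('gs://', '').split('/');
    const [contents] = await storage.bucket(bucketName).file(parts.join('/')).download();
    const bad = [];
    contents.toString('utf8').split('\n').forEach((line, i) => {
      if (!line.trim()) return;
      const value = JSON.parse(line)[field];
      if (typeof value === 'number' && (value < MIN_MICROS || value > MAX_MICROS)) {
        bad.push({ uri, line: i + 1, value });
      }
    });
    return bad;
  }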

Debugging Tips

  • Inspect Input Data: Directly inspect rows of data involved in the query or load process to identify any with erroneous timestamps; this helps narrow down the source of the issue. One systematic variant, sketched after this list, is to stage the raw values as INT64 and flag rows where a safe conversion fails.

  • Isolate the Issue: Break down complex queries into smaller parts to pinpoint where the timestamp manipulation might be going awry.

  • Review Logs: Utilize logs for additional context around the error. BigQuery's job information and query history can also provide insights into the issue.
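
One concrete version of that staging approach, sketched under the assumption that the raw epoch values were loaded into a staging table with the timestamp column typed as INT64 (table and column names below are placeholders). SAFE.TIMESTAMP_MICROS returns NULL for out-of-range input instead of raising an error, so the offending rows can simply be selected:

  const { BigQuery } = require('@google-cloud/bigquery');

  async function findBadTimestampRows() {
    const bigquery = new BigQuery();
    const query = `
      SELECT *
      FROM \`my_project.my_dataset.staging_table\`    -- placeholder staging table
      WHERE created_at IS NOT NULL
        AND SAFE.TIMESTAMP_MICROS(created_at) IS NULL -- out-of-range values only
    `;
    const [rows] = await bigquery.query({ query });
    return rows;
  }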

Additional Considerations

  • Ensure that any conversions between time zones are correctly handled, as this can also lead to unexpected timestamp values.

  • If the data comes from external sources, verify the export or transformation processes to ensure timestamp integrity before import.

  • For a deeper understanding, consult BigQuery's documentation on handling dates and times, including limitations and best practices for timestamp data.