Hi
We use the Kafka BigQuerySinkConnector to update partitioned tables in a dataset. From time to time we get an error on the Google BigQuery side and the connector fails:
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "The job encountered an internal error during execution and was unable to complete successfully.",
"reason" : "jobInternalError"
} ],
"message" : "The job encountered an internal error during execution and was unable to complete successfully.",
"status" : "INVALID_ARGUMENT"
}
On the GCP side we have no information about this error.
After a connector restart it works again, but do you have any hint as to why this error happens?
Thanks
The jobInternalError is a broad error message, and it can be triggered by various underlying issues. If restarting the connector resolves the problem, it suggests that the issue might be transient or related to the connector's state or environment at that particular time. Here are some potential reasons for this error:
Resource Constraints: The connector might have run into resource constraints, either in terms of CPU, memory, or network. Restarting the connector would free up and reallocate these resources.
Connector State: Connectors maintain an internal state, and sometimes this state can become corrupted or inconsistent, leading to errors. Restarting the connector resets its internal state.
Throttling or Quota Exceedance: If the connector is making too many requests in a short period, it might hit BigQuery's rate limits or quotas. After a pause (like a restart), the quotas might reset or the request rate might drop, allowing the connector to function again.
Temporary Network Issues: There could have been a brief network disruption between your Kafka cluster and BigQuery, causing the connector to fail. The network might have stabilized by the time you restarted the connector.
Data Issues: Sometimes, a particular batch of data might cause issues (e.g., schema mismatch, corrupted data). If the connector processes data in batches and moves to the next batch after a restart, it might bypass the problematic data.
Concurrent Modifications: If there are other processes or jobs modifying the BigQuery table or dataset at the same time the connector is trying to write data, it might lead to conflicts or errors.
To get a clearer picture of why the error occurred, you can:
Check Logs: Look at the Kafka Connect worker logs around the failure for the full BigQueryException and the ID of the failed job.
Monitor Resources: Watch CPU, memory, and network usage on the workers running the connector to rule out resource pressure.
GCP Operations Suite: Use Cloud Logging and Cloud Monitoring to look for BigQuery job errors, quota usage, and slot contention around the time of the failure.
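If you can grab the job ID from the connector logs, a minimal sketch like the following (using the com.google.cloud.bigquery Java client; the project and job IDs are placeholders) can pull the error details BigQuery recorded for that job:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryError;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;

public class InspectFailedJob {
  public static void main(String[] args) {
    // Uses application default credentials for the current project.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder IDs: replace with the project and job ID from the connector logs.
    Job job = bigquery.getJob(JobId.of("my-project", "my_failed_job_id"));
    if (job == null) {
      System.out.println("Job not found");
      return;
    }

    // Primary error recorded for the job (null if the job succeeded).
    BigQueryError error = job.getStatus().getError();
    System.out.println("Error: " + error);

    // All errors encountered during execution, including non-fatal ones.
    System.out.println("Execution errors: " + job.getStatus().getExecutionErrors());
  }
}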
Thanks for your detailed answer!
Yes, that's the problem with this error: it's broad and it's hard to know what happened. The BigQuery connector does not seem to catch this error as a retryable error, so it automatically fails and requires a manual restart.
I don't think it's a data issue, since the connector should not skip data, but it could be a network issue; we already had a problem with another connector because of a network error. We fixed it, but I think we can investigate this more.
Check Logs: We already checked the logs and we have no more information except this BigQueryException, and the other requests running at the same time (since a batch merge happens every 5 minutes) didn't fail.
Monitor Resources: We do not see any unusual resource usage around this problem in our monitoring system; we will keep monitoring and maybe we will find something.
GCP Operations Suite: It seems something is missing in our setup for this monitoring tool, since I don't see many logs in Logs Explorer. I will check with my team.
On the other hand, someone told me that we have had the Cloud Support service for a few months now, so I can also check with Google support whether they have more detail on their side.
It sounds like you've taken a comprehensive approach to troubleshooting the issue. Given the steps you've already taken, reaching out to Google Cloud Support is a good idea.
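In the meantime, once the BigQuery audit logs do show up in Logs Explorer, a small sketch along these lines (assuming the google-cloud-logging Java client; the filter is only a starting point) can list recent BigQuery error entries programmatically:

import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;

public class ListBigQueryErrors {
  public static void main(String[] args) throws Exception {
    // Uses application default credentials for the current project.
    try (Logging logging = LoggingOptions.getDefaultInstance().getService()) {
      // Assumed filter: BigQuery audit entries at ERROR severity or above.
      String filter = "resource.type=\"bigquery_resource\" AND severity>=ERROR";
      for (LogEntry entry : logging.listLogEntries(Logging.EntryListOption.filter(filter)).iterateAll()) {
        System.out.println(entry.getTimestamp() + " " + entry.getPayload());
      }
    }
  }
}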
We had a reply from Google Cloud Support:
The BigQuery Engineering Team shared that your failed job was affected by a spike in traffic which caused system overload in a short period of time.
With this, they have suggested you rerun the job and advise if the same error still appears at your end.
So the only way to keep the connector running instead of failing when this error happens is to modify the Kafka connector code to retry the job when this error occurs.
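As a rough sketch of that idea (the helper name, attempt count, and backoff below are assumptions, and depending on your connector version there may already be retry settings such as bigQueryRetry / bigQueryRetryWait worth checking first), the connector's write path could wrap its BigQuery job call like this:

import com.google.cloud.bigquery.BigQueryException;

public final class RetryOnJobInternalError {

  // Hypothetical helper: retries a BigQuery operation when it fails with the
  // transient "jobInternalError" reason, or with any error the client library
  // already marks as retryable. Attempt count and backoff are assumptions.
  public static <T> T callWithRetry(java.util.concurrent.Callable<T> operation) throws Exception {
    int maxAttempts = 5;
    long backoffMs = 1000;
    for (int attempt = 1; ; attempt++) {
      try {
        return operation.call();
      } catch (BigQueryException e) {
        boolean transientInternalError = "jobInternalError".equals(e.getReason());
        if ((!transientInternalError && !e.isRetryable()) || attempt >= maxAttempts) {
          throw e; // give up: not retryable, or out of attempts
        }
        Thread.sleep(backoffMs);
        backoffMs *= 2; // simple exponential backoff between attempts
      }
    }
  }
}

The task would then invoke its insert or merge job through callWithRetry(...) instead of calling BigQuery directly, so a jobInternalError triggers a backoff and retry rather than a task failure.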