Hi all, I need some support or suggestions.
I'm using the library google-cloud-notebooks==1.7.0. Below is my example code to create an execution and poll its status.
import time
import uuid

from google.cloud.notebooks_v1 import CreateExecutionRequest, Execution, GetExecutionRequest
from google.cloud.notebooks_v1.services.notebook_service import NotebookServiceClient

# credential, PARENT and EXECUTION_TEMPLATE are defined elsewhere in the DAG file.

# Create client
client = NotebookServiceClient(credentials=credential)

# Build the request that creates the execution
request_create_execution = CreateExecutionRequest(
    parent=PARENT,
    execution_id=f"trigger_vertex_notebook_{uuid.uuid4().hex}",
    execution=EXECUTION_TEMPLATE,
)

# Create an execution and wait for the create operation to complete
operation = client.create_execution(request=request_create_execution, timeout=120)
operation_result = operation.result()

# Build the request used to poll the execution status
request_get_execution = GetExecutionRequest(name=operation_result.name)

# Poll once a minute until the execution reaches a terminal state
while True:
    execution_status = client.get_execution(request=request_get_execution)
    if execution_status.state == Execution.State.SUCCEEDED:
        break
    elif execution_status.state == Execution.State.FAILED:
        raise RuntimeError("Execution failed")
    time.sleep(60)
I run this from an Airflow DAG as a scheduled job, and I get an error. The error is intermittent: it occurs in roughly 1 out of 5 runs. The log of the error is below.
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/python3.8/lib/python3.8/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "An internal error has occurred (72adca73-89c5-4634-9d3a-0644405e1e64)"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.75.10:443 {created_time:"2023-08-05T05:33:36.591466057+00:00", grpc_status:13, grpc_message:"An internal error has occurred (72adca73-89c5-4634-9d3a-0644405e1e64)"}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/gcs/dags/trigger_vertex_notebook.py", line 88, in trigger
    execution_status = client.get_execution(request=request_get_execution, timeout=timeout)
  File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/notebooks_v1/services/notebook_service/client.py", line 3970, in get_execution
    response = rpc(
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/gapic_v1/method.py", line 113, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/grpc_helpers.py", line 74, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InternalServerError: 500 An internal error has occurred (72adca73-89c5-4634-9d3a-0644405e1e64
Can you give me some suggestions on how to investigate this issue?
Hi @ltduong,
The error messages you are receiving all point to an internal error, but they carry little information about it. This kind of error usually indicates a problem on the Google Cloud server side; it may be caused by a hardware failure, a software bug, or a network issue.
You can also wait a little and then try again, as the error might resolve on its own. Internal errors are sometimes temporary and go away eventually.
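If the error is transient, retrying only the failing call with a growing delay is one way to apply the "wait and try again" advice. Here is a minimal sketch using only the standard library. `TransientError` and `call_with_backoff` are hypothetical names for illustration, not part of any Google library; in real code you would presumably pass `google.api_core.exceptions.InternalServerError` (the exception shown in the traceback) as the retryable type.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as a 500 response."""


def call_with_backoff(func, retryable=(TransientError,), max_attempts=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func(), retrying only when it raises one of the retryable types."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (1x, 2x, 4x, ...) capped at max_delay,
            # with jitter so parallel tasks do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Any exception type not listed in `retryable` propagates immediately, so a genuine failure is not hidden behind the retries. Assuming the client and request from the original post, usage might look like `call_with_backoff(lambda: client.get_execution(request=request_get_execution), retryable=(InternalServerError,))`.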
Here are some usable resources that might be of help.
Hope this helps.
Hi @lsolatorio,
Thank you for your support.
I have some additional information:
- When we got this error in Airflow, the execution on Vertex AI kept running without any problems until it finished, so I suspect a network issue.
- I tried increasing the request timeout, and retrying with a sleep between attempts (1, 2, 3, 4, 5 minutes), but I still have this problem.
MAX_NUM_RETRY = 5
retry = 0
while True:
    try:
        timeout = 600
        execution_status = client.get_execution(request=request_get_execution, timeout=timeout)
        if execution_status.state == Execution.State.SUCCEEDED:
            break
        elif execution_status.state == Execution.State.FAILED:
            raise RuntimeError("Execution failed")
        time.sleep(5)
    except Exception as e:
        retry += 1
        if retry >= MAX_NUM_RETRY:
            raise RuntimeError(f"Execution failed with too many retries: {retry}")
        time.sleep(60*retry)
Do you have any suggestions for what I could try or investigate further?
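Two details in the retry loop above may be worth checking, although neither explains the 500 itself. First, the `RuntimeError("Execution failed")` is raised inside the `try`, so `except Exception` catches it and a genuinely failed execution is retried. Second, `retry` is never reset, so five unrelated transient errors spread over a long run still abort the task. The sketch below separates the two cases. It is only an illustration under stated assumptions: `wait_for_execution`, `get_state`, and `TransientError` are hypothetical names, and states are plain strings rather than `Execution.State` values.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error (e.g. a 500)."""


def wait_for_execution(get_state, poll_interval=60, max_consecutive_errors=5,
                       retryable=(TransientError,), sleep=time.sleep):
    """Poll get_state() until it returns "SUCCEEDED" or "FAILED"."""
    consecutive_errors = 0
    while True:
        try:
            state = get_state()
        except retryable as exc:
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                raise RuntimeError(
                    f"Status check failed {consecutive_errors} times in a row"
                ) from exc
            # Wait longer after each consecutive failed status check.
            sleep(poll_interval * consecutive_errors)
            continue
        consecutive_errors = 0  # a successful poll resets the error count
        if state == "SUCCEEDED":
            return state
        if state == "FAILED":
            # Raised outside the try block, so it is never retried.
            raise RuntimeError("Execution failed")
        sleep(poll_interval)
```

With the real client, `get_state` could be something like `lambda: client.get_execution(request=request_get_execution).state`, compared against `Execution.State` members instead of strings, and `retryable` could be `(google.api_core.exceptions.InternalServerError,)`.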