I have a Python script (some of the values below are changed). It works in a Vertex AI Workbench, and it also works as a Docker container inside the Workbench. I am now trying to run it as a Vertex AI Training custom job, and that is where I am hitting strange hanging issues.
My custom job:
from google.cloud import aiplatform

aiplatform.init(location="")
job = aiplatform.CustomContainerTrainingJob(
    display_name="name",
    container_uri="location-docker.pkg.dev/project/registry/docker_container",
    staging_bucket="myBucket",
)
job.run(
    replica_count=1,
    machine_type="e2-standard-4",
    enable_web_access=True,
    timeout=900,
    args=[],
)
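For completeness, job.run() also accepts service_account and network arguments (the values below are placeholders, not from my setup); the service account is the identity the container runs as, and the network is the VPC the job is peered to, which can matter for reaching other Google APIs:

job.run(
    replica_count=1,
    machine_type="e2-standard-4",
    enable_web_access=True,
    timeout=900,
    args=[],
    service_account="my-sa@project.iam.gserviceaccount.com",  # placeholder
    network="projects/123456789/global/networks/my-vpc",  # placeholder peered VPC, if any
)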
I SSHed into the worker and tried to run the script from the CLI to get more detailed logs. I get CUDA warnings (normal, I have no GPUs) and then nothing (Logs Explorer shows the same). If I kill the process I get this error:
^CTraceback (most recent call last):
File "/app/main.py", line 175, in <module>
main()
File "/app/main.py", line 126, in main
BQApi.log_start()
File "/app/bqApi.py", line 78, in log_start
self.Client.query(sqlLogStart, project=self.project, job_config= jc)
File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/client.py", line 3502, in query
return _job_helpers.query_jobs_insert(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/_job_helpers.py", line 159, in query_jobs_insert
future = do_query()
^^^^^^^^^^
File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/_job_helpers.py", line 136, in do_query
query_job._begin(retry=retry, timeout=timeout)
...
File "/app/venv/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/app/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 773, in urlopen
self._prepare_proxy(conn)
File "/app/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1042, in _prepare_proxy
conn.connect()
File "/app/venv/lib/python3.12/site-packages/urllib3/connection.py", line 704, in connect
self.sock = sock = self._new_conn()
In bqApi I define self.Client like this:
from google.cloud import bigquery
from google import auth

# somewhere in __init__
self.project = "project"
credentials, project = auth.default()
self.Client = bigquery.Client(project=self.project, credentials=credentials, location="location")
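As a side note, one way to check which credentials auth.default() resolves to inside the container, and to force a token fetch that fails fast if the metadata server or the outbound network path is broken (this is just a diagnostic sketch, not part of the original script):

from google import auth
from google.auth.transport.requests import Request

credentials, project = auth.default()
credentials.refresh(Request())  # fetches a token now; raises quickly if the metadata server / network is unreachable
print(type(credentials).__name__, getattr(credentials, "service_account_email", "n/a"), project)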
The error seems to be happening here:
jc = next(self.create_bq_job_config())
self.Client.query(sqlLogStart, project=self.project, job_config=jc)
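As an aside (an assumption on my part, not something the original script does): Client.query() accepts timeout and retry arguments, so a variant like the one below should raise instead of hanging forever, which at least surfaces the underlying connection error:

from google.api_core.retry import Retry

jc = next(self.create_bq_job_config())
query_job = self.Client.query(
    sqlLogStart,
    project=self.project,
    job_config=jc,
    timeout=30,  # per-request transport timeout in seconds
    retry=Retry(timeout=120.0),  # stop retrying after ~2 minutes instead of the library default
)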
create_bq_job_config() is doing this:
while True:
    yield bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_time", "DATETIME", self.startTime)
        ]
    )
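(Not related to the hang, just a side note: since the generator always yields the same kind of config, an equivalent plain method would be, assuming the same self.startTime attribute:)

def create_bq_job_config(self):
    # build a fresh QueryJobConfig carrying the start_time parameter
    return bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_time", "DATETIME", self.startTime)
        ]
    )

The call site would then be jc = self.create_bq_job_config() instead of using next().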
I think the training custom job is unable to reach BigQuery for some reason. Any ideas what it could be and how I could fix it?
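For reference, here is a quick probe (not part of the original script) that could be run inside the container to check whether the BigQuery endpoint is reachable at all, and whether a proxy is being picked up from the environment, since the traceback goes through urllib3's _prepare_proxy:

import os
import requests

# show any proxy settings the HTTP stack will pick up
for var in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy", "NO_PROXY", "no_proxy"):
    print(var, "=", os.environ.get(var))

# any HTTP status (even 404) proves the endpoint is reachable; a hang or timeout means it is not
try:
    resp = requests.get("https://bigquery.googleapis.com", timeout=10)
    print("bigquery.googleapis.com reachable, status:", resp.status_code)
except Exception as exc:
    print("bigquery.googleapis.com NOT reachable:", exc)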
Hi @VAC,
Welcome to Google Cloud Community!
It looks like you're dealing with an issue where your Vertex AI Custom Training Job hangs indefinitely when it tries to execute a BigQuery query. Since the same code works fine in other environments, and the traceback stops inside urllib3's _prepare_proxy / conn.connect(), the problem most likely lies in the Custom Job's network configuration rather than in the code itself.
Here are the potential ways that might help with your use case:
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.