
Vertex AI custom training job unable to query BigQuery

VAC
Bronze 1

I have a Python script (some of the values below are changed). It works in a Vertex AI Workbench instance, and it also works there as a Docker container. I am trying to run it as a Vertex AI custom training job, and that is where it hangs.

My custom job:

 

from google.cloud import aiplatform

aiplatform.init(location="")

job = aiplatform.CustomContainerTrainingJob(
    display_name="name",
    container_uri="location-docker.pkg.dev/project/registry/docker_container",
    staging_bucket="myBucket",
)

job.run(
    replica_count=1,
    machine_type="e2-standard-4",
    enable_web_access=True,
    timeout=900,
    args=[],
)

 

I SSHed into the worker and ran the script from the CLI to get more detailed logs. I get CUDA warnings (expected, since I have no GPUs) and then nothing; Logs Explorer shows the same. If I interrupt the process, I get this traceback:

 

^CTraceback (most recent call last):
  File "/app/main.py", line 175, in <module>
    main()
  File "/app/main.py", line 126, in main
    BQApi.log_start()
  File "/app/bqApi.py", line 78, in log_start
    self.Client.query(sqlLogStart, project=self.project, job_config= jc)
  File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/client.py", line 3502, in query
    return _job_helpers.query_jobs_insert(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/_job_helpers.py", line 159, in query_jobs_insert
    future = do_query()
             ^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/google/cloud/bigquery/_job_helpers.py", line 136, in do_query
    query_job._begin(retry=retry, timeout=timeout)
...
  File "/app/venv/lib/python3.12/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/app/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 773, in urlopen
    self._prepare_proxy(conn)
  File "/app/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1042, in _prepare_proxy
    conn.connect()
  File "/app/venv/lib/python3.12/site-packages/urllib3/connection.py", line 704, in connect
    self.sock = sock = self._new_conn()

 

In bqApi I define self.Client like this:

 

from google.cloud import bigquery
from google import auth

# somewhere in __init__
self.project = "project"
credentials, project = auth.default()
self.Client = bigquery.Client(project=self.project, credentials=credentials, location="location")

 

The hang seems to happen here:

 

jc = next(self.create_bq_job_config())
self.Client.query(sqlLogStart, project=self.project, job_config= jc)

 

create_bq_job_config() does this:

 

while True:
    yield bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_time", "DATETIME", self.startTime)
        ]
    )

 

 

I think the custom training job is unable to reach BigQuery for some reason. Any ideas what the cause could be and how I could fix it?

1 REPLY

Hi @VAC,

Welcome to Google Cloud Community!

It looks like your Vertex AI custom training job hangs indefinitely when it tries to execute a BigQuery query. Since the same code works in Workbench, both as a script and as a container, the problem most likely lies in the custom job's configuration or runtime environment rather than in the script itself.
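
One detail worth checking first: the traceback in the question stops inside urllib3's _prepare_proxy, a code path urllib3 takes only when a proxy is configured for the connection, so the client may be waiting on a proxy it cannot reach from the worker. The sketch below is a rough diagnostic under that assumption, not a confirmed diagnosis. The helper names proxy_settings and can_connect are made up for illustration; the code uses only the Python standard library.

```python
import os
import socket

# Environment variables that requests/urllib3 consult when choosing a proxy.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY", "NO_PROXY")


def proxy_settings(environ=None):
    """Return any non-empty proxy-related variables found in the environment."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        # Check both the upper-case and lower-case spelling of each variable.
        for key in (name, name.lower()):
            if environ.get(key):
                found[key] = environ[key]
    return found


def can_connect(host, port, timeout=5.0):
    """Try a plain TCP connection and report whether it succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

From a shell on the worker, printing proxy_settings() shows whether the container inherited a proxy, and can_connect("bigquery.googleapis.com", 443) shows whether a direct connection is possible. If a proxy variable is set in the image but that proxy is not reachable from the training worker, unsetting it for the job (or adding the Google API hosts to NO_PROXY) may be enough.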

Here are some things to check:

  • Verify Service Account Permissions: Ensure that the service account your job runs as has the necessary BigQuery permissions. If you don't explicitly specify one when you launch the custom training job, Vertex AI runs it as a Google-managed service agent (the Vertex AI Custom Code Service Agent) rather than a service account you manage, and that identity might lack access to your BigQuery datasets. This is a common cause of BigQuery failures in custom jobs.
  • Network Configuration: If your training job and your BigQuery datasets/tables sit behind a Virtual Private Cloud (VPC), enable Private Google Access on the subnet where the training job runs. This lets the job reach Google APIs, including BigQuery, without using the public internet.
  • Check Logs: Examine the logs for your training job in the Google Cloud console. Look for authentication errors, permission-denied messages, or network-related errors, filtering by your job's name and time frame. If Logs Explorer isn't showing enough detail, raise your script's logging level and add print statements, which will show up in Cloud Logging.
  • Double-Check google-auth: Ensure google-auth is installed in your Docker image, for example with pip install google-auth in the Dockerfile. Your code needs it to authenticate with Google Cloud services as the service account assigned to your custom training job.
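
For the service account check, it can help to confirm which identity the container actually runs as before adjusting IAM roles. The sketch below asks the metadata server for the attached service account's email using only the standard library. It assumes the metadata server is reachable from inside the training container, which is how default credentials are normally resolved on Google Cloud but is not something the question confirms. The function names are hypothetical.

```python
import urllib.request

# Metadata-server endpoint that returns the attached service account's email.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request(url=METADATA_URL):
    """Build the request; the metadata server rejects calls without this header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def service_account_email(timeout=5.0):
    """Return the email the job runs as, or None if it cannot be fetched."""
    # An empty ProxyHandler keeps this call from going through any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(build_metadata_request(), timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None
```

Once service_account_email() reports the identity, you can grant that account the BigQuery roles the script needs, or run the job under a service account of your own by passing the service_account argument to job.run().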

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.