
Vertex AI tuning job for Gemini 2.0 stuck at running for 16 hours with a small dataset, i.e., 30 entries

@dawnberdan @MJane 
Vertex AI tuning job for Gemini 2.0 stuck at running for 16 hours with a small dataset, i.e., 30 entries. It was working fine before, but now it's stuck. The default quota is 1 job, I believe, and in my case only 1 is running.

I've tried cancelling and restarting, but with no effect.
The dataset is already very simple, with all default parameters.


2) How can I view the logs?

[Screenshot: evostage_1-1750268261361.png]

 

5 REPLIES

I am also having the same issue. It's not clear what is causing it. No logs whatsoever. Strange. I will post if I find any workaround.

Thanks. Let me know when it's resolved for you.

Hi @evo-stage,

Welcome to the Google Cloud Community!

It looks like you are encountering an issue where your Vertex AI Gemini 2.0 tuning job remains indefinitely stuck in the "Validating dataset" stage, even though the dataset is quite small, and the root cause is difficult to pinpoint due to a lack of access to the job logs.

Here are a few approaches that might help with your use case:

  • Access and Analyze Cloud Logs: To find out why your Vertex AI job is stuck, you can use the Logs Explorer in the Google Cloud console. Filter by your job and check for any errors or warnings. These logs will help you spot issues with data validation, formatting, permissions, or timeouts that might be affecting your job (see the query sketch after this list).
  • Check Google Cloud Status & Permissions: You may want to start by checking the Google Cloud Status Dashboard for any service outages in your region. Then, confirm that your Vertex AI service account has the right permissions, specifically the Vertex AI User role and the Storage Object Viewer or Creator role on your dataset's GCS bucket. Lastly, ensure that both your job and your GCS bucket are located in the same region to avoid any mismatches or latency (see the bucket check after this list).
  • Restart with a New Job Name: If your previous restarts didn't resolve the issue, you should cancel the current job and launch a brand new tuning job with the same dataset, but give it a completely different, unique name. This can help clear out any cached or stuck job state that might be lingering (a launch sketch with a timestamped name follows below).
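To give a concrete starting point for the logs, here is a minimal sketch using the Cloud Logging Python client. The project ID and tuning job ID are placeholders, and the filter is an assumption about how the job shows up in the logs (under a resource name containing tuningJobs/<ID>), so treat it as a starting point rather than the exact filter:

```python
from google.cloud import logging  # pip install google-cloud-logging

# Placeholders: replace with your own project and tuning job ID.
PROJECT_ID = "my-project"
TUNING_JOB_ID = "1234567890"

client = logging.Client(project=PROJECT_ID)

# Match any entry that mentions the tuning job, newest first.
# The resourceName clause assumes the job is logged as .../tuningJobs/<ID>.
log_filter = (
    f'protoPayload.resourceName:"tuningJobs/{TUNING_JOB_ID}"'
    f' OR textPayload:"{TUNING_JOB_ID}"'
)

for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=50
):
    print(entry.timestamp, entry.severity, entry.payload)
```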
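For the bucket side, a rough way to double-check both the location and who holds storage roles, using the google-cloud-storage client (the bucket name is a placeholder):

```python
from google.cloud import storage  # pip install google-cloud-storage

BUCKET_NAME = "my-tuning-bucket"  # placeholder

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# The bucket location should be compatible with the region the tuning job runs in
# (e.g. a US multi-region bucket vs. a us-central1 job).
print("Bucket location:", bucket.location)

# List who holds storage roles on this bucket, to confirm the Vertex AI
# service account has objectViewer/objectCreator (or equivalent) access.
policy = bucket.get_iam_policy(requested_policy_version=3)
for binding in policy.bindings:
    if binding["role"].startswith("roles/storage."):
        print(binding["role"], "->", sorted(binding["members"]))
```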
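And for relaunching with a unique name, this is roughly what it looks like with the Vertex AI SDK's supervised tuning helper. The project, bucket path, and the source_model string for Gemini 2.0 are placeholders and assumptions on my side, so please take the exact model ID from the tuning documentation:

```python
import time

import vertexai
from vertexai.tuning import sft  # pip install google-cloud-aiplatform

# Placeholders: replace with your own project, region, model, and dataset.
vertexai.init(project="my-project", location="us-central1")

tuning_job = sft.train(
    source_model="gemini-2.0-flash-001",  # assumed model ID; check the tuning docs
    train_dataset="gs://my-tuning-bucket/train.jsonl",
    tuned_model_display_name=f"my-tune-{int(time.time())}",  # unique per run
)

# Poll until the job leaves the pending/running states.
while not tuning_job.has_ended:
    time.sleep(60)
    tuning_job.refresh()

print("Final state:", tuning_job.state)
```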

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

 

@MarvinLlamas I've checked the status; there are no service outages. There are no issues with the permissions either, as it was working perfectly fine until the 16th of June.
I've restarted it multiple times and tried creating a new job; obviously the name is different, since I add a timestamp.

I've tried to check the cloud logs, but they are too difficult to read and I can't work out where to find errors. In the logs I can't see any errors, and I also couldn't find any filter by job. It's been almost 36 hours now.

Bucket location is: US (multiple regions in the United States), while the tuning job runs in US Central.

@MarvinLlamas I've also checked the logs for my tuning job ID; there's only one log entry, from when the tuning job started, and its state is pending.