Vertex pipeline model training component stuck run...

jan-zajac · 09-09-2022 02:25 AM

'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.

The error appears in a gray box in the pipeline UI when I click on the model training component and reads the following:

Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.

I've looked into the log explorer and found that the error logs are audit logs have the following associated tags with them:

protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"

protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default

Leading me to think that there's an issue with the Vertex Metadatastore or the way my pipeline is using it. The audit logs are automatic though, so I'm not sure.

I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that worked before in a different project as well but with no luck.

Vertex pipeline model training component stuck running forever because of metadata issue