Hey,
We've noticed a weird behavior with our custom trained models deployed on private endpoints. In some cases the number of replicas goes to zero and it brakes our processing pipeline. We couldn't trace back the issue. In the logs it just shows that the workers are exiting.
After around 5 minutes, the replica was started again and continued working as it should.
I am attaching two screenshots from the logs and from cloud monitoring.
Any idea what would cause this issue ?
Thanks!
Possibly a resource concern, I would recommend monitoring the job for more descriptive error. As workers exiting is still a general error.
User | Count |
---|---|
2 | |
1 | |
1 | |
1 | |
1 |