
gcloud CLI sometimes unavailable on Compute Engine VM during batch job

I've created an image of a disk that I set up manually. It's an Ubuntu 20.04 install with gcloud working, and I have a batch job that uses gcloud. 95% of the time it works well, but 5% of the time I get a "gcloud not found" error when running the batch job. It's very strange that gcloud is usually there but occasionally unavailable. Any ideas why this would occur?

1 Reply

Without more details, I don't think we can give you a specific answer. However, when I hear stories like "it works 95% of the time", that usually leads me to think of two possibilities.

1. A race condition. Without knowing the details of how your batch job is running, it could be that the image is doing some initialization at boot (for example, upgrading or installing the Google Cloud SDK, which includes gcloud) and that the actual execution of the batch logic is racing against it. 95% of the time the initialization completes before the batch job starts; 5% of the time it doesn't. What we would want to look at is the timing of when gcloud is used relative to any parallel updates that affect the availability of gcloud itself.

2. Cached versions. If you have been changing image versions around, there may be some image versions that have gcloud installed and some that don't. If the job runs on Compute Engine VMs that cache pulled images rather than getting the latest image each time, we could non-deterministically be using an older image instead of a newer one.
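
To help tell these two cases apart, here is a minimal Python sketch that could run at the start of the batch job. It is only an illustration under some assumptions: the function names `wait_for_executable` and `boot_image` and the timeout values are hypothetical, and it assumes the job can run Python 3 and reach the Compute Engine metadata server. The first function polls PATH for gcloud instead of failing immediately; the second asks the metadata server which image the VM booted from.

```python
import shutil
import time
import urllib.request

# Metadata server path that reports the image the VM was booted from.
IMAGE_URL = "http://metadata.google.internal/computeMetadata/v1/instance/image"


def wait_for_executable(name, timeout_s=120.0, poll_s=2.0,
                        which=shutil.which, sleep=time.sleep):
    """Return the path of `name` once it appears on PATH, or None on timeout.

    `which` and `sleep` are injectable so the loop can be tested offline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        path = which(name)
        if path is not None:
            return path
        if time.monotonic() >= deadline:
            return None
        sleep(poll_s)


def boot_image(timeout_s=2.0):
    """Return the boot image reported by the metadata server, or None if unreachable."""
    request = urllib.request.Request(
        IMAGE_URL, headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        # Covers connection errors, HTTP errors, and timeouts.
        return None
```

Logging `boot_image()` and the result of `wait_for_executable("gcloud")` on every run should narrow things down. If the failing runs report a different image than the successful ones, that points toward possibility 2. If gcloud shows up only after some waiting, that points toward possibility 1.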