I just got two errors like this in a Python worker process that uses the `google.cloud.storage.transfer_manager.upload_many` function to upload intermediate data as it runs. The worker is a long-lived batch process running in a Docker container on a c2 instance via Google Batch -- so it's using the latest Container-Optimized OS image.
google.auth.exceptions.TransportError: ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/pr-0-zephyrus-sa@zephr-... from the Google Compute Engine metadata service. Status: 429 Response:\nb\'"Too many requests."\'', <google.auth.transport.requests._Response object at 0x79ff3da7bb30>)
I had maybe 200-ish similar instances running at the time we hit the failure. I can't for the life of me figure out what the quotas are for the metadata server. Do I need to adjust them (we just significantly increased the number of concurrent workers allocated to this process), or is there something crazy/unexpected happening inside the worker causing way more instance metadata requests than needed/expected?
I've been googling for a while and can't find any leads to track this down.
Hi, I'm facing a similar issue. Can anyone help?
Same here. We started receiving this error about a week ago. No problems before that for at least two years, and nothing was changed on the infra side recently.
Hi, we also started observing the same error recently.
Same thing started on our GKE clusters a week ago. Anyone have any idea why this is happening?
My theory is that we are hitting one of the mutative API limits related to token refresh -- afaict these are quite low and can't be adjusted -- and the metadata endpoint hits the limit internally and passes back a 429 without any information about the actual quota being violated.
In my case I additionally suspect that the `upload_many` function, which spins up n worker threads to do concurrent uploading, is probably refreshing the token in every worker thread (in thread-local state?) -- which 10x's the maximum rate of token refreshes vs. what the workload actually requires.
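To illustrate the theory above with a minimal, self-contained sketch (this is not the real google-auth or transfer_manager code -- `fetch_token` is a hypothetical stand-in for a metadata-server token request): if each worker thread caches credentials in thread-local state, n threads produce n refreshes, whereas a single cache shared behind a lock collapses that to one.

```python
import threading

refresh_count = 0
count_lock = threading.Lock()

def fetch_token():
    """Hypothetical stand-in for a metadata-server token request."""
    global refresh_count
    with count_lock:
        refresh_count += 1
    return "token"

# Pattern 1: thread-local cache -- every worker thread refreshes once.
local = threading.local()

def worker_thread_local():
    if not hasattr(local, "token"):
        local.token = fetch_token()

# Pattern 2: one cache shared by all threads -- a single refresh total.
class SharedTokenCache:
    def __init__(self):
        self._token = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._token is None:
                self._token = fetch_token()
            return self._token

def run(target, n=10):
    """Run n worker threads and return how many refreshes they triggered."""
    global refresh_count
    refresh_count = 0
    threads = [threading.Thread(target=target) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return refresh_count

shared = SharedTokenCache()
n_local = run(worker_thread_local)  # one refresh per thread
n_shared = run(shared.get)          # one refresh shared by all threads
```

If the real upload path follows pattern 1, the metadata request rate scales with thread count times instance count, which would line up with the 429s appearing only after the fleet size was increased.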
This happens to me inside a Dataflow worker. I'm really not sure what's going on -- what could the issue be? I've added a retry in the code, but why is it happening? The authorization mechanism is still very confusing to me.
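For anyone adding a retry like the post above mentions, a common pattern is exponential backoff with jitter around the transient 429. This is a generic stdlib sketch, not Dataflow- or google-auth-specific: `TransientError` stands in for `google.auth.exceptions.TransportError` from the traceback earlier in the thread.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for google.auth.exceptions.TransportError (429 responses)."""

def with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)

# Demo: a call that 429s twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("429 Too many requests")
    return "ok"

result = with_backoff(flaky_upload, sleep=lambda _: None)  # skip real sleeps in the demo
```

Backoff smooths over the transient 429s, but if the root cause is per-thread token refresh it only treats the symptom; reducing the refresh rate is the real fix.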