Hello!
I'm using a Vertex AI online prediction endpoint, and I get a CUDA out of memory error when several requests arrive at the same time.
For instance, with 2 replicas, if I make 3 requests with big enough datasets, I get a CUDA out of memory error. This never happens if I send one request at a time with the exact same datasets. My impression is that two of these requests end up on the same GPU, which doesn't have enough memory to handle both at once.
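To make the setup concrete, here is roughly how I fire the requests. This is only a minimal sketch: the project, region, endpoint ID, and instance payloads are placeholders, not my actual values.

```python
# Minimal sketch of how the concurrent requests are sent; project, region,
# endpoint ID, and instance payloads are placeholders for my real setup.
from concurrent.futures import ThreadPoolExecutor

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID

# In reality each batch is large enough that one of them alone fits on a
# single GPU, but two of them processed together exceed its memory.
batches = [[{"data": [0.0] * 1024}] for _ in range(3)]

def predict(batch):
    return endpoint.predict(instances=batch)

# Firing the three requests at roughly the same time triggers the CUDA OOM;
# sending the exact same batches one after the other works fine.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(predict, batches))
```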
I was under the impression that Vertex came with a sort of queue, and that requests were handled one after the other on each replica if there were more requests than replicas. Is that not the case, or is this a bug? There also doesn't seem to be an issue if I send 1000 smaller requests, so it does seem that the requests are not all sent to the GPU together. But maybe it can happen that two of them are?
Thanks in advance for your help!