
AutoML Translation models response time

Hi,

We have several AutoML Translation models, and we are facing timeout issues when the first translation requests are sent; we have to retry a second time to get the translations back. After this first request, the model seems to be kept "online", and subsequent requests to the same model perform well.

What we don't know is how long the models are kept online and ready for quick response times, or how many models can be online simultaneously. We would like more information about this so we can handle translation requests in a proper, controlled way.

Thank you,

Julian

 


Normally, when a custom model is used, it is loaded onto the chip. If other custom models are used more frequently, the least frequently used models are evicted from the chip, and the next time an evicted model is called it has to be reloaded, which takes around 15 seconds. So what happened with your batch translation is: the request used a custom model that was being loaded onto the chip for the first time, and since the load takes about 15 seconds, the output contained empty sentences. On the second try, the model was already loaded, so there was no 15-second wait and the output contained all the translated sentences.

Any inconsistency is likely related to the replicas: if multiple replicas load the model at the same time, one replica may finish loading before the others and successfully serve the first request, while the second request is routed to a replica that is still loading.
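Given the roughly 15-second cold start, one practical workaround on the client side is to retry a request whose response comes back empty (or times out), with a short backoff. Below is a minimal, hedged sketch of such a retry helper; the function name `call_with_retry` and its parameters are our own, not part of any Google Cloud library, and the commented usage with the Cloud Translation v3 client is an assumption about how you issue requests:

```python
import time


def call_with_retry(fn, attempts=3, initial_delay=2.0, backoff=2.0):
    """Call fn() repeatedly until it returns a non-empty result.

    Retries when fn() raises (e.g. a deadline-exceeded error) or returns
    an empty/falsy result, which can happen while a custom model is still
    being loaded onto the chip (~15 s cold start). Waits initial_delay
    seconds before the first retry, multiplying by backoff each time.
    """
    delay = initial_delay
    last_exc = None
    for attempt in range(attempts):
        try:
            result = fn()
            if result:  # empty output suggests the model was not loaded yet
                return result
        except Exception as exc:  # pragma: no cover - surfaced after retries
            last_exc = exc
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= backoff
    if last_exc is not None:
        raise last_exc
    raise RuntimeError("no non-empty result after %d attempts" % attempts)


# Hypothetical usage with the Cloud Translation v3 client (not run here):
# from google.cloud import translate_v3 as translate
# client = translate.TranslationServiceClient()
# translations = call_with_retry(lambda: client.translate_text(
#     parent="projects/<project>/locations/us-central1",
#     contents=["Hello"],
#     model="projects/<project>/locations/us-central1/models/<model-id>",
#     source_language_code="en",
#     target_language_code="fr",
# ).translations)
```

This only masks the cold start from callers; it does not change how long the service keeps a model loaded or how many models stay resident at once.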

Can we get more information on how long the models are kept online and ready for quick response times, and how many models can be online simultaneously?