Running batch job on LLM

I tried to follow the guide for getting batch text predictions. The job ran successfully, but some rows in the result have a status of RESOURCE_EXHAUSTED and no prediction. Is there any setting that will automatically retry the rows that failed?

I am using the Python Vertex AI SDK.
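
At the moment I'm considering handling it manually: read the output JSONL back, collect the instances whose status mentions RESOURCE_EXHAUSTED, and resubmit them as a new job. I'd rather not if there's a built-in setting. Rough sketch of what I mean - the project, bucket names and prefixes are placeholders, and the status field layout is just what I see in my own output files:

import json

import vertexai
from google.cloud import storage
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")  # placeholders

client = storage.Client()
bucket = client.bucket("my-output-bucket")  # placeholder bucket

# Collect the original instances of every row that failed with RESOURCE_EXHAUSTED.
failed_instances = []
for blob in bucket.list_blobs(prefix="batch-output/"):  # placeholder prefix
    if not blob.name.endswith(".jsonl"):
        continue
    for line in blob.download_as_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # Failed rows carry the error text in "status" and no prediction.
        if "RESOURCE_EXHAUSTED" in str(row.get("status", "")):
            failed_instances.append(row["instance"])

# Write the failed prompts back out as a new JSONL input file.
retry_blob = bucket.blob("batch-retry/input.jsonl")
retry_blob.upload_from_string(
    "\n".join(json.dumps(inst) for inst in failed_instances)
)

# Resubmit just the failed rows as a fresh batch job.
model = TextGenerationModel.from_pretrained("text-bison")
retry_job = model.batch_predict(
    dataset=["gs://my-output-bucket/batch-retry/input.jsonl"],
    destination_uri_prefix="gs://my-output-bucket/batch-retry-output",
)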

I am having the same problem. 

I am submitting prompts in JSONL format from files saved in a Cloud Storage bucket. I have tried varying the number of files submitted in a request and also the number of prompts in each file. I had better success when each file had 100 prompts and they were submitted as separate batch requests. This would be fine, but I can't work out how to create the batch requests asynchronously, so it's not really any better than running the process without batching - I still need a Python process running until all the request batches are complete.

When I submit all the prompts in a single file as one batched request, more than 50% of the rows come back with RESOURCE_EXHAUSTED errors in the status.
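
I haven't found a documented way to make the request asynchronous either. The workaround I'm experimenting with is pushing each batch_predict call onto a thread pool, so the per-file jobs at least get created and run in parallel from a single script. This is only a sketch - the bucket paths and parameters are placeholders, and I haven't confirmed whether batch_predict blocks until the job finishes in every SDK version:

from concurrent.futures import ThreadPoolExecutor

from vertexai.language_models import TextGenerationModel

model = TextGenerationModel.from_pretrained("text-bison")

# Placeholder list of per-file input URIs, e.g. 100 prompts per file.
prompt_files_list = [
    "gs://my-bucket/batch-input/prompts-000.jsonl",
    "gs://my-bucket/batch-input/prompts-001.jsonl",
]

def submit(dataset_uri):
    # Each call creates its own batch prediction job for one input file.
    return model.batch_predict(
        dataset=[dataset_uri],
        destination_uri_prefix="gs://my-bucket/batch-output",
        model_parameters={"maxOutputTokens": "512", "temperature": "0.2"},  # placeholders
    )

# Run the submissions in parallel threads so one blocking call
# doesn't serialize the whole set of files.
with ThreadPoolExecutor(max_workers=4) as pool:
    jobs = list(pool.map(submit, prompt_files_list))

for job in jobs:
    print(job.resource_name, job.state)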

Hey,

I'm having exactly the same issue...

However, the documentation states this: "Batch requests for text models only accept BigQuery storage sources and Google Cloud Storage. Requests can include up to 30,000 prompts."

It's apparently wrong.

Update: my current trial seems to be working without any RESOURCE EXHAUSTED errors.

It consists of small batches of 50 prompts (worth noting that the prompts are long, usually around 5,000 characters). Create a separate JSONL file for each batch of 50, and then make a separate model.batch_predict() call for each file, i.e. something like:

for dataset in prompt_files_list:
    batch_prediction_job = model.batch_predict(
        dataset=[dataset],
        destination_uri_prefix=destination,
        model_parameters=params,
    )

So far this approach has produced 1,500 responses without any errors, whereas the previous attempt (putting all the prompts in one large file) resulted in 2,000 failures out of 3,500.
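
In case it's useful, this is roughly how I'm producing the per-batch files in the first place: split the prompt list into chunks of 50, write each chunk as a JSONL file, and upload it to the bucket. Just a sketch - the bucket and file names are placeholders, and it assumes prompts is the full list of prompt strings:

import json

from google.cloud import storage

BATCH_SIZE = 50
bucket = storage.Client().bucket("my-bucket")  # placeholder bucket

prompt_files_list = []
for i in range(0, len(prompts), BATCH_SIZE):
    chunk = prompts[i:i + BATCH_SIZE]
    blob_name = f"batch-input/prompts-{i // BATCH_SIZE:03d}.jsonl"
    # One JSON object per line, in the format the batch job expects.
    lines = "\n".join(json.dumps({"prompt": p}) for p in chunk)
    bucket.blob(blob_name).upload_from_string(lines)
    prompt_files_list.append(f"gs://my-bucket/{blob_name}")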

Thank you for your method, but it does not seem to be how a batch job is intended to be used. I hope someone from Google can answer the question.

I think your approach works because it does not exceed the 60 requests per minute quota.

Do I need to adjust the ‘text bison batch’ quota?

Agreed. We might need to adjust the batch quota, but mine is currently set to zero, so presumably the service shouldn't work at all. None of the other quotas seem to be relevant.

The only quota I can see for batch prediction with text-bison in Google Cloud is "Concurrent large language model batch prediction jobs running on text-bison model per region". Raising its default value from 4 to a higher number and submitting multiple smaller chunked requests for batch prediction might solve part of the problem.

I still find it confusing, since per the Google documentation the default limit is fewer than 30,000 prompts per job, and there doesn't seem to be a way to adjust the machine configuration to get higher memory capacity for the request. Does anyone have a better workaround for the RESOURCE_EXHAUSTED error?