When sending prediction requests to a Vertex AI endpoint using the Mistral model, I encounter an InternalServerError with details hinting at resource constraints or async execution issues, specifically mentioning OutOfMemoryError and errors in async_llm_engine.py, model_runner.py, and mixtral.py.
The model should process the provided prompt and return a generated response without internal server errors.
Requests to the model via the Vertex AI endpoint result in an _InactiveRpcError and a 500 Internal Server Error, indicating potential memory allocation or async execution failures.
In fact, for "small" requests of under roughly 100 tokens the model responds correctly, but above about 100 tokens this error occurs.
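For context, a minimal sketch of how the requests are sent. The project, region, and endpoint IDs are placeholders, and `build_prompt` only approximates token count by word count; the payload shape is the vLLM-style one used by the Vertex AI Model Garden notebooks:

```python
# Hypothetical reproduction sketch; IDs below are placeholders, not real values.

def build_prompt(n_words: int) -> str:
    # One repeated word is a rough stand-in for one token.
    return " ".join(["hello"] * n_words)

def send(endpoint, n_words: int):
    # vLLM-style prediction payload used by the Model Garden notebooks.
    instances = [{"prompt": build_prompt(n_words), "max_tokens": 128}]
    return endpoint.predict(instances=instances)

if __name__ == "__main__":
    from google.cloud import aiplatform  # pip install google-cloud-aiplatform
    aiplatform.init(project="my-project", location="us-central1")
    endpoint = aiplatform.Endpoint("ENDPOINT_ID")
    send(endpoint, 50)    # ~50-token prompt: returns a generation
    send(endpoint, 200)   # ~200-token prompt: fails with 500 / _InactiveRpcError
```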
The error logs indicate issues such as OutOfMemoryError and failures in async execution paths within the model's implementation.
```
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
Above that entry, the logs show this message:
"message": "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 5 has a total capacty of 21.96 GiB of which 14.88 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.00 GiB is allocated by PyTorch, and 115.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Maybe it's a memory issue, but I am following the exact notebook specification, with the config below and a context window of max_model_len = 4096 as specified:
```python
# Sets 8 L4s to deploy Mixtral 8x7B.
machine_type = "g2-standard-96"
accelerator_type = "NVIDIA_L4"
accelerator_count = 8
```
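If the notebook exposes the vLLM serving arguments, one possible experiment is to leave more GPU headroom for the KV cache. The flag names below are standard vLLM CLI options; the 0.85 value is a guess to try, not a value from the notebook:

```python
# Sketch of vLLM serving args that trade capacity for memory headroom.
vllm_args = [
    "--tensor-parallel-size=8",       # matches accelerator_count = 8
    "--max-model-len=4096",           # context window from the notebook
    "--gpu-memory-utilization=0.85",  # below the 0.90 default, leaves headroom
]
```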
This issue appears to be related to how the model processes large inputs or manages resources during execution.
GitHub issue: #2715