
LLAMA 2 in Vertex AI not working

I deployed Llama 2 13B and 70B in Vertex AI through the Model Garden. Deployment was successful, but when I hit the endpoint through curl I keep getting the error below. Has anyone tried Llama 2 in Vertex AI?

{
  "error": {
    "code": 503,
    "message": "Took too long to respond when processing endpoint_id: {endpoint_id}, deployed_model_id: {deployed_model_id}",
    "status": "UNAVAILABLE"
  }
}
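For reference, the same call through the Python SDK (roughly equivalent to the curl request) looks like the sketch below. The resource names and the instance schema are assumptions; adjust them to whatever your serving container expects. Note the 503 above is a server-side timeout, so raising the client-side deadline only helps if the model eventually responds.

from google.cloud import aiplatform

# Placeholders: set your own project, region, and endpoint ID.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/ENDPOINT_ID"
)

# Assumed instance schema for the Model Garden Llama 2 serving container.
response = endpoint.predict(
    instances=[{"prompt": "What is Vertex AI?", "max_tokens": 128, "temperature": 0.7}],
    timeout=600,  # client-side deadline in seconds (supported in recent SDK versions)
)
print(response.predictions)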

Same here. For batch predictions using the provided Colab, the Docker image does not accept the recommended accelerator,

and from the endpoint, the logs show this error (in addition to the timeout):

ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.
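For context, this ValueError is raised by Hugging Face accelerate when device_map offloads weights to disk without an offload_folder. If you are loading the weights yourself rather than through the prebuilt serving container, a minimal sketch of the fix looks like this (model ID and folder are placeholders):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model ID
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="./offload",  # gives accelerate somewhere to spill weights that don't fit in memory
)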

I had it working for batch predictions using a region with the recommended GPU accelerators, so it does seem to be a matter of GPU availability in the region.

From here: https://cloud.google.com/vertex-ai/docs/general/locations?hl=fr#region_considerations

I could find a region with a V100 GPU available for predictions for llama2-7b (the ones without the *).
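If you want to check programmatically which accelerators a zone actually exposes, here is a hedged sketch using the Compute Engine API (google-cloud-compute); the project and zone are placeholders:

from google.cloud import compute_v1

client = compute_v1.AcceleratorTypesClient()
# Lists the accelerator types (e.g. nvidia-tesla-v100) available in one zone.
for accel in client.list(project="my-project", zone="us-central1-a"):
    print(accel.name, "-", accel.description)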

It is possible this is a concern with the resources that Vertex AI uses in your project, or your project is hitting quota (you can check these in Logging and on your quota page, respectively). I would recommend contacting Google support to investigate further: https://cloud.google.com/contact
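If you want to look at the prediction container's own logs before opening a case, a sketch along these lines may help; the resource type and labels in the filter are assumptions, so check Logs Explorer for the exact values in your project:

from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project
log_filter = (
    'resource.type="aiplatform.googleapis.com/Endpoint" '
    'AND resource.labels.endpoint_id="ENDPOINT_ID"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=50):
    print(entry.timestamp, entry.payload)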

@nceniza The problem is that Google support can't be contacted without a paid support plan.

Receiving the same error. 

---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/google/api_core/grpc_helpers.py:65, in _wrap_unary_errors.<locals>.error_remapped_callable(*args, **kwargs)
     64 try:
---> 65     return callable_(*args, **kwargs)
     66 except grpc.RpcError as exc:

File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:946, in _UnaryUnaryMultiCallable.__call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    944 state, call, = self._blocking(request, timeout, metadata, credentials,
    945                               wait_for_ready, compression)
--> 946 return _end_unary_response_blocking(state, call, False, None)

File /opt/conda/lib/python3.10/site-packages/grpc/_channel.py:849, in _end_unary_response_blocking(state, call, with_call, deadline)
    848 else:
--> 849     raise _InactiveRpcError(state)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Took too long to respond when processing endpoint_id: {endpoint_id}, deployed_model_id: {deployed_model_id}"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:{ipv4}:443 {created_time:"2023-09-07T12:24:43.661919946+00:00", grpc_status:14, grpc_message:"Took too long to respond when processing endpoint_id: {endpoint_id}, deployed_model_id: {deployed_model_id}}"

Seems like Google has released it without testing. Unfortunately, no one from the Google team is helping with this.

I'm having the same issue here. Fortunately, one of the endpoints is working with the configuration below; none of the others with different machine types or accelerators worked. Check whether the same works for you.

Working config (an SDK sketch of the same settings follows the list):

Region : us-central1 (Iowa)

Access : standard

Model : llama2-7b-chat

machine : n1-standard-4

Accelerator : NVIDIA_TESLA_T4

Accelerator count : 1
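A hedged Python SDK equivalent of that configuration, assuming the Llama 2 model has already been uploaded to the Model Registry by the Model Garden flow (the model resource name is a placeholder):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/MODEL_ID")

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)
print(endpoint.resource_name)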

@Achila thanks, your config worked for 7B, but the API is very slow. I am also still looking for a solution for 70B. Did you have any success with it?

Nope, still trying :(. I found out there were multiple downtimes in GCP recently as well. I think Model Garden isn't stable yet.

Has anyone got the right configuration for deploying LLAMA2 13B?

The following configuration no longer works for deploying LLAMA2 7B (Chat).

Region : us-central1 (Iowa)

Access : standard

Model : llama2-7b-chat

machine : n1-standard-4

Accelerator : NVIDIA_TESLA_T4

Accelerator count : 1


Error: ValueError: Too large swap space. 16.00 GiB out of the 14.65 GiB total CPU memory is allocated for the swap space.

I'm now trying to deploy LLaMa 2 7B with the configuration @Achila suggested, but it doesn't work. Was anybody able to fix it?

It's not possible to use these configs when "one-click" deploying llama2-7b because of the swap space required: n1-standard-4 has only 15 GB of RAM (about 14.65 GiB visible to the container, per the error above), while the serving container tries to allocate 16 GiB of swap. You can use n1-standard-8 instead, which has 30 GB of memory, though it will be more expensive.
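The same kind of deploy call as in the earlier sketch, only with the larger machine type this reply suggests (resource names are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/MODEL_ID"
).deploy(
    machine_type="n1-standard-8",  # 30 GB RAM, enough headroom for the 16 GiB swap allocation
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
print(endpoint.resource_name)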