I was looking into the code
from google.cloud import aiplatform

# Set the Docker image and quantization method for AWQ-quantized models
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20231127_0916_RC00"
quantized_model_id = "TheBloke/Llama-2-70B-chat-AWQ"
quantization_method = "awq"
machine_type = "g2-standard-24"
accelerator_type = "NVIDIA_TESLA_L4"
accelerator_count = 2
# Fill with the created service account.
service_account = ""
endpoint = aiplatform.Endpoint.create(display_name="llama2-quantized-endpoint")
# Arguments passed to the vLLM API server inside the container
vllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    f"--model={quantized_model_id}",
    f"--tensor-parallel-size={accelerator_count}",
    "--swap-space=16",
    "--gpu-memory-utilization=0.9",
    "--disable-log-stats",
    "--max-model-len=4000",
    f"--quantization={quantization_method}",
]
serving_docker_uri = VLLM_DOCKER_URI
model = aiplatform.Model.upload(
    display_name="llama2-quantized-model",
    serving_container_image_uri=serving_docker_uri,
    serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
    serving_container_args=vllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)
model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    deploy_request_timeout=1800,
    service_account=service_account,
)
to deploy a model on Vertex AI. My question: is the quantized model "TheBloke/Llama-2-70B-chat-AWQ" part of the Docker image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20231127_0916_RC00", or does the script download the model from somewhere else before it is deployed on Vertex AI? If so, from where?
I am asking because I have a similar use case: I want to deploy my own custom PyTorch model to Vertex AI. I am not sure which "VLLM_DOCKER_URI" I should use, or where I need to store my custom model (e.g. Google Cloud Storage or somewhere else).
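To make the question concrete: as far as I can tell, the only place the snippet refers to the model is the `--model` argument passed to the serving container. vLLM documents that option as taking a Hugging Face model name or a path, so my guess is that the weights are fetched when the container starts rather than being baked into the image. Below is a minimal sketch of how I imagine building the same argument list for a different model. `build_vllm_args` is a hypothetical helper of my own, and `/path/to/my-model` is only a placeholder. I do not know whether this particular image accepts a Cloud Storage path there.

```python
from typing import List, Optional


def build_vllm_args(
    model_id: str,
    accelerator_count: int,
    quantization_method: Optional[str] = None,
    max_model_len: int = 4000,
) -> List[str]:
    """Build the vLLM server arguments used in the snippet above.

    Hypothetical helper: it only assembles strings and does not
    contact Vertex AI or download anything.
    """
    args = [
        "--host=0.0.0.0",
        "--port=7080",
        # The model is referenced only here, by name or by path.
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
        f"--max-model-len={max_model_len}",
    ]
    # Only request quantization when the model is actually quantized.
    if quantization_method:
        args.append(f"--quantization={quantization_method}")
    return args


# Same values as in the original snippet.
hub_args = build_vllm_args("TheBloke/Llama-2-70B-chat-AWQ", 2, "awq")

# Placeholder for a custom, non-quantized model (assumed location).
custom_args = build_vllm_args("/path/to/my-model", 1)
```

If this reading is right, the resulting list would be passed as `serving_container_args` exactly as in the original code, and only the model identifier would change for my custom model.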