Is the model part of the container in the script

Yash2384 · 04-10-2024 11:14 AM

I was looking into the following code:

from google.cloud import aiplatform

# Set docker and quantization for AWQ quantized models
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20231127_0916_RC00"
quantized_model_id = "TheBloke/Llama-2-70B-chat-AWQ"
quantization_method = "awq"
machine_type = "g2-standard-24"
accelerator_type = "NVIDIA_TESLA_L4"
accelerator_count = 2
# Fill with the created service account.
service_account = ""
endpoint = aiplatform.Endpoint.create(display_name="llama2-quantized-endpoint")
vllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    # Use the variable defined above; `model_id` is not defined in this snippet.
    f"--model={quantized_model_id}",
    f"--tensor-parallel-size={accelerator_count}",
    "--swap-space=16",
    "--gpu-memory-utilization=0.9",
    "--disable-log-stats",
    "--max-model-len=4000",
    f"--quantization={quantization_method}",
]
serving_docker_uri = VLLM_DOCKER_URI
model = aiplatform.Model.upload(
    display_name="llama2-quantized-model",
    serving_container_image_uri=serving_docker_uri,
    serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
    serving_container_args=vllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)

model.deploy(
    endpoint=endpoint,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    deploy_request_timeout=1800,
    service_account=service_account,
)

This code deploys a model on Vertex AI. My question is: is the quantized model "TheBloke/Llama-2-70B-chat-AWQ" part of the Docker image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20231127_0916_RC00", or, if not, where does the script download the model from when it is deployed on Vertex AI?

I am asking because I have a similar use case: I want to deploy my own custom PyTorch model to Vertex AI. I am not sure which image I should use for "VLLM_DOCKER_URI", or where I need to store my custom model (for example, in Google Cloud Storage or somewhere else).
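
To make the question concrete: as far as I can tell, the only place the script refers to the model is the "--model" argument passed to the vLLM server. The image URI never mentions the model. vLLM's "--model" flag takes either a Hugging Face repo ID or a local path, so presumably the weights are resolved when the container starts rather than being baked into the image. Below is a minimal sketch of that argument list with the model reference as a parameter. The helper name `build_vllm_args` and the path `/models/my-custom-model` are hypothetical and only for illustration. Whether a given custom PyTorch model can be served this way depends on vLLM supporting its architecture, which I have not verified.

```python
from typing import List, Optional


def build_vllm_args(
    model_ref: str,
    accelerator_count: int,
    quantization: Optional[str] = None,
) -> List[str]:
    """Build the vLLM server arguments (hypothetical helper, not part of the script)."""
    args = [
        "--host=0.0.0.0",
        "--port=7080",
        # The model is referenced here only, not in the Docker image URI.
        f"--model={model_ref}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.9",
        "--disable-log-stats",
        "--max-model-len=4000",
    ]
    # Only pass --quantization when the model is actually quantized.
    if quantization is not None:
        args.append(f"--quantization={quantization}")
    return args


# Same values as in the script above: a Hugging Face repo ID.
hub_args = build_vllm_args("TheBloke/Llama-2-70B-chat-AWQ", 2, quantization="awq")

# Assumption: a custom model stored at a hypothetical local path in the container.
local_args = build_vllm_args("/models/my-custom-model", 1)

print(hub_args[2])
print(local_args[2])
```

If this reading is right, switching to a custom model would mean changing only the value given to "--model" (and dropping "--quantization" if the model is not quantized), while the serving image stays the same.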