Hi AI platform, I am using Vertex AI Pipelines to train detection models, and since today my pipelines have been failing with an NCCL error on the GPU machine. I create the task as follows:
```python
train_task = train_model(
    centernet_container_trainer_uri=centernet_uri_task.output,
    train_dataset=train_importer_task.output,
    test_dataset=test_importer_task.output,
    categories_json=categories_importer_task.output,
    pretrained_model=pretrained_model.output,
    num_iters=num_iters,
    batch_size=batch_size,
    lr=lr,
    num_epochs=num_epochs,
    lr_step=lr_step,
    gpus=gpus,
    num_workers=num_workers,
    val_intervals=val_intervals,
)
train_task.set_display_name("Train model")
train_task.set_cpu_limit("12")
train_task.set_memory_limit("170G")
train_task.add_node_selector_constraint("NVIDIA_TESLA_A100")
train_task.set_gpu_limit("2")
train_task.set_env_variable(name="NCCL_SHM_DISABLE", value="1")
```
Then, after a few minutes, the training step fails with:

```
NCCL Error 2: unhandled system error
```

This seems to be related to the shared-memory (`/dev/shm`) size of the container. I have tried both setting and unsetting the `NCCL_SHM_DISABLE` variable, but to no avail. Any idea how to get support for this issue?
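In case it helps diagnose the problem, something like the probe below could be dropped into the training entrypoint to log the container's `/dev/shm` size before training starts. This is just an illustrative sketch (`shm_size_bytes` is a helper name I made up); the assumption is that NCCL Error 2 points at an undersized shared-memory mount.

```python
import os

def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount in bytes."""
    st = os.statvfs(path)
    # Total size = fragment size * number of blocks on the mount.
    return st.f_frsize * st.f_blocks

if __name__ == "__main__":
    print(f"/dev/shm size: {shm_size_bytes() / 1024 ** 2:.0f} MiB")
```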
Thanks
I am using Vertex AI with the Kubeflow Pipelines SDK.
My logs look normal until they reach the training loop: