Hi AI platform, I am using Vertex AI Pipelines to train detection models, and since today my pipelines have been failing with an NCCL error on the GPU machine. I create the task as follows:
```python
train_task = train_model(
    centernet_container_trainer_uri=centernet_uri_task.output,
    train_dataset=train_importer_task.output,
    test_dataset=test_importer_task.output,
    categories_json=categories_importer_task.output,
    pretrained_model=pretrained_model.output,
    num_iters=num_iters,
    batch_size=batch_size,
    lr=lr,
    num_epochs=num_epochs,
    lr_step=lr_step,
    gpus=gpus,
    num_workers=num_workers,
    val_intervals=val_intervals,
)
train_task.set_display_name("Train model")
train_task.set_cpu_limit("12")
train_task.set_memory_limit("170G")
train_task.add_node_selector_constraint("NVIDIA_TESLA_A100")
train_task.set_gpu_limit("2")
train_task.set_env_variable(name="NCCL_SHM_DISABLE", value="1")
```
Then, after a few minutes, the training step fails with:

```
NCCL Error 2: unhandled system error
```

This seems to be related to the shared-memory (`/dev/shm`) size of the container. I have tried both setting and unsetting the `NCCL_SHM_DISABLE` variable, but to no avail. Any idea how to get support for this issue?
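In case it helps diagnose the problem, something like the probe below could be dropped into the training entrypoint to log the container's `/dev/shm` size before training starts. This is just an illustrative sketch (`shm_size_bytes` is a helper name I made up); the assumption is that NCCL Error 2 points at an undersized shared-memory mount.

```python
import os

def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount in bytes."""
    st = os.statvfs(path)
    # Total size = fragment size * number of blocks on the mount.
    return st.f_frsize * st.f_blocks

if __name__ == "__main__":
    print(f"/dev/shm size: {shm_size_bytes() / 1024 ** 2:.0f} MiB")
```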
Thanks
I am using Vertex AI with the Kubeflow Pipelines SDK.
My logs look normal until they reach the training loop: