Installing the NVIDIA Container Toolkit for GCP Batch Jobs

I am trying to use `torchrun` to run GPU workloads on GCP Batch. However, this requires the NVIDIA Container Toolkit on the host: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

I cannot find any interface for pre-installing the NVIDIA Container Toolkit on the host before my container starts. Is there a way to work around this?

---Update---

I noticed that Google's Batch jobs always come with the following startup script in the instance template.

```
#!/bin/bash
cp -n /proc/uptime /var/tmp/google-batch-startup.txt

scurl() {
  curl --retry 5 -m 10 -fsSL "$@"
}

logstate() {
  local T
  T="$(printf '%(%Y/%m/%d-%H:%M:%S%z)T')"
  scurl -X PUT --data "$T,startup,${BASH_LINENO[0]},$1" -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/cloudbatch/vmstate || true
  echo "[Batch Startup] $2"
}

OSID="$(. /etc/os-release && echo "$ID")"
AGENT_VERSION="$(echo cloud-batch-agent_20231101.00_p00 | sed 's/.*agent_//;s/_/./')-0"
MACHINE="$(uname -m)"

if [[ "$OSID" == debian ]] ; then
  if [[ ! -f /usr/bin/cloud-batch-agent ]]; then
    logstate deb_update 'Updating packages.'
    echo "deb https://us-central1-apt.pkg.dev/projects/cloud-batch-content cloud-batch-deb main" | tee /etc/apt/sources.list.d/google-cloud-batch-agent.list
    scurl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
    scurl https://us-central1-apt.pkg.dev/doc/repo-signing-key.gpg | sudo apt-key add -
    apt-get -o Acquire::Retries=3 -o DPkg::Lock::Timeout=60 update
    logstate deb_updated 'Package update completed.'
    logstate agent_install 'Installing Google Cloud Batch service VM agent.'
    apt-get -o Acquire::Retries=3 -o DPkg::Lock::Timeout=60 install -y cloud-batch-agent="$AGENT_VERSION"
    logstate agent_installed 'Google Cloud Batch service VM agent installed.'
  fi
  if [[ ! -f /etc/rules ]]; then
    iptables -A INPUT -j ACCEPT
    iptables-save > /etc/rules
    logstate iptables-update 'Iptables setting saved.'
  fi
  iptables-restore < /etc/rules
  systemctl start cloud-batch-agent.service
elif [[ -f /etc/centos-release ]] || [[ -f /etc/redhat-release ]] || [[ -f /etc/oracle-release ]] || [[ -f /etc/system-release ]]; then
  if [[ ! -f /usr/bin/cloud-batch-agent ]]; then
    echo "[cloud-batch-agent]
name=cloud-batch-agent (Artifact Registry)
baseurl=https://us-central1-yum.pkg.dev/projects/cloud-batch-content/cloud-batch-rpm
enabled=1
gpgcheck=0
repo_gpgcheck=0" | tee /etc/yum.repos.d/cloud-batch-agent.repo
    logstate yum_install 'Installing Google Cloud Batch service VM agent.'
    yum install -y --disablerepo='*' --enablerepo='cloud-batch-agent' cloud-batch-agent-"$AGENT_VERSION"
    logstate yum_agent_installed 'Google Cloud Batch service VM agent installed.'
  fi
  if [[ ! -f /etc/rules ]]; then
    iptables -A INPUT -j ACCEPT
    iptables-save > /etc/rules
    logstate iptables-update 'Iptables setting saved.'
  fi
  iptables-restore < /etc/rules
  systemctl start cloud-batch-agent.service
elif [[ "$OSID" == cos ]]; then
  if [[ ! -f /var/lib/google/rules ]]; then
    iptables -A INPUT -j ACCEPT
    iptables-save > /var/lib/google/rules
  fi
  iptables-restore < /var/lib/google/rules
  if [[ -f /var/lib/google/agent ]]; then
    printf -v logfile /var/lib/google/agent_log.'%(%Y-%m-%d-%H:%M:%S)T'
    logstate agent_log_file_ready 'Google Cloud Batch service VM agent log file is ready.'
    /var/lib/google/agent >> "${logfile}" 2>&1
  else
    logstate unsupported_cos 'Unsupported COS image.'
    exit 200
  fi
else
  logstate unsupported_os 'Unsupported distribution: Not pre-installing packages'
fi

systemctl is-active -q cloud-batch-agent && exit 0

printf -v logfile ~/agent_log.'%(%Y-%m-%d-%H:%M:%S)T'
logstate agent_log_file_ready 'Google Cloud Batch service VM agent log file is ready.'
GCS_PATH="gs://batch-agent-prod-us/agent/cloud-batch-agent_20231101.00_p00-$MACHINE/cloud-batch-agent"
gsutil cp "${GCS_PATH}" ~/agent >> "${logfile}" 2>&1
logstate agent_copied 'Copied Google Cloud Batch service VM agent from Google Cloud Storage.'
chmod a+x ~/agent
~/agent >> "${logfile}" 2>&1 &
```

Is there a similar way to install the NVIDIA Container Toolkit packages on the host machine before the container starts?
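For reference, the host-side step I have in mind would look roughly like the sketch below. It assumes a Debian-based image with Docker and the GPU driver already present, and that it runs as root the way the startup script does. The commands follow the apt instructions in the NVIDIA install guide linked above. The `run` helper, the `install_nvidia_toolkit` function name, and the `DRY_RUN` switch are my own illustration, not part of Batch or of NVIDIA's tooling. I do not know yet whether Batch offers a hook where this could be executed.

```shell
#!/bin/bash
# Hypothetical host-side step: install the NVIDIA Container Toolkit on a
# Debian-based VM and register it with Docker. Assumes root privileges.
# DRY_RUN is an assumption of this sketch: it defaults to 1, so the commands
# are only printed. Set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

KEYRING=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
REPO_URL=https://nvidia.github.io/libnvidia-container

run() {
  # Print the command line in dry-run mode, otherwise execute it.
  if [[ "$DRY_RUN" == 1 ]]; then
    echo "+ $*"
  else
    bash -c "$*"
  fi
}

install_nvidia_toolkit() {
  # Add NVIDIA's signing key and apt repository.
  run "curl -fsSL $REPO_URL/gpgkey | gpg --dearmor -o $KEYRING"
  run "curl -fsSL $REPO_URL/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=$KEYRING] https://#g' > /etc/apt/sources.list.d/nvidia-container-toolkit.list"
  # Install the toolkit package.
  run "apt-get update"
  run "apt-get install -y nvidia-container-toolkit"
  # Point Docker at the NVIDIA runtime and restart it to pick up the change.
  run "nvidia-ctk runtime configure --runtime=docker"
  run "systemctl restart docker"
}

install_nvidia_toolkit
```

With the default `DRY_RUN=1` this only prints the six commands, which makes it easy to review before running it for real on a test VM.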

 

 
