
TPU POD - no initiation

Dear List,

I created a TPU Pod v2-32 and the creation succeeded. But when I ran the example from https://cloud.google.com/tpu/docs/jax-pods, it produced the error shown below.

Any guidance on what is going on would be a great help.

(I also tried a v3-32, and it shows the same error.)

Thanks in advance,

Thoma

mbctbiofuel@cloudshell:~ (mytpu1)$ gcloud compute tpus tpu-vm ssh node-1 --zone=us-central1-a --worker=all --command="python3 example.py"
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
##### Command execution on worker 2 failed with exit status 1. Continuing.
##### Command execution on worker 1 failed with exit status 1. Continuing.
##### Command execution on worker 3 failed with exit status 1. Continuing.
##### Command execution on worker 0 failed with exit status 1. Continuing.

mbctbiofuel@cloudshell:~ (mytpu1)$ ls /dev/accel*
ls: cannot access '/dev/accel*': No such file or directory


Good day @biofuel,

Welcome to Google Cloud Community!

There are several possible reasons for this error. You can check whether any of the following resolves it:

1. SSH into the instance and run the command from there.
For more information, see: https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/ssh#--command

2. Run the Python script through the TPU startup script, so that it runs in the background once the TPU is created. For more information, see:
https://cloud.google.com/compute/docs/instances/startup-scripts
https://cloud.google.com/compute/docs/instances/startup-scripts/linux#accessing-metadata

3. Run the commands given in the exception message:

sudo lsof -w /dev/accel0
sudo rm /tmp/libtpu_lockfile

4. Also try this guide on running a calculation on a Cloud TPU VM using JAX: https://cloud.google.com/tpu/docs/run-calculation-jax
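
The device path and lock file named in the exception only exist on a TPU VM worker. Note that the `ls /dev/accel*` in the question was run at the Cloud Shell prompt, where those device nodes would not be expected. As a rough sketch, a small check like the one below could be run on each worker before deleting anything. The function names `tpu_host_status` and `describe` are made up for illustration, the two paths are taken from the error message, and the interpretation in the messages is an assumption rather than documented behavior.

```python
import glob
import os

# Both paths are quoted in the error message from the thread.
ACCEL_GLOB = "/dev/accel*"
LOCKFILE = "/tmp/libtpu_lockfile"


def tpu_host_status(accel_glob=ACCEL_GLOB, lockfile=LOCKFILE):
    """Report which accelerator device nodes exist and whether the lock file is present."""
    return {
        "devices": sorted(glob.glob(accel_glob)),
        "lockfile_present": os.path.exists(lockfile),
    }


def describe(status):
    """Turn the status dict into a one-line hint (assumed interpretation)."""
    if not status["devices"]:
        return "no accelerator device nodes found: probably not a TPU VM worker"
    if status["lockfile_present"]:
        return "lock file present: another process may hold the TPU (see lsof)"
    return "device nodes visible and no lock file"


if __name__ == "__main__":
    print(describe(tpu_host_status()))
```

Saved under a name such as check_tpu.py (a hypothetical filename), it could be copied to and run on every worker with the same `scp` and `ssh --worker=all --command=...` pattern used for example.py.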

I also recommend that you reach out to Google Cloud Support about this issue: https://cloud.google.com/support

Hope this helps!

Dear kvandres,

Many thanks for the quick help.

After trying this, I will write back in detail.

Ciao,
T

I used https://cloud.google.com/tpu/docs/jax-pods for a retry.

I tried your suggestion [1] and it worked:

1. gcloud compute tpus tpu-vm create tpu-name --zone=us-central1-a --accelerator-type=v3-32 --version=tpu-vm-v4-base

2. gcloud compute tpus tpu-vm ssh tpu-name \
--zone=us-central1-a --worker=all --command="pip install \
--upgrade 'jax[tpu]>0.3.0' \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html" \
--project=mytpu1

...

3. gcloud compute tpus tpu-vm scp example.py tpu-name: --worker=all --zone=us-central1-a

4. gcloud compute tpus tpu-vm ssh tpu-name --zone=us-central1-a --worker=all --command="python3 example.py"

mbctbiofuel@cloudshell:~ (mytpu1)$ gcloud compute tpus tpu-vm ssh tpu-name --zone=us-central1-a --worker=all --command="python3 example.py"
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
global device count: 32
local device count: 8
pmap result: [32. 32. 32. 32. 32. 32. 32. 32.]

It worked!

I guess that if we create the v2-32 manually using gcloud, it works fine.

Thanks
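
The thread never shows example.py itself; the traceback only reveals that its line 5 is `device_count = jax.device_count()`. As a hedged sketch, a script along the following lines would print output of the shape reported above. It is modeled on the JAX pod guide the poster followed, and `run_check` is a name introduced here for illustration, not something taken from the original script.

```python
# Sketch of an example.py consistent with the output in this thread.
import jax
import jax.numpy as jnp


def run_check():
    """Sum a vector of ones across every device JAX can see."""
    # One element per device attached to this host.
    xs = jnp.ones(jax.local_device_count())
    # psum adds the per-device values across all devices on all hosts,
    # so each entry of the result equals the global device count.
    result = jax.pmap(lambda x: jax.lax.psum(x, "i"), axis_name="i")(xs)
    return jax.device_count(), jax.local_device_count(), result


if __name__ == "__main__":
    global_count, local_count, result = run_check()
    # Every worker runs the script; only the first process prints.
    if jax.process_index() == 0:
        print("global device count:", global_count)
        print("local device count:", local_count)
        print("pmap result:", result)
```

In the successful run, each of the four workers reported 8 local devices, which is why the pmap result has eight entries and each entry is 32. The script has to be started on all workers together (the `--worker=all` flag in step 4), because the cross-host sum waits for every host to take part.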