Announcements
This site is in read only until July 22 as we migrate to a new platform; refer to this community post for more details.
Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

TPU initialization error while running training script

I am trying to train my pytorch model on a TPU pod v3-32 but it shows me the following error while running my training script:

2023-05-13 16:35:49.648194: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Operation not permitted; Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0, config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: "" dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true; could not create driver instance
https://symbolize.stripped_domain/r/?trace=7f7acf09500b,7f7acf09508f,7f79a4342bff,7f79a4646a26,7f79a...
*** SIGABRT received by PID 317591 (TID 317591) on cpu 25 from PID 317591; stack trace: ***
PC: @ 0x7f7acf09500b (unknown) raise
@ 0x7f79a07c2a1a 1152 (unknown)
@ 0x7f7acf095090 468830624 (unknown)
@ 0x7f79a4342c00 400 tsl::internal_statusor::Helper::Crash()
@ 0x7f79a4646a27 768 xla::PjRtComputationClient::PjRtComputationClient()
@ 0x7f79a4629a72 1440 xla::ComputationClient::Create()
@ 0x7f79a462bfe3 32 std::call_once<>()::{lambda()#2}::_FUN()
@ 0x7f7acf0404df (unknown) __pthread_once_slow
https://symbolize.stripped_domain/r/?trace=7f7acf09500b,7f79a07c2a19,7f7acf09508f,7f79a4342bff,7f79a...
E0513 16:35:49.741066 317591 coredump_hook.cc:414] RAW: Remote crash data gathering hook invoked.
E0513 16:35:49.741084 317591 coredump_hook.cc:453] RAW: Skipping coredump since rlimit was 0 at process start.
E0513 16:35:49.741108 317591 client.cc:278] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0513 16:35:49.741111 317591 coredump_hook.cc:512] RAW: Sending fingerprint to remote end.
E0513 16:35:49.741116 317591 coredump_socket.cc:120] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0513 16:35:49.741129 317591 coredump_hook.cc:518] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0513 16:35:49.741133 317591 coredump_hook.cc:580] RAW: Dumping core locally.
E0513 16:35:49.817174 317591 process_state.cc:784] RAW: Raising signal 6 with default behavior
Aborted (core dumped)

 

0 2 1,455