A single error is not significant on its own, but it would be nice to be able to collect core dumps in one place, so that GPUs that malfunction repeatedly can be identified from the collection.
2023-04-19 05:54:50.710228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
59/7504 [..............................] - ETA: 2:48:34 - total_loss: 73.0354 - rec_loss: 0.0428 - perc_loss: 0.2867 - smooth_loss: 0.0362 - warping_loss: 0.6300 - psnr: 24.0297 - ssim: 0.6890
2023-04-19 05:59:55.261140: F ./tensorflow/core/kernels/conv_2d_gpu.h:537] Non-OK-status: GpuLaunchKernel(ShuffleInTensor3Simple<T, 2, 1, 0>, config.block_count, config.thread_per_block, 0, d.stream(), config.virtual_thread_count, in.data(), combined_dims, out.data()) status: Internal: an illegal memory access was encountered
2023-04-19 05:59:55.261148: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Aborted (core dumped)
(tf-cuda) rac@gpu:~$
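
As a rough illustration of what a central collection could be used for, here is a minimal Python sketch that tallies CUDA error lines per host from collected log text and lists the hosts that fail repeatedly. It only looks at log lines like the ones above, not at the core dump files themselves. The names `tally_cuda_errors`, `repeat_offenders`, the `logs_by_host` mapping, and the `min_errors` threshold are all hypothetical, made up for this example.

```python
import re
from collections import Counter

# Matches CUDA driver error names such as CUDA_ERROR_ILLEGAL_ADDRESS.
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")


def tally_cuda_errors(logs_by_host):
    """Count log lines that mention a CUDA error, per host.

    logs_by_host is a hypothetical mapping of host name -> collected log text.
    """
    counts = Counter()
    for host, text in logs_by_host.items():
        for line in text.splitlines():
            if CUDA_ERROR.search(line):
                counts[host] += 1
    return counts


def repeat_offenders(counts, min_errors=2):
    """Return hosts with at least min_errors error lines (threshold is an assumption)."""
    return sorted(host for host, n in counts.items() if n >= min_errors)
```

With logs gathered from several machines, `repeat_offenders(tally_cuda_errors(logs_by_host))` would give a sorted list of host names worth a closer look. A real setup would probably also want to key on the GPU within each host, which this sketch does not attempt.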