Hello,
I am trying to run an MPI job on Batch using this example (https://cloud.google.com/batch/docs/create-run-job-mpi-library), but the job fails to SSH into the other nodes in the Batch job.
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] ui_cmd_cb (mpiexec/pmiserv_pmci.c:51): Launch proxy failed.
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:181): error waiting for event
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] main (mpiexec/mpiexec.c:247): process manager error waiting for completion
I SSH'ed into one of the Batch nodes and tried to SSH into the other node directly, and it timed out.
[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ cat /etc/cloudbatch-taskgroup-hosts
XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08
XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr
Above is the content of the $BATCH_HOST_FILE.
[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ ping XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08
PING XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147) 56(84) bytes of data.
64 bytes from XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147): icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147): icmp_seq=2 ttl=64 time=0.050 ms
A node can ping or SSH into itself.
[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ ping XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr
PING XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr.us-central1-f.c.xxxx.internal (10.128.0.121) 56(84) bytes of data.
--- XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr.us-central1-f.c.xxxx.internal ping statistics ---
47 packets transmitted, 0 received, 100% packet loss, time 45999ms
But ping and SSH to a different node time out. However, the Batch node can still resolve the IP address from the hostname.
Please let me know how I can fix this.
Hi @sung_sb,
Welcome to Google Cloud Community!
The behavior you're facing is likely related to network connectivity between your Batch nodes.
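To confirm this from inside a node, one option is to probe TCP port 22 on every host listed in the task group hosts file. The following is a minimal Python sketch, not an official Batch tool: the function names `read_hosts`, `probe`, and `report` are hypothetical, and the default path is simply the one shown in the question. A "timeout" result (as opposed to "refused") generally means packets are being dropped on the way, which is what the 100% ping loss above also suggests.

```python
import socket


def read_hosts(path):
    """Return the non-empty, non-comment lines of a hosts file."""
    hosts = []
    with open(path) as f:
        for line in f:
            name = line.strip()
            if name and not name.startswith("#"):
                hosts.append(name)
    return hosts


def probe(host, port=22, timeout=3.0):
    """Try a TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: packets are probably being dropped.
        return "timeout"
    except OSError as exc:
        # Covers DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


def report(path="/etc/cloudbatch-taskgroup-hosts", port=22):
    """Print one line per host with the probe result."""
    for host in read_hosts(path):
        print(f"{host}:{port} -> {probe(host, port)}")
```

Calling `report()` on one of the Batch VMs prints one line per host. If the local host shows "open" while the other host shows "timeout", the problem lies in the network path between the VMs and not in name resolution or the SSH daemon.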
I would recommend the steps below to troubleshoot this:
[...] true to allow passwordless communication among the VM instances.
I hope the above information is helpful.
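As a quick sanity check on the job definition, here is a small Python sketch that inspects a job config JSON for the task group settings related to MPI. It assumes the setting referred to above is the task group's `permissiveSsh` field, which the Batch API describes as enabling passwordless SSH between the VMs of a task group, and that `requireHostsFile` is set alongside it. The helper name `missing_mpi_settings` and the sample config are hypothetical.

```python
import json


def missing_mpi_settings(job):
    """List task group fields that are not explicitly set to true."""
    problems = []
    for i, group in enumerate(job.get("taskGroups", [])):
        # Assumption: these are the two fields relevant to MPI jobs.
        for field in ("requireHostsFile", "permissiveSsh"):
            if group.get(field) is not True:
                problems.append(f"taskGroups[{i}].{field}")
    return problems


# Hypothetical, heavily trimmed job config used only for illustration.
sample = json.loads("""
{
  "taskGroups": [
    {"taskCount": 2, "requireHostsFile": true}
  ]
}
""")

print(missing_mpi_settings(sample))
```

An empty list means both fields are set to true in every task group. Note that even with these settings in place, the VMs still need network connectivity to each other, so a firewall that drops traffic between them would produce the same timeouts.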