Unable to run MPI on Batch

Hello,

I am trying to run an MPI job on Batch using this example (https://cloud.google.com/batch/docs/create-run-job-mpi-library), but the job fails to ssh into the other nodes in the Batch job.

[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] ui_cmd_cb (mpiexec/pmiserv_pmci.c:51): Launch proxy failed.
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:181): error waiting for event
[08-20 19:32:28.604] [0] [ERROR] [mpiexec@72a94ad7947b] main (mpiexec/mpiexec.c:247): process manager error waiting for completion

I ssh'ed into one of the Batch nodes and tried to ssh into the other node directly, but it timed out.

[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ cat /etc/cloudbatch-taskgroup-hosts
XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08
XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr

Above is the content of $BATCH_HOSTS_FILE.

[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ ping XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08
PING XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147) 56(84) bytes of data.
64 bytes from XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147): icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08.us-central1-f.c.xxxx.internal (10.128.0.147): icmp_seq=2 ttl=64 time=0.050 ms

As shown above, the node can ping and ssh into itself.

[XXXX@XXXX-163ad-131c8976-e019-49ff0-group0-0-cn08 ~]$ ping XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr
PING XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr.us-central1-f.c.xxxx.internal (10.128.0.121) 56(84) bytes of data.
--- XXXX-163ad-131c8976-e019-49ff0-group0-0-vmrr.us-central1-f.c.xxxx.internal ping statistics ---
47 packets transmitted, 0 received, 100% packet loss, time 45999ms

However, ping and ssh to the other node time out, even though the node can still resolve the other node's hostname to an IP address.

Please let me know how I can fix this.

jaydubu
Former Googler

Hi @sung_sb,

Welcome to Google Cloud Community!

The behavior you're facing is likely related to network connectivity between your Batch nodes.

I would recommend the steps below to troubleshoot this:
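
One way to narrow this down is to test, from one of the job's VMs, whether each peer listed in the hosts file accepts TCP connections on the SSH port. The following is a minimal sketch, not an official Batch tool: the function names list_peers, can_connect, and check_peers are made up for illustration, and it assumes that bash and the coreutils timeout command are available on the VM.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of Batch.

# Print every non-empty line of the hosts file except the given hostname.
# $1 = hosts file, $2 = hostname to exclude (normally this VM).
list_peers() {
  grep -v -x -F -e "$2" "$1" | grep -v '^[[:space:]]*$'
}

# Succeed if a TCP connection to host:port opens before the timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 5).
# Uses bash's /dev/tcp pseudo-device, so it needs bash, not plain sh.
can_connect() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" \
    </dev/null 2>/dev/null
}

# Report OK or FAIL for each peer; return non-zero if any peer failed.
# $1 = hosts file, $2 = port (default 22, the usual SSH port).
check_peers() {
  list_peers "$1" "$(hostname)" | {
    failed=0
    while IFS= read -r peer; do
      if can_connect "$peer" "${2:-22}"; then
        echo "OK   $peer"
      else
        echo "FAIL $peer"
        failed=1
      fi
    done
    [ "$failed" -eq 0 ]
  }
}
```

On a Batch VM you could load these functions and run check_peers "${BATCH_HOSTS_FILE:-/etc/cloudbatch-taskgroup-hosts}". If a peer reports FAIL while its hostname still resolves, as in your output, it is worth checking whether the firewall rules of the job's VPC network allow traffic between the VMs (for example ICMP and TCP port 22). That is a guess at the cause, not something the logs confirm.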

I hope the above information is helpful.