Recently, I tried out AWS ParallelCluster, a Linux-based HPC cluster solution. We use Slurm as the scheduler and OpenMPI for message passing. When submitting jobs that span multiple compute nodes, the jobs fail with various error messages; below is one version of them.
[ip-10-0-19-27][[16152,1],0][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],1]
[ip-10-0-19-27][[16152,1],1][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],0]
[ip-10-0-19-27][[16152,1],2][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],3]
[ip-10-0-19-27][[16152,1],3][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],2]
[ip-10-0-20-194][[16152,1],4][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],5]
[ip-10-0-20-194][[16152,1],5][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],4]
[ip-10-0-20-194][[16152,1],6][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],7]
[ip-10-0-20-194][[16152,1],7][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],6]
It turns out that OpenMPI somehow did not pick the right network interface. Adding the --mca btl_tcp_if_include ens3 command-line parameter to mpirun solves the problem. Here ens3 is the default network interface on the instance; you can find yours using ifconfig or ip addr.
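For example, assuming an Amazon Linux compute node (the interface name, e.g. ens3 or eth0, varies by instance type and AMI), you can confirm the interface name on a node, and the same MCA parameter can also be set as an environment variable instead of an mpirun flag:

# List interfaces with their addresses; pick the one carrying the
# cluster's private IP (10.0.x.x in the error messages above).
ip -brief addr show

# Equivalent to passing --mca btl_tcp_if_include on the command line:
export OMPI_MCA_btl_tcp_if_include=ens3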
Below is a sample Slurm submission script.
#!/bin/bash
#SBATCH --job-name=montecarlojob
#SBATCH --ntasks=8
#SBATCH --output=%x_%j.out

# Load the OpenMPI module shipped with ParallelCluster.
module load openmpi

# Pin the TCP BTL to the instance's network interface (ens3 here)
# so that ranks on different nodes can reach each other.
mpirun --mca btl_tcp_if_include ens3 -np 8 a.out
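Assuming the script is saved as submit.sh (the file name is arbitrary) and a.out is an MPI binary already compiled on the shared filesystem, the job is submitted with sbatch and the output lands in a file named after the --output=%x_%j.out pattern:

sbatch submit.sh
squeue                          # the job should span both compute nodes
cat montecarlojob_<jobid>.out   # job name plus job ID, per %x_%j.out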