Sunday, May 10, 2020

Solving AWS ParallelCluster Multi-Node Job Submission Failures with Slurm + OpenMPI

Recently, I tried out AWS ParallelCluster, a Linux-based HPC cluster solution, with Slurm as the scheduler and OpenMPI for message passing. When submitting jobs that span multiple compute nodes, the jobs failed with various error messages; below is one version of them.

[ip-10-0-19-27][[16152,1],0][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],1]
[ip-10-0-19-27][[16152,1],1][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],0]
[ip-10-0-19-27][[16152,1],2][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],3]
[ip-10-0-19-27][[16152,1],3][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],2]
[ip-10-0-20-194][[16152,1],4][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],5]
[ip-10-0-20-194][[16152,1],5][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],4]
[ip-10-0-20-194][[16152,1],6][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],7]
[ip-10-0-20-194][[16152,1],7][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16152,1],6]

It turns out that OpenMPI was not selecting the correct network interface for TCP communication between ranks on different nodes. Adding the --mca btl_tcp_if_include ens3 parameter to the mpirun command line solves the problem. Here ens3 is the default network interface on these instances; you can find yours using ifconfig.
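For example, to discover the interface name on a compute node (a minimal sketch, assuming the standard net-tools or iproute2 utilities are installed; ens3 is just what these instances happened to use):

# List interfaces with their IP addresses; pick the one carrying
# the node's private cluster address (10.0.x.x in the logs above).
ifconfig

# Equivalent check with the iproute2 tool if ifconfig is unavailable:
ip -brief addr show

The interface holding the private VPC address is the one to pass to btl_tcp_if_include.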

Below is a sample submission script.

#!/bin/bash
#SBATCH --job-name=montecarlojob
#SBATCH --ntasks=8
#SBATCH --output=%x_%j.out

# Load the OpenMPI environment module provided by ParallelCluster.
module load openmpi

# Pin OpenMPI's TCP BTL to the ens3 interface so ranks on different
# nodes can connect to each other.
mpirun --mca btl_tcp_if_include ens3 -np 8 a.out
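Assuming the script is saved as submit.sh (the filename here is just for illustration), submitting and checking the job would look something like this:

# Submit the job to Slurm.
sbatch submit.sh

# Watch it get scheduled across the compute nodes.
squeue

# Per the --output pattern %x_%j.out, the results land in
# montecarlojob_<jobid>.out once the job runs.
cat montecarlojob_*.out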
