Troubleshooting InfiniBand Connections
The following is a brief troubleshooting guide for InfiniBand networks found in common HPC Linux clusters. Running these commands requires the OFED package, version 1.5.2 or later, to be installed on your systems. The “pdsh” (parallel shell) command used below is part of the HP CMU cluster management software (version 4.2.1 in our example) installed on the head node. If you don’t have CMU installed, you will find a simple scripted alternative below for running commands on multiple cluster nodes. In our cluster, the compute nodes are named “node1” through “node32”.
Identify the hardware module used for the IB interface:
    ls /sys/class/infiniband
Sample output:
    mlx4_0
Check the state of the IB port:
    cat /sys/class/infiniband/mlx4_0/ports/1/state
Sample output:
    4: ACTIVE
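The commands above hard-code `mlx4_0` and port 1. On hosts with a different adapter or a dual-port card, a loop over sysfs may be handier. The following is a minimal sketch: the function name `ib_port_states` and its optional root-directory argument are assumptions made for illustration (the argument exists only so the loop can be tried against a mock directory tree).

```shell
# Print the state of every port of every IB device found under a sysfs root.
# $1 is a hypothetical override for the root; it defaults to the real sysfs path.
ib_port_states() {
    root="${1:-/sys/class/infiniband}"
    for state_file in "$root"/*/ports/*/state; do
        # Skip the unexpanded glob pattern when no IB device is present.
        [ -r "$state_file" ] || continue
        port_dir=${state_file%/state}      # .../<device>/ports/<port>
        port=${port_dir##*/}               # <port>
        dev_dir=${port_dir%/ports/*}       # .../<device>
        dev=${dev_dir##*/}                 # <device>
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$state_file")"
    done
}
```

Calling `ib_port_states` with no argument reads the live system and prints one line per port. No output means no IB device was found.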
Check the status of the subnet manager and start it if necessary:
    if [ `/etc/init.d/opensmd status | grep -c "not running"` -gt 0 ]
    then
        /etc/init.d/opensmd start
    fi
Sample output:
    Starting opensm: done
Check the state of the IB port on the compute nodes:
    pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/state
Sample output:
    node1: 4: ACTIVE
    node5: 4: ACTIVE
    node2: 4: ACTIVE
    node6: 4: ACTIVE
    node7: 4: ACTIVE
    ...
Here’s a way of running the command above without pdsh:
    i=1
    while [ $i -le 32 ]
    do
        echo -n "node$i: "
        ssh node$i "cat /sys/class/infiniband/mlx4_0/ports/1/state"
        (( i = i + 1 ))
    done
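With 32 nodes, scanning the output by eye gets tedious, and pdsh does not return results in node order. One possible filter is sketched below. It assumes input lines of the form `nodeN: <state>`, as produced by either approach above. The function name `ib_report_inactive` is made up for this example.

```shell
# Read "nodeN: <state>" lines on stdin and print only the ports that are
# not ACTIVE, ordered by node number. Assumes the "node" name prefix used
# in this cluster.
ib_report_inactive() {
    # -k1.5,1n: sort numerically on field 1 starting at its 5th character,
    # i.e. the digits that follow "node".
    sort -t: -k1.5,1n | awk 'NF > 0 && $NF != "ACTIVE"'
}
```

For example, `pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/state | ib_report_inactive` prints nothing when every port is ACTIVE. Lines carrying an ssh or pdsh error message are also printed, since they do not end in ACTIVE.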
The next step is to check the speed of IB ports on the head node:
    cat /sys/class/infiniband/mlx4_0/ports/1/rate
Sample output:
    40 Gb/sec (4X QDR)
… and on the compute nodes:
    pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/rate
And here’s how you do this without pdsh:
    i=1
    while [ $i -le 32 ]
    do
        echo -n "node$i: "
        ssh node$i "cat /sys/class/infiniband/mlx4_0/ports/1/rate"
        (( i = i + 1 ))
    done
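A port that negotiated a lower rate than its neighbours often points to a bad cable or connector. To spot such nodes quickly, the per-node output can be compared against the rate seen on the head node. This is only a sketch: `ib_rate_mismatch` is a hypothetical helper name, and it assumes `nodeN: <rate>` input lines as produced above.

```shell
# Read "nodeN: <rate>" lines on stdin and print the nodes whose rate string
# differs from the expected one given as $1.
ib_rate_mismatch() {
    expected="$1"
    awk -v want="$expected" '{
        node = $0; sub(/:.*/, "", node)        # text before the first colon
        rate = $0; sub(/^[^:]*: */, "", rate)  # text after "nodeN: "
        if (rate != want)
            print node ": " rate " (expected " want ")"
    }'
}
```

A possible invocation: `pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/rate | ib_rate_mismatch "40 Gb/sec (4X QDR)"`.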
More detailed analysis of the IB connection can be performed with the ibdiagnet command:
    ibdiagnet -pc -c 1000
Sample output:
    Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.5.4
    -W- Topology file is not specified.
        Reports regarding cluster links will use direct routes.
    Loading IBDM from: /usr/lib64/ibdm1.5.4
    -I- Using port 1 as the local port.
    -I- Discovering ... 39 nodes (3 Switches & 36 CA-s) discovered.

    -I---------------------------------------------------
    -I- Bad Guids/LIDs Info
    -I---------------------------------------------------
    -I- No bad Guids were found

    -I---------------------------------------------------
    -I- Links With Logical State = INIT
    -I---------------------------------------------------
    -I- No bad Links (with logical state = INIT) were found

    -I---------------------------------------------------
    -I- General Device Info
    -I---------------------------------------------------

    -I---------------------------------------------------
    -I- PM Counters Info
    -I---------------------------------------------------
    -W- lid=0x0001 guid=0x0008f10500200898 dev=23130 Port=4
        Performance Monitor counter : Value
        symbol_error_counter : 0x4 (Increase by 4 during ibdiagnet scan.)

    -I---------------------------------------------------
    -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
    -I---------------------------------------------------
    -I- PKey:0x7fff Hosts:36 full:36 limited:0

    -I---------------------------------------------------
    -I- IPoIB Subnets Check
    -I---------------------------------------------------
    -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
    -W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

    -I---------------------------------------------------
    -I- Bad Links Info
    -I- No bad link were found
    -I---------------------------------------------------
    ----------------------------------------------------------------
    -I- Stages Status Report:
        STAGE                          Errors Warnings
        Bad GUIDs/LIDs Check           0      0
        Link State Active Check        0      0
        General Devices Info Report    0      0
        Performance Counters Report    0      1
        Partitions Check               0      0
        IPoIB Subnets Check            0      1

    Please see /tmp/ibdiagnet.log for complete log
    ----------------------------------------------------------------

    -I- Done. Run time was 11 seconds.
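In the report above, the interesting lines are the ones tagged `-W-`, together with the detail lines that follow them. A small filter can pull just those out of a saved copy of the output. The sketch below assumes the layout shown in the sample, where warnings start with `-W-` and their detail lines continue until the next line that starts with `-` or is blank. Treating `-E-` as an error tag is a guess, and `ibdiag_warnings` is a name invented for this example.

```shell
# Read ibdiagnet output on stdin and print only warning/error entries
# together with their continuation lines.
ibdiag_warnings() {
    awk '
        /^-[WE]-/                { show = 1; print; next }  # start of an entry
        /^-/ || /^[[:space:]]*$/ { show = 0; next }         # any other tag, rule, or blank line ends it
        show                     { print }                  # continuation line of the current entry
    '
}
```

For example: `ibdiagnet -pc -c 1000 | ibdiag_warnings`.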
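ibdiagnet identifies devices by LID in its report. The LID of the IB port on the head node and on the compute nodes can be read with:
    cat /sys/class/infiniband/mlx4_0/ports/1/lid
    pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/lid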
The final step is to check the error state of each port:
    ibcheckerrors
Sample output:
    #warn: counter SymbolErrors = 35 (threshold 10) lid 1 port 255
    #warn: counter RcvSwRelayErrors = 512 (threshold 100) lid 1 port 255
    Error check on lid 1 (Voltaire 4036 - 36 QDR ports switch) port all: FAILED
    #warn: counter SymbolErrors = 35 (threshold 10) lid 1 port 4
    Error check on lid 1 (Voltaire 4036 - 36 QDR ports switch) port 4: FAILED

    ## Summary: 39 nodes checked, 0 bad nodes found
    ## 134 ports checked, 1 ports have errors beyond threshold
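On a larger fabric the `#warn:` lines can run to many screens. A sketch of one way to condense them into a table with the largest counters on top is shown below. It relies on the exact `#warn: counter NAME = VALUE (threshold N) lid L port P` wording seen in the sample output, and `ib_error_table` is a hypothetical helper name.

```shell
# Read ibcheckerrors output on stdin and print one tab-separated row per
# "#warn: counter" line: lid, port, counter name, value, threshold.
# Rows are sorted by value (largest first), then by port number.
ib_error_table() {
    awk '$1 == "#warn:" && $2 == "counter" {
        thr = $7; sub(/\)$/, "", thr)   # "10)" -> "10"
        printf "%s\t%s\t%s\t%s\t%s\n", $9, $11, $3, $5, thr
    }' | sort -t "$(printf '\t')" -k4,4nr -k2,2n
}
```

For example: `ibcheckerrors | ib_error_table`.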
In our case, the “ibcheckerrors” command revealed a problem with the IB switch. This turned out to be a hardware problem and the switch needed to be replaced.
