Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand

When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, and network fabric behaviour.

This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and identify the real source of performance problems.


Why Diagnostics Matter in HPC Environments

Modern HPC systems are complex. Schedulers manage CPU ownership, operating systems handle memory allocation, applications introduce their own behaviour, and accelerators depend heavily on topology.

Without proper diagnostics, it is easy to misattribute performance problems to applications when the real issue lies in infrastructure alignment.


Design Goal

The goal is simple:

One reusable script where you update a small set of variables, plug in any workload, and receive a complete diagnostic log.

Here’s how we achieve that.


Reusable HPC Diagnostic Wrapper

Below is a diagnostic wrapper script that can be reused across different workloads. Only the variables at the top need to be changed.

At the time of writing, the script itself is available only to clients who engage me through my limited company.
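
To make the rest of this post concrete, here is a minimal sketch of the shape such a wrapper can take. It is not that script: the variable names (APP_CMD, NUMA_NODE, CPU_LIST, LOG_FILE) and the ten-second settling delay are illustrative assumptions, while the commands themselves are the standard tools discussed throughout this post.

#!/usr/bin/env bash
# Minimal illustrative sketch of a diagnostic wrapper -- not the script
# referred to above. Edit the variables, plug in your workload, run it.

APP_CMD="./app"          # workload to launch (for MPI jobs this would be an mpirun command)
NUMA_NODE=1              # NUMA node for memory binding
CPU_LIST=8               # CPU list for pinning
LOG_FILE="diag_$(hostname)_$(date +%Y%m%d_%H%M%S).log"

exec > >(tee "$LOG_FILE") 2>&1      # mirror all output into a single log
START=$SECONDS

echo "=== HPC DIAGNOSTIC RUN ==="
echo "Host      : $(hostname)"
echo "Timestamp : $(date -u)"
echo "Command   : $APP_CMD"

echo "=== NUMA TOPOLOGY ==="
lscpu | grep -i numa
numactl --version

echo "=== GPU TOPOLOGY ==="
nvidia-smi --query-gpu=index,name,pci.bus_id,memory.total --format=csv

echo "=== GPU-NUMA AFFINITY ==="
nvidia-smi topo -m

echo "=== INFINIBAND STATUS ==="
ibstat

echo "=== INFINIBAND LINK ==="
ibstatus

echo "=== STARTING APPLICATION ==="
numactl --physcpubind="$CPU_LIST" --membind="$NUMA_NODE" $APP_CMD &
APP_PID=$!
sleep 10                             # let the workload settle before sampling

echo "=== NUMA POLICY ==="
numactl --physcpubind="$CPU_LIST" --membind="$NUMA_NODE" numactl --show

echo "=== CPU AFFINITY ==="
taskset -cp "$APP_PID"
ps -o pid,psr,comm -p "$APP_PID"

echo "=== NUMA MEMORY STATS ==="
numastat -p "$APP_PID"

echo "=== GPU UTILISATION ==="
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.free,temperature.gpu --format=csv

echo "=== GPU PROCESS LIST ==="
nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id,used_gpu_memory --format=csv

echo "=== INFINIBAND COUNTERS ==="
perfquery -x

wait "$APP_PID"
echo "=== COMPLETE ==="
echo "Runtime (s): $((SECONDS - START))"

The example log below also includes per-rank MPI binding lines; those come from passing --report-bindings to mpirun as part of the workload command, which is covered later in this post.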

Example Script Output

When you run the diagnostic wrapper on a multi-NUMA HPC node with GPUs and InfiniBand, the complete output looks like this:

=== HPC DIAGNOSTIC RUN ===
Host      : compute-node-42
Timestamp : Sat Jan 17 14:32:01 UTC 2026
Command   : ./app 

=== NUMA TOPOLOGY ===
NUMA node(s):          2
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

numactl 2.0.14

=== GPU TOPOLOGY ===
index, name, pci.bus_id, memory.total [MiB]
0, NVIDIA A100-SXM4-80GB, 00000000:07:00.0, 81920 MiB
1, NVIDIA A100-SXM4-80GB, 00000000:0B:00.0, 81920 MiB
2, NVIDIA A100-SXM4-80GB, 00000000:48:00.0, 81920 MiB
3, NVIDIA A100-SXM4-80GB, 00000000:4C:00.0, 81920 MiB

=== GPU-NUMA AFFINITY ===
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     PIX     0-7             0
GPU1    NV12     X      SYS     SYS     SYS     0-7             0
GPU2    SYS     SYS      X      NV12    SYS     8-15            1
GPU3    SYS     SYS     NV12     X      SYS     8-15            1
mlx5_0  PIX     SYS     SYS     SYS      X

=== INFINIBAND STATUS ===
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.31.1014
    Hardware version: 0
    Node GUID: 0x1070fd0300123456
    System image GUID: 0x1070fd0300123456
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x1070fd0300123456
        Link layer: InfiniBand

=== INFINIBAND LINK ===
Infiniband device 'mlx5_0' port 1 status:
    default gid:    fe80:0000:0000:0000:1070:fd03:0012:3456
    base lid:       0x1
    sm lid:         0x1
    state:          4: ACTIVE
    phys state:     5: LinkUp
    rate:           200 Gb/sec (4X HDR)
    link_layer:     InfiniBand

=== STARTING APPLICATION ===
[compute-node-42:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[compute-node-42:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[compute-node-42:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[compute-node-42:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]

=== NUMA POLICY ===
policy: bind
physcpubind: 8
membind: 1

=== CPU AFFINITY ===
pid 12345's current affinity list: 0
pid 12346's current affinity list: 1
pid 12347's current affinity list: 8
pid 12348's current affinity list: 9
  PID PSR COMMAND
12345   0 app
12346   1 app
12347   8 app
12348   9 app

=== NUMA MEMORY STATS ===
Per-node process memory usage (in MBs) for PID 12345 (app)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       128.00            0.00          128.00
Stack                        0.12            0.00            0.12
Private                   5120.00            0.00         5120.00
----------------  --------------- --------------- ---------------
Total                     5248.12            0.00         5248.12

=== GPU UTILISATION ===
index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB], temperature.gpu
0, 87 %, 45 %, 36864 MiB, 45056 MiB, 62
1, 0 %, 0 %, 0 MiB, 81920 MiB, 34
2, 92 %, 52 %, 42240 MiB, 39680 MiB, 65
3, 0 %, 0 %, 0 MiB, 81920 MiB, 35

=== GPU PROCESS LIST ===
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
12345, app, 00000000:07:00.0, 36864 MiB
12347, app, 00000000:48:00.0, 42240 MiB

=== INFINIBAND COUNTERS ===
# Port extended counters: Lid 1 port 1 (CapMask: 0x5300)
PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678
PortUnicastXmitPkts:..............45678900
PortUnicastRcvPkts:...............43215666
PortMulticastXmitPkts:............12
PortMulticastRcvPkts:.............12

=== COMPLETE ===
Runtime (s): 47

This single log file captures everything you need to verify correct infrastructure alignment across CPU, memory, GPU, and network fabric. The following sections explain how to interpret each part of this output.


Interpreting the Diagnostic Output

Each section of the output tells you something specific about how your workload is interacting with the underlying hardware. Here’s how to read each one.

NUMA Binding (numactl --show)

Good output:

policy: bind
physcpubind: 8
membind: 1

This confirms that the process is pinned to CPU 8 and all memory allocations are restricted to NUMA node 1.

Bad output:

policy: default
physcpubind: 8
membind: 0 1

With the default policy, memory can be allocated on either NUMA node, resulting in cross-socket access, higher latency, and unstable performance.
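
To get from the bad output back to the good one, the workload has to be launched under an explicit binding. A minimal example, using the CPU and node numbers from this post's example host:

# Pin the process to CPU 8 and restrict its allocations to NUMA node 1,
# then print the effective policy the workload will inherit.
numactl --physcpubind=8 --membind=1 ./app
numactl --physcpubind=8 --membind=1 numactl --show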


NUMA Memory Locality (numastat -p)

Good output:

Per-node process memory usage (MB)
Node 0:      0
Node 1:  10240

All memory usage is local to the NUMA node where the process is running. This is the expected and optimal behaviour.

Bad output:

Per-node process memory usage (MB)
Node 0:   4096
Node 1:   6144

Memory is split across NUMA nodes. This commonly leads to unpredictable runtimes, MPI slowdowns, and reduced GPU efficiency.
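
If you want to watch locality while the job is running rather than only in the final log, numastat can be sampled repeatedly; the process name app is just the example workload from the output above.

# Sample per-node memory usage every 5 seconds; numastat -p accepts
# either a PID or a process-name pattern.
watch -n 5 numastat -p app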


CPU Affinity (ps / taskset)

Good output:

PID   PSR  COMMAND
1234   8   app

pid 1234's current affinity list: 8

The process remains on the intended CPU and does not migrate between cores. Cache locality is preserved.

Bad output:

PID   PSR  COMMAND
1234   3   app

pid 1234's current affinity list: 0-15

The process has migrated to a different CPU. This usually indicates missing or ineffective CPU binding.
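
If you catch a process in the bad state, the affinity can be corrected on the fly and then re-verified; PID 1234 and CPU 8 are the example values used above.

# Re-pin the running process to CPU 8, then confirm the new affinity
# and the core it is actually scheduled on.
taskset -cp 8 1234
taskset -cp 1234
ps -o pid,psr,comm -p 1234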


GPU-NUMA Affinity (nvidia-smi topo -m)

Good output:

        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV12    PIX     0-7             0
GPU1    NV12     X      SYS     0-7             0
mlx5_0  PIX     SYS      X

This shows that GPU0 and the InfiniBand adapter (mlx5_0) share the same PCIe switch (PIX), so GPU-to-network transfers can bypass the CPU entirely when GPUDirect RDMA is in use. Both GPUs are local to NUMA node 0.

Bad output:

        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     0-7             0
GPU1    SYS      X      SYS     8-15            1
mlx5_0  SYS     SYS      X

All devices are connected via SYS (system/QPI), meaning every GPU-to-GPU and GPU-to-network transfer must traverse the CPU interconnect. This adds latency and consumes memory bandwidth.

Key topology indicators:

  • NV# — NVLink connection (fastest GPU-to-GPU path)
  • PIX — traffic crosses at most a single PCIe bridge (fast, bypasses the CPU)
  • PXB — traffic crosses multiple PCIe bridges but stays off the CPU (good)
  • SYS — traffic crosses the CPU interconnect (QPI/UPI; slowest, avoid for latency-sensitive workloads)
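
Acting on this matrix usually means co-locating each process with the GPU it drives. A sketch, assuming the example topology above (GPU2 local to NUMA node 1, CPUs 8-15):

# Expose only GPU2 to the process and keep its CPU and memory on the
# NUMA node that GPU2 hangs off.
CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./app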

GPU Utilisation (nvidia-smi)

Good output:

index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 95 %, 72000 MiB, 68

The GPU is highly utilised, most of its memory is in use, and the temperature is within normal operating range. The workload is GPU-bound, as expected.

Bad output:

index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 12 %, 8000 MiB, 42

Low GPU utilisation with minimal memory usage suggests the workload is CPU-bound or waiting on I/O. Check for data loading bottlenecks, CPU preprocessing stalls, or incorrect batch sizes.
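
Single snapshots can hide intermittent stalls, so it is worth sampling utilisation continuously for the length of the run; gpu_util.log is just an illustrative file name.

# Log utilisation, memory, and temperature every 5 seconds so short
# stalls show up instead of being averaged away.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu \
           --format=csv -l 5 >> gpu_util.log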


InfiniBand Status (ibstat / ibstatus)

Good output:

Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 200

The InfiniBand port is active, physically connected, and running at expected speed (200 Gb/s HDR).

Bad output:

Port 1:
    State: Down
    Physical state: Polling
    Rate: 10

The port is not connected or is negotiating at a much lower speed. Check cables, switch configuration, and subnet manager status.

Common link states:

  • Active / LinkUp — Normal operation
  • Init / LinkUp — Waiting for subnet manager
  • Down / Polling — No physical connection or cable fault
  • Armed — Link trained but not yet activated
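
Both tools can be pointed at a specific adapter and port when a node has more than one HCA; mlx5_0 is the example device used throughout this post.

# Query a single adapter/port rather than every device on the node.
ibstat mlx5_0 1
ibstatus mlx5_0:1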

InfiniBand Counters (perfquery)

Good output:

PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678

Data is flowing in both directions with balanced transmit and receive counts.

Bad output:

PortXmitData:.....................124587623456
PortRcvData:......................0
SymbolErrorCounter:...............4521
LinkDownedCounter:................12

Zero receive data with symbol errors and link-down events indicates cable or transceiver problems. Physical layer inspection is required.

Key counters to watch:

  • SymbolErrorCounter — Bit errors on the wire (should be 0)
  • LinkDownedCounter — Link reset events (should be 0 during operation)
  • PortRcvErrors — Malformed packets received
  • PortXmitDiscards — Packets dropped due to congestion
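
Because these counters are cumulative, a single reading only tells you so much; resetting after each read turns the next sample into a clean per-interval delta.

perfquery -x          # extended 64-bit data counters (as in the log above)
perfquery -R          # read the standard counters and reset them afterwards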

MPI Rank Binding (--report-bindings)

Good output:

[node:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[node:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[node:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[node:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]

Each MPI rank is bound to a specific core, distributed evenly across NUMA nodes. The B indicates where each rank is pinned.

Bad output:

[node:12345] MCW rank 0 is not bound (or bound to all available processors)
[node:12346] MCW rank 1 is not bound (or bound to all available processors)
[node:12347] MCW rank 2 is not bound (or bound to all available processors)
[node:12348] MCW rank 3 is not bound (or bound to all available processors)

MPI ranks are floating across all CPUs. This causes cache thrashing, cross-NUMA memory access, and inconsistent performance. Add --bind-to core to your mpirun command.
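
The exact flags depend on your MPI implementation and version; with Open MPI 4.x, something like the following reproduces the layout in the good output (two ranks per socket, each pinned to its own core).

# Two ranks per socket, each bound to a single core, with the binding
# report printed so it lands in the diagnostic log.
mpirun -np 4 --map-by ppr:2:socket --bind-to core --report-bindings ./app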


Diagnosing the Root Cause

By comparing good and bad outputs, we can narrow down the root cause:

  • Cross-NUMA memory allocation — indicates locality problems, often caused by missing --membind or memory allocated before binding was applied
  • CPU migration — points to missing or overridden affinity, commonly from scheduler interference or missing --physcpubind
  • Low GPU utilisation — suggests CPU bottleneck, data loading stalls, or incorrect CUDA device selection
  • GPU-NUMA mismatch — process running on wrong NUMA node relative to GPU, causing PCIe traffic to cross CPU socket
  • SYS topology between GPU and NIC — GPU-direct RDMA will underperform; consider workload placement or hardware topology changes
  • InfiniBand errors — physical layer problems requiring cable, transceiver, or switch port inspection
  • Unbound MPI ranks — missing binding flags causing rank migration and cache invalidation
  • High runtime variance — usually correlates with topology misalignment and can be confirmed by checking the above metrics across multiple runs

This comparison-driven approach removes guesswork and makes infrastructure-level issues easy to identify and prove.


My Thoughts

When running HPC systems, you need to gather as much diagnostic information as possible to work out where a problem actually lies.

Collecting CPU placement, NUMA locality, memory allocation, GPU topology, InfiniBand status, and MPI binding together allows you to methodically narrow down the root cause instead of guessing.

When these signals line up, performance is predictable and consistent. When they do not, the logs will usually tell you exactly what is wrong.
