When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, and network fabric behaviour.
This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and identify the real source of performance problems.
Why Diagnostics Matter in HPC Environments
Modern HPC systems are complex. Schedulers manage CPU ownership, operating systems handle memory allocation, applications introduce their own behaviour, and accelerators depend heavily on topology.
Without proper diagnostics, it is easy to misattribute performance problems to the application when the real issue lies in infrastructure alignment.
Design Goal
The goal is simple:
One reusable script where you update a small set of variables, plug in any workload, and receive a complete diagnostic log.
Here’s how we achieve that.
Reusable HPC Diagnostic Wrapper
The diagnostic wrapper can be reused across different workloads; only the variables at the top need to be changed. The script itself is only available to clients who hire me through my limited company at this time.
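The production wrapper is not reproduced in this post, but to make the idea concrete, here is a minimal sketch of the general shape such a wrapper can take, assuming bash, numactl, the NVIDIA command-line tools, and the OFED InfiniBand utilities. The variable names, log filename, and 10-second sampling delay are illustrative assumptions, not the author's script:
#!/usr/bin/env bash
# Illustrative sketch only - not the production wrapper described in this post
set -uo pipefail

# --- Variables to edit per workload (names are illustrative) ---
APP_CMD="./app"    # workload to launch (can be an mpirun command line)
CPU_BIND="8"       # physical CPU(s) to pin to
MEM_NODE="1"       # NUMA node for memory allocations
LOG="diag_$(hostname)_$(date +%Y%m%d_%H%M%S).log"

{
  echo "=== HPC DIAGNOSTIC RUN ==="
  echo "Host      : $(hostname)"
  echo "Timestamp : $(date -u)"
  echo "Command   : ${APP_CMD}"

  echo "=== NUMA TOPOLOGY ===";      lscpu | grep -i numa; numactl --version
  echo "=== GPU TOPOLOGY ===";       nvidia-smi --query-gpu=index,name,pci.bus_id,memory.total --format=csv
  echo "=== GPU-NUMA AFFINITY ===";  nvidia-smi topo -m
  echo "=== INFINIBAND STATUS ===";  ibstat
  echo "=== INFINIBAND LINK ===";    ibstatus

  echo "=== NUMA POLICY (as applied) ==="
  numactl --physcpubind="${CPU_BIND}" --membind="${MEM_NODE}" numactl --show

  echo "=== STARTING APPLICATION ==="
  start=$(date +%s)
  numactl --physcpubind="${CPU_BIND}" --membind="${MEM_NODE}" ${APP_CMD} &
  APP_PID=$!

  sleep 10   # give the workload time to allocate memory and start GPU work
  echo "=== CPU AFFINITY ===";       taskset -cp "${APP_PID}"; ps -o pid,psr,comm -p "${APP_PID}"
  echo "=== NUMA MEMORY STATS ===";  numastat -p "${APP_PID}"
  echo "=== GPU UTILISATION ===";    nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.free,temperature.gpu --format=csv
  echo "=== GPU PROCESS LIST ===";   nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id,used_gpu_memory --format=csv
  echo "=== INFINIBAND COUNTERS ==="; perfquery -x

  wait "${APP_PID}"
  echo "=== COMPLETE ==="
  echo "Runtime (s): $(( $(date +%s) - start ))"
} 2>&1 | tee "${LOG}"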
Example Script Output
When you run the diagnostic wrapper on a multi-NUMA HPC node with GPUs and InfiniBand, the complete output looks like this:
=== HPC DIAGNOSTIC RUN ===
Host : compute-node-42
Timestamp : Sat Jan 17 14:32:01 UTC 2026
Command : ./app
=== NUMA TOPOLOGY ===
NUMA node(s): 2
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
numactl 2.0.14
=== GPU TOPOLOGY ===
index, name, pci.bus_id, memory.total [MiB]
0, NVIDIA A100-SXM4-80GB, 00000000:07:00.0, 81920 MiB
1, NVIDIA A100-SXM4-80GB, 00000000:0B:00.0, 81920 MiB
2, NVIDIA A100-SXM4-80GB, 00000000:48:00.0, 81920 MiB
3, NVIDIA A100-SXM4-80GB, 00000000:4C:00.0, 81920 MiB
=== GPU-NUMA AFFINITY ===
GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV12 SYS SYS PIX 0-7 0
GPU1 NV12 X SYS SYS SYS 0-7 0
GPU2 SYS SYS X NV12 SYS 8-15 1
GPU3 SYS SYS NV12 X SYS 8-15 1
mlx5_0 PIX SYS SYS SYS X
=== INFINIBAND STATUS ===
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.31.1014
Hardware version: 0
Node GUID: 0x1070fd0300123456
System image GUID: 0x1070fd0300123456
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x1070fd0300123456
Link layer: InfiniBand
=== INFINIBAND LINK ===
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:1070:fd03:0012:3456
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
=== STARTING APPLICATION ===
[compute-node-42:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[compute-node-42:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[compute-node-42:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[compute-node-42:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]
=== NUMA POLICY ===
policy: bind
physcpubind: 8
membind: 1
=== CPU AFFINITY ===
pid 12345's current affinity list: 0
pid 12346's current affinity list: 1
pid 12347's current affinity list: 8
pid 12348's current affinity list: 9
PID PSR COMMAND
12345 0 app
12346 1 app
12347 8 app
12348 9 app
=== NUMA MEMORY STATS ===
Per-node process memory usage (in MBs) for PID 12345 (app)
Node 0 Node 1 Total
--------------- --------------- ---------------
Huge 0.00 0.00 0.00
Heap 128.00 0.00 128.00
Stack 0.12 0.00 0.12
Private 5120.00 0.00 5120.00
---------------- --------------- --------------- ---------------
Total 5248.12 0.00 5248.12
=== GPU UTILISATION ===
index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB], temperature.gpu
0, 87 %, 45 %, 36864 MiB, 45056 MiB, 62
1, 0 %, 0 %, 0 MiB, 81920 MiB, 34
2, 92 %, 52 %, 42240 MiB, 39680 MiB, 65
3, 0 %, 0 %, 0 MiB, 81920 MiB, 35
=== GPU PROCESS LIST ===
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
12345, app, 00000000:07:00.0, 36864 MiB
12347, app, 00000000:48:00.0, 42240 MiB
=== INFINIBAND COUNTERS ===
# Port extended counters: Lid 1 port 1 (CapMask: 0x5300)
PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678
PortUnicastXmitPkts:..............45678900
PortUnicastRcvPkts:...............43215666
PortMulticastXmitPkts:............12
PortMulticastRcvPkts:.............12
=== COMPLETE ===
Runtime (s): 47
This single log file captures everything you need to verify correct infrastructure alignment across CPU, memory, GPU, and network fabric. The following sections explain how to interpret each part of this output.
Interpreting the Diagnostic Output
Each section of the output tells you something specific about how your workload is interacting with the underlying hardware. Here’s how to read each one.
NUMA Binding (numactl --show)
Good output:
policy: bind
physcpubind: 8
membind: 1
This confirms that the process is pinned to CPU 8 and all memory allocations are restricted to NUMA node 1.
Bad output:
policy: default
physcpubind: 8
membind: 0 1
Memory is being allocated across multiple NUMA nodes, resulting in cross-socket access, higher latency, and unstable performance.
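To reproduce this check, one approach (a sketch; the binary name ./app and the CPU and node numbers are illustrative) is to launch the workload under numactl so the child prints the policy it inherits, or to read the policy of a running process from /proc:
# Pin to CPU 8 and NUMA node 1, printing the inherited policy before exec'ing the app
numactl --physcpubind=8 --membind=1 bash -c 'numactl --show; exec ./app'

# Inspect an already-running process via the kernel's view of its allowed CPUs and nodes
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/$(pgrep -n app)/status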
NUMA Memory Locality (numastat -p)
Good output:
Per-node process memory usage (MB)
Node 0: 0
Node 1: 10240
All memory usage is local to the NUMA node where the process is running. This is the expected and optimal behaviour.
Bad output:
Per-node process memory usage (MB)
Node 0: 4096
Node 1: 6144
Memory is split across NUMA nodes. This commonly leads to unpredictable runtimes, MPI slowdowns, and reduced GPU efficiency.
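To capture this for your own workload (a sketch; the process name app is an assumption, adjust to match your binary), point numastat at the PID and re-run it periodically:
# Per-NUMA-node memory breakdown for a running process
numastat -p "$(pgrep -n app)"

# Re-sample every 5 seconds to catch memory drifting onto the remote node over time
watch -n 5 numastat -p "$(pgrep -n app)"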
CPU Affinity (ps / taskset)
Good output:
PID PSR COMMAND
1234 8 app
pid 1234's current affinity list: 8
The process remains on the intended CPU and does not migrate between cores. Cache locality is preserved.
Bad output:
PID PSR COMMAND
1234 3 app
pid 1234's current affinity list: 0-15
The affinity list covers every core (0-15), so nothing is pinning the process and the scheduler has migrated it to CPU 3. This usually indicates missing or ineffective CPU binding.
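Both views can be collected with standard tools (a sketch; the process name app and target CPU 8 are illustrative):
PID=$(pgrep -n app)
taskset -cp "${PID}"             # the CPUs the process is allowed to run on
ps -o pid,psr,comm -p "${PID}"   # PSR = the CPU it is executing on right now

# Re-pin a floating process after the fact
taskset -cp 8 "${PID}"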
GPU-NUMA Affinity (nvidia-smi topo -m)
Good output:
GPU0 GPU1 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV12 PIX 0-7 0
GPU1 NV12 X SYS 0-7 0
mlx5_0 PIX SYS X
This shows GPU0 and the InfiniBand adapter (mlx5_0) sit behind the same PCIe switch (PIX), so GPU-to-network transfers (for example via GPUDirect RDMA) can bypass host memory and the CPU interconnect. Both GPUs are local to NUMA node 0.
Bad output:
GPU0 GPU1 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X SYS SYS 0-7 0
GPU1 SYS X SYS 8-15 1
mlx5_0 SYS SYS X
All devices are connected via SYS (the socket-to-socket interconnect, e.g. QPI/UPI), meaning every GPU-to-GPU and GPU-to-network transfer must traverse the CPU interconnect. This adds latency and consumes memory bandwidth.
Key topology indicators:
- NV# — NVLink connection (fastest GPU-to-GPU path)
- PIX — At most a single PCIe bridge between devices (fast, allows CPU bypass)
- PXB — Multiple PCIe bridges, without crossing a host bridge (good)
- SYS — Crosses the CPU socket interconnect (QPI/UPI; slowest, avoid for latency-sensitive workloads)
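The matrix itself comes straight from nvidia-smi, and the NUMA placement of an individual device can be cross-checked through sysfs (a sketch; the PCI address is the one from the example log above):
# Full topology matrix, including NICs
nvidia-smi topo -m

# NUMA node of a specific GPU via its PCI address (-1 means no NUMA information exposed)
cat /sys/bus/pci/devices/0000:07:00.0/numa_node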
GPU Utilisation (nvidia-smi)
Good output:
index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 95 %, 72000 MiB, 68
GPU is highly utilised, memory is well allocated, and temperature is within operating range. The workload is GPU-bound as expected.
Bad output:
index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 12 %, 8000 MiB, 42
Low GPU utilisation with minimal memory usage suggests the workload is CPU-bound or waiting on I/O. Check for data loading bottlenecks, CPU preprocessing stalls, or incorrect batch sizes.
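Rather than a single snapshot, sampling utilisation over the life of the run makes CPU-bound or I/O-bound phases obvious (a sketch; the 5-second interval is arbitrary):
# Sample utilisation, memory, and temperature every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.free,temperature.gpu --format=csv -l 5

# Confirm which processes are actually resident on each GPU
nvidia-smi --query-compute-apps=pid,process_name,gpu_bus_id,used_gpu_memory --format=csv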
InfiniBand Status (ibstat / ibstatus)
Good output:
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
The InfiniBand port is active, physically connected, and running at expected speed (200 Gb/s HDR).
Bad output:
Port 1:
State: Down
Physical state: Polling
Rate: 10
The port is not connected or is negotiating at a much lower speed. Check cables, switch configuration, and subnet manager status.
Common link states:
- Active / LinkUp — Normal operation
- Init / LinkUp — Waiting for subnet manager
- Down / Polling — No physical connection or cable fault
- Armed — Link trained but not yet activated
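Both views come from the standard InfiniBand userspace tools (a sketch; mlx5_0 and port 1 match the example node):
ibstat mlx5_0 1        # detailed CA and port state, firmware version, GUIDs
ibstatus mlx5_0        # compact per-port state, physical state, and rate
ibv_devinfo -d mlx5_0  # verbs-level view, useful when RDMA libraries misbehave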
InfiniBand Counters (perfquery)
Good output:
PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678
Data is flowing in both directions with balanced transmit and receive counts.
Bad output:
PortXmitData:.....................124587623456
PortRcvData:......................0
SymbolErrorCounter:...............4521
LinkDownedCounter:................12
Zero receive data with symbol errors and link-down events indicates cable or transceiver problems. Physical layer inspection is required.
Key counters to watch:
- SymbolErrorCounter — Bit errors on the wire (should be 0)
- LinkDownedCounter — Link reset events (should be 0 during operation)
- PortRcvErrors — Malformed packets received
- PortXmitDiscards — Packets dropped due to congestion
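perfquery reads these from the port; the 64-bit extended data counters and the error counters come from separate queries (a sketch; LID 1 and port 1 match the example log):
perfquery -x 1 1   # extended 64-bit data/packet counters for LID 1, port 1
perfquery 1 1      # standard counters, including SymbolErrorCounter and LinkDownedCounter
perfquery -R 1 1   # reset counters after reading so the next run starts from zero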
MPI Rank Binding (--report-bindings)
Good output:
[node:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[node:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[node:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[node:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]
Each MPI rank is bound to a specific core, distributed evenly across NUMA nodes. The B indicates where each rank is pinned.
Bad output:
[node:12345] MCW rank 0 is not bound (or bound to all available processors)
[node:12346] MCW rank 1 is not bound (or bound to all available processors)
[node:12347] MCW rank 2 is not bound (or bound to all available processors)
[node:12348] MCW rank 3 is not bound (or bound to all available processors)
MPI ranks are floating across all CPUs. This causes cache thrashing, cross-NUMA memory access, and inconsistent performance. Add --bind-to core to your mpirun command.
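With Open MPI, which produces the bracketed binding report shown above, explicit mapping and binding flags give the good pattern (a sketch; the rank count and mapping policy are illustrative):
# Two ranks per NUMA node, each pinned to its own core, with the binding report printed
mpirun -np 4 --map-by ppr:2:numa --bind-to core --report-bindings ./app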
Diagnosing the Root Cause
By comparing good and bad outputs, we can narrow down the root cause:
- Cross-NUMA memory allocation — indicates locality problems, often caused by missing --membind or memory allocated before binding was applied
- CPU migration — points to missing or overridden affinity, commonly from scheduler interference or missing --physcpubind
- Low GPU utilisation — suggests CPU bottleneck, data loading stalls, or incorrect CUDA device selection
- GPU-NUMA mismatch — process running on wrong NUMA node relative to GPU, causing PCIe traffic to cross CPU socket
- SYS topology between GPU and NIC — GPU-direct RDMA will underperform; consider workload placement or hardware topology changes
- InfiniBand errors — physical layer problems requiring cable, transceiver, or switch port inspection
- Unbound MPI ranks — missing binding flags causing rank migration and cache invalidation
- High runtime variance — usually correlates with topology misalignment and can be confirmed by checking the above metrics across multiple runs
This comparison-driven approach removes guesswork and makes infrastructure-level issues easy to identify and prove.
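For the runtime-variance point in particular, a simple repeat-run loop makes it easy to see whether variance tracks the binding and topology signals (a sketch; the run count and binary name are illustrative):
# Run the workload several times and record wall-clock runtime for variance analysis
for i in 1 2 3 4 5; do
  start=$(date +%s)
  ./app > /dev/null
  echo "run ${i}: $(( $(date +%s) - start )) s" >> runtimes.log
done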
My Thoughts
When running HPC systems, a single symptom is rarely enough to diagnose a problem; you need information from every layer to figure out where it actually lies.
Collecting CPU placement, NUMA locality, memory allocation, GPU topology, InfiniBand status, and MPI binding together allows you to methodically narrow down the root cause instead of guessing.
When these signals line up, performance is predictable and consistent. When they do not, the logs will usually tell you exactly what is wrong.
