{"id":2198,"date":"2026-01-17T07:37:05","date_gmt":"2026-01-17T07:37:05","guid":{"rendered":"https:\/\/nicktailor.com\/tech-blog\/?p=2198"},"modified":"2026-01-17T07:38:41","modified_gmt":"2026-01-17T07:38:41","slug":"building-a-reusable-hpc-diagnostic-harness-for-numa-cpu-gpu-mpi-infiniband","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/building-a-reusable-hpc-diagnostic-harness-for-numa-cpu-gpu-mpi-infiniband\/","title":{"rendered":"Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI &amp; InfiniBand"},"content":{"rendered":"\n<article>\n\n<p>\nWhen operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor.\nThey are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation,\naccelerator topology, or network fabric behaviour.\n<\/p>\n\n<p>\nThis post walks through how to build a <strong>reusable diagnostic harness<\/strong> that allows you to\nmethodically inspect these layers, collect evidence, and identify the real source of performance problems.\n<\/p>\n\n<hr \/>\n\n<h2>Why Diagnostics Matter in HPC Environments<\/h2>\n\n<p>\nModern HPC systems are complex.\nSchedulers manage CPU ownership, operating systems handle memory allocation,\napplications introduce their own behaviour, and accelerators depend heavily on topology.\n<\/p>\n\n<p>\nWithout proper diagnostics, it is easy to misattribute performance problems to applications,\nwhen the real issue lies in infrastructure alignment.\n<\/p>\n\n<hr \/>\n\n<h2>Design Goal<\/h2>\n\n<p>\nThe goal is simple:\n<\/p>\n\n<blockquote>\nOne reusable script where you update a small set of variables, plug in any workload,\nand receive a complete diagnostic log.\n<\/blockquote>\n\n<p>\nHere&#8217;s how we achieve that.\n<\/p>\n\n<hr \/>\n\n<h2>Reusable HPC Diagnostic Wrapper<\/h2>\n\n<p>\nBelow is a diagnostic wrapper script that can be reused across different workloads.\nOnly the variables at 
the top need to be changed.\n<\/p>\n\n<p><strong>The script is only available to clients who hire me through my limited company at this time.<\/strong><\/p>\n\n<hr \/>\n\n<h2>Example Script Output<\/h2>\n\n<p>\nWhen you run the diagnostic wrapper on a multi-NUMA HPC node with GPUs and InfiniBand, the complete output looks like this:\n<\/p>\n\n<pre><code>=== HPC DIAGNOSTIC RUN ===\nHost      : compute-node-42\nTimestamp : Sat Jan 17 14:32:01 UTC 2026\nCommand   : .\/app \n\n=== NUMA TOPOLOGY ===\nNUMA node(s):          2\nNUMA node0 CPU(s):     0-7\nNUMA node1 CPU(s):     8-15\n\nnumactl 2.0.14\n\n=== GPU TOPOLOGY ===\nindex, name, pci.bus_id, memory.total [MiB]\n0, NVIDIA A100-SXM4-80GB, 00000000:07:00.0, 81920 MiB\n1, NVIDIA A100-SXM4-80GB, 00000000:0B:00.0, 81920 MiB\n2, NVIDIA A100-SXM4-80GB, 00000000:48:00.0, 81920 MiB\n3, NVIDIA A100-SXM4-80GB, 00000000:4C:00.0, 81920 MiB\n\n=== GPU-NUMA AFFINITY ===\n        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity\nGPU0     X      NV12    SYS     SYS     PIX     0-7             0\nGPU1    NV12     X      SYS     SYS     SYS     0-7             0\nGPU2    SYS     SYS      X      NV12    SYS     8-15            1\nGPU3    SYS     SYS     NV12     X      SYS     8-15            1\nmlx5_0  PIX     SYS     SYS     SYS      X\n\n=== INFINIBAND STATUS ===\nCA 'mlx5_0'\n    CA type: MT4123\n    Number of ports: 1\n    Firmware version: 20.31.1014\n    Hardware version: 0\n    Node GUID: 0x1070fd0300123456\n    System image GUID: 0x1070fd0300123456\n    Port 1:\n        State: Active\n        Physical state: LinkUp\n        Rate: 200\n        Base lid: 1\n        LMC: 0\n        SM lid: 1\n        Capability mask: 0x2651e848\n        Port GUID: 0x1070fd0300123456\n        Link layer: InfiniBand\n\n=== INFINIBAND LINK ===\nInfiniband device 'mlx5_0' port 1 status:\n    default gid:    fe80:0000:0000:0000:1070:fd03:0012:3456\n    base lid:       0x1\n    sm lid:         0x1\n    state:          4: ACTIVE\n    phys state:     
5: LinkUp\n    rate:           200 Gb\/sec (4X HDR)\n    link_layer:     InfiniBand\n\n=== STARTING APPLICATION ===\n[compute-node-42:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B\/.\/.\/.\/.\/.\/.\/.][.\/.\/.\/.\/.\/.\/.\/.]\n[compute-node-42:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [.\/B\/.\/.\/.\/.\/.\/.][.\/.\/.\/.\/.\/.\/.\/.]\n[compute-node-42:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [.\/.\/.\/.\/.\/.\/.\/.][B\/.\/.\/.\/.\/.\/.\/.]\n[compute-node-42:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [.\/.\/.\/.\/.\/.\/.\/.][.\/B\/.\/.\/.\/.\/.\/.]\n\n=== NUMA POLICY ===\npolicy: bind\nphyscpubind: 8\nmembind: 1\n\n=== CPU AFFINITY ===\npid 12345's current affinity list: 0\npid 12346's current affinity list: 1\npid 12347's current affinity list: 8\npid 12348's current affinity list: 9\n  PID PSR COMMAND\n12345   0 app\n12346   1 app\n12347   8 app\n12348   9 app\n\n=== NUMA MEMORY STATS ===\nPer-node process memory usage (in MBs) for PID 12345 (app)\n                           Node 0          Node 1           Total\n                  --------------- --------------- ---------------\nHuge                         0.00            0.00            0.00\nHeap                       128.00            0.00          128.00\nStack                        0.12            0.00            0.12\nPrivate                   5120.00            0.00         5120.00\n----------------  --------------- --------------- ---------------\nTotal                     5248.12            0.00         5248.12\n\n=== GPU UTILISATION ===\nindex, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB], temperature.gpu\n0, 87 %, 45 %, 36864 MiB, 45056 MiB, 62\n1, 0 %, 0 %, 0 MiB, 81920 MiB, 34\n2, 92 %, 52 %, 42240 MiB, 39680 MiB, 65\n3, 0 %, 0 %, 0 MiB, 81920 MiB, 35\n\n=== GPU PROCESS LIST ===\npid, process_name, gpu_bus_id, used_gpu_memory [MiB]\n12345, app, 00000000:07:00.0, 36864 MiB\n12347, app, 00000000:48:00.0, 42240 MiB\n\n=== 
INFINIBAND COUNTERS ===\n# Port extended counters: Lid 1 port 1 (CapMask: 0x5300)\nPortXmitData:.....................124587623456\nPortRcvData:......................118745236789\nPortXmitPkts:.....................45678912\nPortRcvPkts:......................43215678\nPortUnicastXmitPkts:..............45678900\nPortUnicastRcvPkts:...............43215666\nPortMulticastXmitPkts:............12\nPortMulticastRcvPkts:.............12\n\n=== COMPLETE ===\nRuntime (s): 47\n<\/code><\/pre>\n\n<p>\nThis single log file captures everything you need to verify correct infrastructure alignment across CPU, memory, GPU, and network fabric.\nThe following sections explain how to interpret each part of this output.\n<\/p>\n\n<hr \/>\n\n<h2>Interpreting the Diagnostic Output<\/h2>\n\n<p>\nEach section of the output tells you something specific about how your workload is interacting\nwith the underlying hardware. Here&#8217;s how to read each one.\n<\/p>\n\n<h3>NUMA Binding (numactl --show)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>policy: bind\nphyscpubind: 8\nmembind: 1\n<\/code><\/pre>\n\n<p>\nThis confirms that the process is pinned to CPU 8 and all memory allocations are restricted\nto NUMA node 1.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>policy: default\nphyscpubind: 8\nmembind: 0 1\n<\/code><\/pre>\n\n<p>\nMemory is being allocated across multiple NUMA nodes, resulting in cross-socket access,\nhigher latency, and unstable performance.\n<\/p>\n\n<hr \/>\n\n<h3>NUMA Memory Locality (numastat -p)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>Per-node process memory usage (MB)\nNode 0:      0\nNode 1:  10240\n<\/code><\/pre>\n\n<p>\nAll memory usage is local to the NUMA node where the process is running.\nThis is the expected and optimal behaviour.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>Per-node process memory usage (MB)\nNode 0:   4096\nNode 1:   6144\n<\/code><\/pre>\n\n<p>\nMemory is split across NUMA 
nodes.\nThis commonly leads to unpredictable runtimes, MPI slowdowns, and reduced GPU efficiency.\n<\/p>\n\n<hr \/>\n\n<h3>CPU Affinity (ps \/ taskset)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>PID   PSR  COMMAND\n1234   8   app\n\npid 1234's current affinity list: 8\n<\/code><\/pre>\n\n<p>\nThe process remains on the intended CPU and does not migrate between cores.\nCache locality is preserved.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>PID   PSR  COMMAND\n1234   3   app\n\npid 1234's current affinity list: 0-15\n<\/code><\/pre>\n\n<p>\nThe process has migrated to a different CPU.\nThis usually indicates missing or ineffective CPU binding.\n<\/p>\n\n<hr \/>\n\n<h3>GPU-NUMA Affinity (nvidia-smi topo -m)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity\nGPU0     X      NV12    PIX     0-7             0\nGPU1    NV12     X      SYS     0-7             0\nmlx5_0  PIX     SYS      X\n<\/code><\/pre>\n\n<p>\nThis shows GPU0 and the InfiniBand adapter (mlx5_0) share the same PCIe switch (PIX), meaning\nGPU-to-network transfers bypass the CPU entirely. Both GPUs are local to NUMA node 0.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity\nGPU0     X      SYS     SYS     0-7             0\nGPU1    SYS      X      SYS     8-15            1\nmlx5_0  SYS     SYS      X\n<\/code><\/pre>\n\n<p>\nAll devices are connected via SYS (system\/QPI), meaning every GPU-to-GPU and GPU-to-network\ntransfer must traverse the CPU interconnect. 
This adds latency and consumes memory bandwidth.\n<\/p>\n\n<p>\nKey topology indicators:\n<\/p>\n<ul>\n    <li><strong>NV#<\/strong> \u2014 NVLink connection (fastest GPU-to-GPU)<\/li>\n    <li><strong>PIX<\/strong> \u2014 Same PCIe switch (fast, CPU-bypass)<\/li>\n    <li><strong>PXB<\/strong> \u2014 Same PCIe bridge (good)<\/li>\n    <li><strong>SYS<\/strong> \u2014 Crosses CPU\/QPI (slowest, avoid for latency-sensitive workloads)<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>GPU Utilisation (nvidia-smi)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n0, 95 %, 72000 MiB, 68\n<\/code><\/pre>\n\n<p>\nGPU is highly utilised, memory is well allocated, and temperature is within operating range.\nThe workload is GPU-bound as expected.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>index, utilization.gpu [%], memory.used [MiB], temperature.gpu\n0, 12 %, 8000 MiB, 42\n<\/code><\/pre>\n\n<p>\nLow GPU utilisation with minimal memory usage suggests the workload is CPU-bound or\nwaiting on I\/O. Check for data loading bottlenecks, CPU preprocessing stalls, or\nincorrect batch sizes.\n<\/p>\n\n<hr \/>\n\n<h3>InfiniBand Status (ibstat \/ ibstatus)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>Port 1:\n    State: Active\n    Physical state: LinkUp\n    Rate: 200\n<\/code><\/pre>\n\n<p>\nThe InfiniBand port is active, physically connected, and running at expected speed (200 Gb\/s HDR).\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>Port 1:\n    State: Down\n    Physical state: Polling\n    Rate: 10\n<\/code><\/pre>\n\n<p>\nThe port is not connected or is negotiating at a much lower speed. 
Check cables,\nswitch configuration, and subnet manager status.\n<\/p>\n\n<p>\nCommon link states:\n<\/p>\n<ul>\n    <li><strong>Active \/ LinkUp<\/strong> \u2014 Normal operation<\/li>\n    <li><strong>Init \/ LinkUp<\/strong> \u2014 Waiting for subnet manager<\/li>\n    <li><strong>Down \/ Polling<\/strong> \u2014 No physical connection or cable fault<\/li>\n    <li><strong>Armed<\/strong> \u2014 Link trained but not yet activated<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>InfiniBand Counters (perfquery)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>PortXmitData:.....................124587623456\nPortRcvData:......................118745236789\nPortXmitPkts:.....................45678912\nPortRcvPkts:......................43215678\n<\/code><\/pre>\n\n<p>\nData is flowing in both directions with balanced transmit and receive counts.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>PortXmitData:.....................124587623456\nPortRcvData:......................0\nSymbolErrorCounter:...............4521\nLinkDownedCounter:................12\n<\/code><\/pre>\n\n<p>\nZero receive data with symbol errors and link-down events indicates cable or\ntransceiver problems. 
Physical layer inspection is required.\n<\/p>\n\n<p>\nKey counters to watch:\n<\/p>\n<ul>\n    <li><strong>SymbolErrorCounter<\/strong> \u2014 Bit errors on the wire (should be 0)<\/li>\n    <li><strong>LinkDownedCounter<\/strong> \u2014 Link reset events (should be 0 during operation)<\/li>\n    <li><strong>PortRcvErrors<\/strong> \u2014 Malformed packets received<\/li>\n    <li><strong>PortXmitDiscards<\/strong> \u2014 Packets dropped due to congestion<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>MPI Rank Binding (--report-bindings)<\/h3>\n\n<p><strong>Good output:<\/strong><\/p>\n<pre><code>[node:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B\/.\/.\/.\/.\/.\/.\/.][.\/.\/.\/.\/.\/.\/.\/.]\n[node:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [.\/B\/.\/.\/.\/.\/.\/.][.\/.\/.\/.\/.\/.\/.\/.]\n[node:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [.\/.\/.\/.\/.\/.\/.\/.][B\/.\/.\/.\/.\/.\/.\/.]\n[node:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [.\/.\/.\/.\/.\/.\/.\/.][.\/B\/.\/.\/.\/.\/.\/.]\n<\/code><\/pre>\n\n<p>\nEach MPI rank is bound to a specific core, distributed evenly across NUMA nodes.\nThe <code>B<\/code> marks the core where each rank is pinned.\n<\/p>\n\n<p><strong>Bad output:<\/strong><\/p>\n<pre><code>[node:12345] MCW rank 0 is not bound (or bound to all available processors)\n[node:12346] MCW rank 1 is not bound (or bound to all available processors)\n[node:12347] MCW rank 2 is not bound (or bound to all available processors)\n[node:12348] MCW rank 3 is not bound (or bound to all available processors)\n<\/code><\/pre>\n\n<p>\nMPI ranks are floating across all CPUs. This causes cache thrashing, cross-NUMA\nmemory access, and inconsistent performance. 
Add <code>--bind-to core<\/code> to your\nmpirun command.\n<\/p>\n\n<hr \/>\n\n<h2>Diagnosing the Root Cause<\/h2>\n\n<p>\nBy comparing good and bad outputs, we can narrow down the root cause:\n<\/p>\n\n<ul>\n  <li><strong>Cross-NUMA memory allocation<\/strong> \u2014 indicates locality problems, often caused by missing <code>--membind<\/code> or memory allocated before binding was applied<\/li>\n  <li><strong>CPU migration<\/strong> \u2014 points to missing or overridden affinity, commonly from scheduler interference or missing <code>--physcpubind<\/code><\/li>\n  <li><strong>Low GPU utilisation<\/strong> \u2014 suggests CPU bottleneck, data loading stalls, or incorrect CUDA device selection<\/li>\n  <li><strong>GPU-NUMA mismatch<\/strong> \u2014 process running on wrong NUMA node relative to GPU, causing PCIe traffic to cross CPU socket<\/li>\n  <li><strong>SYS topology between GPU and NIC<\/strong> \u2014 GPU-direct RDMA will underperform; consider workload placement or hardware topology changes<\/li>\n  <li><strong>InfiniBand errors<\/strong> \u2014 physical layer problems requiring cable, transceiver, or switch port inspection<\/li>\n  <li><strong>Unbound MPI ranks<\/strong> \u2014 missing binding flags causing rank migration and cache invalidation<\/li>\n  <li><strong>High runtime variance<\/strong> \u2014 usually correlates with topology misalignment and can be confirmed by checking the above metrics across multiple runs<\/li>\n<\/ul>\n\n<p>\nThis comparison-driven approach removes guesswork and makes infrastructure-level\nissues easy to identify and prove.\n<\/p>\n\n<hr \/>\n\n<h2>My Thoughts<\/h2>\n\n<p>\nWhen running HPC systems, you need richer diagnostic information than application logs alone to figure out where a problem actually lies.\n<\/p>\n\n<p>\nCollecting CPU placement, NUMA locality, memory allocation, GPU topology, InfiniBand status,\nand MPI binding together allows you to methodically narrow down the root cause instead of guessing.\n<\/p>\n\n<p>\nWhen these signals 
line up, performance is predictable and consistent.\nWhen they do not, the logs will usually tell you exactly what is wrong.\n<\/p>\n\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour. This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and<a href=\"https:\/\/nicktailor.com\/tech-blog\/building-a-reusable-hpc-diagnostic-harness-for-numa-cpu-gpu-mpi-infiniband\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-2198","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2198","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=2198"}],"version-history":[{"count":4,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2198\/revisions"}],"predecessor-version":[{"id":2202,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2198\/revisions\/2202"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/media?parent=2198"}],"wp:term":[{"taxonomy":"category","embeddable":true,"
href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/categories?post=2198"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=2198"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}