{"id":2100,"date":"2025-10-09T06:02:09","date_gmt":"2025-10-09T06:02:09","guid":{"rendered":"https:\/\/www.nicktailor.com\/?p=2100"},"modified":"2025-10-09T06:05:18","modified_gmt":"2025-10-09T06:05:18","slug":"a-practical-repeatable-workflow-for-nvidia-gpu-linux-clusters-slurm-k8s-or-bare-metal-to-pinpoint-whether-your-bottleneck-is-gpu-cpu-memory-bandwidth-or-network","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/a-practical-repeatable-workflow-for-nvidia-gpu-linux-clusters-slurm-k8s-or-bare-metal-to-pinpoint-whether-your-bottleneck-is-gpu-cpu-memory-bandwidth-or-network\/","title":{"rendered":"A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm\/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network"},"content":{"rendered":"\n<div style=\"font-family: system-ui, -apple-system, Segoe UI, Roboto, Helvetica, Arial, sans-serif; line-height:1.6; color:#0f172a;\">\n  <h1 style=\"margin:0 0 0.6em;\">Profiling Playbook: Detect GPU\/CPU, Memory Bandwidth, and Network Bottlenecks<\/h1>\n  <p style=\"margin:0 0 1.2em;\">A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm\/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network.<\/p>\n\n  <style>\n    \/* Inline component styles (safe for WP) *\/\n    .pp-section h2{margin:1.6em 0 0.6em;font-size:1.4em}\n    .pp-section h3{margin:1.2em 0 0.5em;font-size:1.15em}\n    .pp-note{background:#f8fafc;border:1px solid #e2e8f0;padding:0.8em 1em;border-radius:8px;margin:1em 0}\n    .pp-code{background:#0b1220;color:#e5e7eb;border-radius:10px;padding:1em;overflow:auto;border:1px solid #0f172a33}\n    .pp-code code,.pp-code pre{font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, \"Liberation Mono\", monospace; font-size:0.9em; white-space:pre}\n    .pp-ul{margin:0.4em 0 1em 1.2em}\n    .pp-ul li{margin:0.25em 0}\n    .pp-table{width:100%;border-collapse:collapse;margin:1em 0;border:1px solid #e5e7eb}\n    .pp-table th,.pp-table td{border:1px solid #e5e7eb;padding:0.6em 0.7em;vertical-align:top}\n    .pp-table th{background:#f1f5f9;text-align:left}\n    .pp-kbd{background:#e2e8f0;border-radius:4px;padding:0.05em 0.35em;font-family:ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;font-size:0.9em}\n    .pp-small{color:#475569;font-size:0.95em}\n  <\/style>\n\n  <div class=\"pp-section\">\n    <h2>0) Prep: Make the Test Reproducible<\/h2>\n    <ul class=\"pp-ul\">\n      <li><strong>Choose a workload<\/strong>: (a) your real training\/inference job, plus (b) a couple of microbenchmarks.<\/li>\n      <li><strong>Pin placement\/affinity<\/strong>: match production (same container, CUDA\/cuDNN, drivers, env vars, GPU\/CPU affinity).<\/li>\n      <li><strong>Record node info<\/strong>: driver, CUDA, GPU model, CPU model, NUMA, NIC, topology.<\/li>\n    <\/ul>\n    <div class=\"pp-code\"><pre><code>nvidia-smi; nvidia-smi topo -m\nlscpu; numactl --hardware<\/code><\/pre><\/div>\n  <\/div>\n\n  <div class=\"pp-section\">\n    <h2>1) GPU Profiling (Utilization, Kernels, Memory, Interconnect)<\/h2>\n    <h3>Quick Live View (low overhead)<\/h3>\n    <div class=\"pp-code\"><pre><code># 1s sampling: Power (p) Util (u) Clocks (c) Mem util (v) Enc\/Dec (e) PCIe\/NVLink (t)\nnvidia-smi dmon -s pucvmet\n\n# More fields, CSV:\nnvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,clocks.sm,clocks.mem,power.draw,temperature.gpu,pcie.link.gen.current,pcie.link.width.current,clocks_throttle_reasons.active 
  </div>

  <div class="pp-section">
    <h2>1) GPU Profiling (Utilization, Kernels, Memory, Interconnect)</h2>
    <h3>Quick Live View (low overhead)</h3>
    <div class="pp-code"><pre><code># 1s sampling: power/temp (p), utilization (u), clocks (c), violations (v),
# memory (m), ECC/PCIe errors (e), PCIe throughput (t)
nvidia-smi dmon -s pucvmet

# More fields, CSV:
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,clocks.sm,clocks.mem,power.draw,temperature.gpu,pcie.link.gen.current,pcie.link.width.current,clocks_throttle_reasons.active --format=csv -l 1</code></pre></div>
    <div class="pp-note">
      <strong>What to notice</strong>
      <ul class="pp-ul">
        <li><span class="pp-kbd">utilization.gpu</span> ~ 0–40% while the job is “busy” → likely <strong>CPU or input (I/O) bound</strong>.</li>
        <li>High <em>memory util</em> + low SM util → <strong>global memory bandwidth bound</strong>.</li>
        <li>Power below expected / throttling active → power/thermal cap or app clocks.</li>
        <li>PCIe gen/width lower than expected → host-device transfer bottleneck.</li>
      </ul>
    </div>

    <h3>Deep Timeline (Nsight Systems → find <em>where</em> time is spent)</h3>
    <div class="pp-code"><pre><code>nsys profile -t cuda,osrt,nvtx,mpi --sample=process-tree -o /tmp/trace \
    --export=sqlite python train.py
# Open /tmp/trace.nsys-rep (.qdrep on older versions) in the Nsight Systems GUI,
# or analyze the sqlite export</code></pre></div>
    <div class="pp-note">
      <strong>Look for</strong>:
      <ul class="pp-ul">
        <li>Long CPU gaps before kernels → <strong>dataloader/CPU stall</strong>.</li>
        <li>CUDA memcpy / NCCL all-reduce dominating → <strong>I/O or network bottleneck</strong>.</li>
        <li>Many short kernels with gaps → <strong>kernel launch overhead</strong> (try CUDA Graphs).</li>
      </ul>
    </div>

    <h3>Kernel Efficiency (Nsight Compute → <em>why</em> the GPU is slow)</h3>
    <div class="pp-code"><pre><code>ncu --set full --target-processes all -o /tmp/ncu python train.py
# Then: ncu --import /tmp/ncu.ncu-rep --csv --page summary</code></pre></div>
    <div class="pp-note">
      <strong>Signals</strong>:
      <ul class="pp-ul">
        <li>Low achieved SM occupancy &amp; high <em>dram__throughput</em> relative to arithmetic intensity → <strong>memory-bound kernels</strong>.</li>
        <li>Heavy barrier/serialization stalls → reformulate kernels or change backend.</li>
      </ul>
    </div>

    <h3>NVLink / PCIe Health</h3>
    <div class="pp-code"><pre><code># NVLink status/counters (NVLink-equipped GPUs, e.g. A100/NVSwitch)
nvidia-smi nvlink -s
# Topology sanity:
nvidia-smi topo -m</code></pre></div>
    <p class="pp-small">If inter-GPU traffic stalls or retry errors climb, expect <strong>intra-node comms bottlenecks</strong>.</p>
  </div>

  <div class="pp-section">
    <h2>2) CPU &amp; Memory-Bandwidth Profiling (Host Side)</h2>
    <h3>Fast CPU View</h3>
    <div class="pp-code"><pre><code>mpstat -P ALL 1
pidstat -u -r -d 1 -p $(pgrep -n python)   # CPU, RSS, I/O per PID</code></pre></div>
    <p class="pp-small">
      High CPU% &amp; long run queue + idle GPU → <strong>CPU compute bound</strong> (augmentations, tokenization).<br />
      Low CPU% &amp; waiting on I/O + idle GPU → <strong>storage or network input bottleneck</strong>.
    </p>

    <h3>NUMA Locality (critical for feeders/data loaders)</h3>
    <div class="pp-code"><pre><code>numactl -s
numastat -p $(pgrep -n python)  # per-node memory footprint: local vs remote allocations</code></pre></div>
    <p class="pp-small">Lots of memory on <strong>remote</strong> nodes → pin processes to the closest NUMA node; bind NIC/GPU affinity. A binding sketch follows below.</p>
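    <p class="pp-small">For example, if <span class="pp-kbd">nvidia-smi topo -m</span> reports that your GPU and NIC hang off NUMA node 0, a minimal binding sketch (the node ID is illustrative; read the real one from the NUMA affinity column of your own topology output):</p>
    <div class="pp-code"><pre><code># Bind CPU scheduling and memory allocation to NUMA node 0 (illustrative)
numactl --cpunodebind=0 --membind=0 python train.py

# Verify where the process's memory actually landed
numastat -p $(pgrep -n python)</code></pre></div>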
    <h3>Hardware Counters (perf) &amp; Memory Bandwidth</h3>
    <div class="pp-code"><pre><code># Whole-process counters
perf stat -d -p $(pgrep -n python) -- sleep 30

# Hotspots (then open the interactive report)
perf record -F 99 -g -p $(pgrep -n python) -- sleep 30
perf report</code></pre></div>
    <p class="pp-small">Low IPC + many L3/memory stalls → <strong>memory bandwidth bound</strong> on the CPU. Validate with STREAM / Intel PCM:</p>
    <div class="pp-code"><pre><code># STREAM (approximate host RAM BW)
stream
# Intel PCM memory (Intel CPUs)
pcm-memory 1</code></pre></div>
  </div>

  <div class="pp-section">
    <h2>3) Network Throughput/Latency (Intra- &amp; Inter-node)</h2>
    <h3>Raw NIC Performance</h3>
    <div class="pp-code"><pre><code># TCP test (adjust -P for parallel flows)
iperf3 -s   # on the server
iperf3 -c &lt;server&gt; -P 8 -t 30
# For UDP or specific MTU/jumbo frames: use -u and set the MTU via ip link/ethtool</code></pre></div>
    <p class="pp-small">Compare results to the NIC line rate (e.g., 100/200/400GbE).</p>

    <h3>RDMA / InfiniBand (if applicable)</h3>
    <div class="pp-code"><pre><code>ibstat; ibv_devinfo
ib_write_bw -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30
ib_send_bw  -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30</code></pre></div>
    <p class="pp-small">If RDMA BW/latency is poor, check <strong>PFC/ECN</strong>, RoCE config, and <strong>MTU 9000</strong> end-to-end.</p>

    <h3>Collective (NCCL) Reality Check</h3>
    <div class="pp-code"><pre><code># From nccl-tests (build once)
./build/all_reduce_perf -b 8M -e 1G -f 2 -g 8   # intra-node
# Multi-node (via mpirun or torchrun): see the sketch below</code></pre></div>
    <p class="pp-small">Throughput far below expectation → network path/topology, or NCCL env (e.g., <span class="pp-kbd">NCCL_IB_*</span>, <span class="pp-kbd">NCCL_NET_GDR_LEVEL</span>, CollNet/NVLS).</p>
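    <p class="pp-small">For the multi-node case, a minimal sketch using mpirun with nccl-tests built against MPI (hostnames and slot counts are placeholders; one rank per GPU is the usual layout):</p>
    <div class="pp-code"><pre><code># 2 nodes x 8 GPUs, one rank per GPU (-g 1); node01/node02 are placeholders
mpirun -np 16 -H node01:8,node02:8 \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8M -e 1G -f 2 -g 1
# Compare the reported busbw against the intra-node run and your fabric's line rate</code></pre></div>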
    <h3>NIC Counters / Driver</h3>
    <div class="pp-code"><pre><code>ethtool -S &lt;iface&gt; | egrep "err|drop|disc|pause"
ethtool -k &lt;iface&gt;   # offloads; ensure GRO/LRO settings suit your stack</code></pre></div>
    <p class="pp-small">Growing errors/pause frames → congestion, bad optics, or flow-control tuning.</p>
  </div>

  <div class="pp-section">
    <h2>4) Tie It Together with a Roofline View</h2>
    <p>Plotting compute intensity (FLOPs/byte) against achieved bandwidth quickly classifies kernels as memory-bound vs compute-bound. Use Nsight Compute’s roofline page for kernels; for end-to-end runs, annotate steps with NVTX and view them in Nsight Systems.</p>
  </div>

  <div class="pp-section">
    <h2>5) Microbenchmarks to Isolate Layers</h2>
    <ul class="pp-ul">
      <li><strong>GPU math</strong>: HPL/HPL-AI, a cuBLAS GEMM runner, <em>nvidia/cuda-samples</em> (matrixMulCUBLAS).</li>
      <li><strong>Host RAM BW</strong>: STREAM.</li>
      <li><strong>Disk I/O</strong>: <span class="pp-kbd">fio</span> (sequential vs random, queue depth; see the sketch after this list).</li>
      <li><strong>Network</strong>: <span class="pp-kbd">iperf3</span>, <span class="pp-kbd">ib_*_bw</span>, NCCL tests.</li>
    </ul>
    <p class="pp-small">If the microbenchmarks are fine but the real job isn’t, the issue is the <strong>software pipeline</strong> (dataloader, preprocessing, small batch, Python GIL, etc.).</p>
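    <p class="pp-small">A minimal fio sketch for the disk I/O item; the file path, size, and queue depths are assumptions to adapt to your storage:</p>
    <div class="pp-code"><pre><code># Sequential 1M reads with direct I/O — approximates dataloader streaming
fio --name=seqread --filename=/data/fio.test --rw=read --bs=1M --size=8G \
    --ioengine=libaio --iodepth=16 --direct=1 --runtime=30 --time_based \
    --group_reporting

# Random 4k reads — stresses IOPS/metadata instead of bandwidth
fio --name=randread --filename=/data/fio.test --rw=randread --bs=4k --size=8G \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=30 --time_based \
    --group_reporting</code></pre></div>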
  </div>

  <div class="pp-section">
    <h2>6) Common Bottlenecks → Fixes</h2>
    <table class="pp-table">
      <thead>
        <tr>
          <th>Symptom</th>
          <th>Likely Bottleneck</th>
          <th>Quick Fixes</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>GPU util low, CPU busy</td>
          <td><strong>CPU pipeline</strong></td>
          <td>Increase workers/prefetch, move augmentation to the GPU (DALI), compile ops, pin threads/NUMA.</td>
        </tr>
        <tr>
          <td>High GPU mem util, SM low</td>
          <td><strong>GPU mem-bound</strong></td>
          <td>Fuse kernels, better tensor layouts, mixed precision (bf16/fp16), larger batch if headroom allows.</td>
        </tr>
        <tr>
          <td>NCCL all-reduce dominates</td>
          <td><strong>Network</strong></td>
          <td>Enable RDMA, tune the NCCL env, jumbo MTU 9000, keep ranks on the same switch tier, test CollNet/NVLS.</td>
        </tr>
        <tr>
          <td>memcpy HtoD heavy</td>
          <td><strong>PCIe/host I/O</strong></td>
          <td>Page-locked buffers, async prefetch, deeper batch queue, ensure max PCIe gen/width.</td>
        </tr>
        <tr>
          <td>Frequent GPU throttling</td>
          <td><strong>Power/Thermal</strong></td>
          <td>Raise the power limit (if safe), fix cooling, set application clocks, check throttle reasons.</td>
        </tr>
        <tr>
          <td>Remote NUMA hits high</td>
          <td><strong>NUMA</strong></td>
          <td>Bind processes to the NUMA node local to the GPU/NIC; interleave only where it helps.</td>
        </tr>
      </tbody>
    </table>
  </div>

  <div class="pp-section">
    <h2>7) Optional: One-Node Sampler Script</h2>
    <p>Paste into <span class="pp-kbd">profile.sh</span> and run <span class="pp-kbd">bash profile.sh python train.py</span>.</p>
    <div class="pp-code"><pre><code>#!/usr/bin/env bash
set -euo pipefail
APP=("$@")  # e.g., python train.py

echo "== System =="
nvidia-smi --query-gpu=name,uuid,driver_version,pstate,pcie.link.gen.current,pcie.link.width.current --format=csv
lscpu | egrep 'Model name|Socket|NUMA|Thread|MHz'
echo

echo "== Start background samplers =="
nvidia-smi dmon -s pucvmet -d 1 &gt; /tmp/gpu_dmon.log &amp;
GPU_DMON_PID=$!
pidstat -u -r -d 1 &gt; /tmp/pidstat.log &amp;
PIDSTAT_PID=$!

echo "== Run workload =="
"${APP[@]}" || true

echo "== Cleanup =="
kill $GPU_DMON_PID $PIDSTAT_PID 2&gt;/dev/null || true

echo "== Summaries =="
head /tmp/gpu_dmon.log        # column headers + first samples
tail -n 20 /tmp/gpu_dmon.log  # last samples
tail -n 20 /tmp/pidstat.log</code></pre></div>
  </div>

  <div class="pp-section">
    <h2>8) HPE-Specific Checks (If Relevant)</h2>
    <ul class="pp-ul">
      <li><strong>HPE iLO/OneView</strong>: check thermal/power capping, fan curves, PSU headroom.</li>
      <li><strong>HPE Performance Cluster Manager / Cray</strong>: use the built-in telemetry and fabric diagnostics.</li>
      <li><strong>BIOS</strong>: Performance power profile, NUMA exposed, deterministic turbo, PCIe <strong>Gen4/Gen5</strong>, Above 4G decoding on, SR-IOV/ATS if virtualized.</li>
    </ul>
  </div>

  <div class="pp-note">
    <strong>Need a tailored version?</strong> Tell me your GPU model(s), CPUs, NIC/fabric, batch size/model, and orchestration (Slurm/K8s). I can generate a vendor-ready checklist and a Slurm job that auto-collects Nsight &amp; NCCL traces; a bare-bones starting point follows below.
  </div>
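  <div class="pp-section">
    <p class="pp-small">If you run under Slurm, a minimal sketch of such a wrapper job (GPU counts, paths, and time limit are placeholders to adapt):</p>
    <div class="pp-code"><pre><code>#!/usr/bin/env bash
#SBATCH --job-name=profile-run
#SBATCH --nodes=1
#SBATCH --gres=gpu:8             # placeholder: match your node type
#SBATCH --time=01:00:00

# One Nsight Systems trace per job; profile.sh (section 7) wraps the workload
# with the background samplers
srun nsys profile -t cuda,osrt,nvtx --sample=process-tree \
    -o "/tmp/trace-${SLURM_JOB_ID}" \
    bash profile.sh python train.py</code></pre></div>
  </div>
</div>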