Day: January 17, 2026

SLURM Accounting Setup; my personal notes

SLURM accounting tracks every job that runs on your cluster — who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing.

This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server separate from the controller.


Architecture Overview

In production, you separate the database from the controller for performance and reliability:

Controller Node        Database Node          Compute Nodes
───────────────        ─────────────          ─────────────
slurmctld              slurmdbd               slurmd
                       MariaDB/MySQL          slurmd
                                              slurmd
                                              ...

How it works:

  • slurmctld (scheduler) sends job data to slurmdbd
  • slurmdbd (database daemon) writes to MariaDB/MySQL
  • Compute nodes (slurmd) just run jobs — no database access

The controller never talks directly to the database. slurmdbd is the middleman that handles connection pooling, batches writes, and queues data if the database is temporarily unavailable.


Prerequisites

Before starting, ensure you have:

  • Working SLURM cluster (slurmctld on controller, slurmd on compute nodes)
  • Dedicated database server (can be VM or physical)
  • Network connectivity between controller and database server
  • Consistent SLURM user/group (UID/GID must match across all nodes)
  • Munge authentication working across all nodes
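
A quick way to sanity-check the last two items, assuming you can SSH to a compute node (node001 is just an example hostname):

# UID/GID must match on every node
id slurm
ssh node001 id slurm

# Munge round trip: encode locally, decode on the remote node
munge -n | ssh node001 unmunge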

Step 1: Install MariaDB on Database Server

On your dedicated database server:

# Install MariaDB
sudo apt update
sudo apt install mariadb-server mariadb-client -y

# Start and enable
sudo systemctl start mariadb
sudo systemctl enable mariadb

# Secure installation
sudo mysql_secure_installation

During secure installation:

  • Set root password
  • Remove anonymous users — Yes
  • Disallow root login remotely — Yes
  • Remove test database — Yes
  • Reload privilege tables — Yes

Step 2: Create SLURM Database and User

Log into MariaDB and create the database:

sudo mysql -u root -p
-- Create database
CREATE DATABASE slurm_acct_db;

-- Create slurm user for slurmdbd running locally on this database server
-- (the setup used in Steps 4-5 below)
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'your_secure_password';

-- Grant privileges
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';

-- If slurmdbd runs on the controller instead, grant access from there
-- CREATE USER 'slurm'@'controller.example.com' IDENTIFIED BY 'your_secure_password';
-- GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'controller.example.com';

FLUSH PRIVILEGES;
EXIT;
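
Before moving on, confirm the grant works. On the database server (the database stays empty until slurmdbd first starts and creates its tables):

mysql -u slurm -p slurm_acct_db -e "SHOW TABLES;"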

Step 3: Configure MariaDB for Remote Access

Edit the MariaDB configuration. The bind-address change below is only needed if slurmdbd runs on a different host from the database (for example, on the controller); with slurmdbd local to the database, as in this guide, it connects over localhost and bind-address can stay at 127.0.0.1. The performance settings further down are worth applying either way:

sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

Find and modify the bind-address:

# Change from
bind-address = 127.0.0.1

# To (listen on all interfaces)
bind-address = 0.0.0.0

# Or specific IP
bind-address = 192.168.1.10

Add performance settings for SLURM workload:

[mysqld]
bind-address = 0.0.0.0
innodb_buffer_pool_size = 1G
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
max_connections = 200

Restart MariaDB:

sudo systemctl restart mariadb

Open firewall if needed:

# UFW
sudo ufw allow from 192.168.1.0/24 to any port 3306

# Or firewalld
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload

Step 4: Install slurmdbd on Database Server

You can run slurmdbd on the database server or the controller. Running it on the database server keeps database traffic local.

# On database server
sudo apt install slurmdbd -y

Step 5: Configure slurmdbd

Create the slurmdbd configuration file:

sudo nano /etc/slurm/slurmdbd.conf
# slurmdbd.conf - SLURM Database Daemon Configuration

# Daemon settings
DbdHost=dbserver.example.com
DbdPort=6819
SlurmUser=slurm

# Logging
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid
DebugLevel=info

# Database connection
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=your_secure_password
StorageLoc=slurm_acct_db

# Archive settings (optional)
#ArchiveEvents=yes
#ArchiveJobs=yes
#ArchiveResvs=yes
#ArchiveSteps=no
#ArchiveSuspend=no
#ArchiveTXN=no
#ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive

# Purge old data (optional - keep 12 months)
#PurgeEventAfter=12months
#PurgeJobAfter=12months
#PurgeResvAfter=12months
#PurgeStepAfter=12months
#PurgeSuspendAfter=12months
#PurgeTXNAfter=12months
#PurgeUsageAfter=12months

Set proper permissions:

# slurmdbd.conf must be readable only by SlurmUser (contains password)
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf

# Create log directory
sudo mkdir -p /var/log/slurm
sudo chown slurm:slurm /var/log/slurm

Step 6: Start slurmdbd

Start the daemon and verify it connects to the database:

# Start slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmdbd

# Check status
sudo systemctl status slurmdbd

# Check logs for errors
sudo tail -f /var/log/slurm/slurmdbd.log

Successful startup looks like:

slurmdbd: debug:  slurmdbd version 23.02.4 started
slurmdbd: debug:  Listening on 0.0.0.0:6819
slurmdbd: info:   Registering cluster(s) with database

Step 7: Configure slurmctld to Use Accounting

On your controller node, edit slurm.conf:

sudo nano /etc/slurm/slurm.conf

Add accounting configuration:

# Accounting settings
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbserver.example.com
AccountingStoragePort=6819
AccountingStorageEnforce=associations,limits,qos,safe

# Job completion logging
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# Process tracking (required for accurate accounting)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

AccountingStorageEnforce options:

  • associations — Users must have valid account association to submit jobs
  • limits — Enforce resource limits set on accounts/users
  • qos — Enforce Quality of Service settings
  • safe — Only allow jobs that can run within limits

Step 8: Open Firewall for slurmdbd

On the database server, allow connections from the controller:

# UFW
sudo ufw allow from 192.168.1.0/24 to any port 6819

# Or firewalld
sudo firewall-cmd --permanent --add-port=6819/tcp
sudo firewall-cmd --reload

Step 9: Restart slurmctld

On the controller:

sudo systemctl restart slurmctld

# Check it connected to slurmdbd
sudo tail -f /var/log/slurm/slurmctld.log

Look for:

slurmctld: accounting_storage/slurmdbd: init: AccountingStorageHost=dbserver.example.com:6819
slurmctld: accounting_storage/slurmdbd: init: Database connection established

Step 10: Create Cluster in Database

Register your cluster with the accounting database:

sudo sacctmgr add cluster mycluster

Verify:

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
 mycluster  controller.ex.         6817  9728         1                                                                                           normal

Step 11: Create Accounts

Create your account hierarchy:

# Create parent account (organisation)
sudo sacctmgr add account science Description="Science Division" Organization="MyOrg"

# Create department accounts under science
sudo sacctmgr add account physics Description="Physics Department" Organization="MyOrg" Parent=science
sudo sacctmgr add account chemistry Description="Chemistry Department" Organization="MyOrg" Parent=science
sudo sacctmgr add account biology Description="Biology Department" Organization="MyOrg" Parent=science

# Create standalone accounts
sudo sacctmgr add account ai Description="AI Research" Organization="MyOrg"
sudo sacctmgr add account engineering Description="Engineering" Organization="MyOrg"

View account hierarchy:

sacctmgr show account -s
   Account                Descr                  Org
---------- -------------------- --------------------
   science       Science Division                MyOrg
    physics    Physics Department                MyOrg
  chemistry  Chemistry Department                MyOrg
    biology    Biology Department                MyOrg
        ai          AI Research                MyOrg
engineering          Engineering                MyOrg

Step 12: Add Users to Accounts

# Add users to accounts
sudo sacctmgr add user jsmith Account=physics
sudo sacctmgr add user kwilson Account=ai
sudo sacctmgr add user pjones Account=chemistry

# User can belong to multiple accounts
sudo sacctmgr add user jsmith Account=ai

# Set default account for user
sudo sacctmgr modify user jsmith set DefaultAccount=physics

View user associations:

sacctmgr show assoc format=Cluster,Account,User,Partition,Share,MaxJobs,MaxCPUs
   Cluster    Account       User  Partition     Share  MaxJobs  MaxCPUs
---------- ---------- ---------- ---------- --------- -------- --------
 mycluster    physics     jsmith                    1
 mycluster         ai     jsmith                    1
 mycluster         ai    kwilson                    1
 mycluster  chemistry     pjones                    1

Step 13: Set Resource Limits

Apply limits at account or user level:

# Limit physics account to 500 CPUs max, 50 concurrent jobs
sudo sacctmgr modify account physics set MaxCPUs=500 MaxJobs=50

# Limit specific user
sudo sacctmgr modify user jsmith set MaxCPUs=100 MaxJobs=10

# Limit by partition
sudo sacctmgr modify user jsmith where partition=gpu set MaxCPUs=32 MaxJobs=2

View limits:

sacctmgr show assoc format=Cluster,Account,User,Partition,MaxJobs,MaxCPUs,MaxNodes
   Cluster    Account       User  Partition  MaxJobs  MaxCPUs MaxNodes
---------- ---------- ---------- ---------- -------- -------- --------
 mycluster    physics                              50      500
 mycluster    physics     jsmith                   10      100
 mycluster    physics     jsmith        gpu         2       32
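
On recent SLURM releases the same kinds of limits can also be written as TRES strings, which extends naturally to GPUs and memory. A hedged example (the accounts and numbers are illustrative):

# Cap the ai account at 500 CPUs and 16 GPUs in use at once across all of its jobs
sudo sacctmgr modify account ai set GrpTRES=cpu=500,gres/gpu=16

# Cap any single job from jsmith at 64 CPUs and 256 GB of memory
sudo sacctmgr modify user jsmith set MaxTRESPerJob=cpu=64,mem=256G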

Step 14: Configure Fairshare

Fairshare adjusts job priority based on historical usage. Heavy users get lower priority.

# Set shares (relative weight) for accounts
sudo sacctmgr modify account physics set Fairshare=100
sudo sacctmgr modify account chemistry set Fairshare=100
sudo sacctmgr modify account ai set Fairshare=200  # AI gets double weight

Enable fairshare in slurm.conf on the controller:

# Priority settings
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightPartition=1000
PriorityWeightJobSize=500
PriorityDecayHalfLife=7-0
PriorityUsageResetPeriod=MONTHLY

Restart slurmctld after changes:

sudo systemctl restart slurmctld
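
Once jobs start accruing usage, you can watch how it feeds back into priority with sshare:

# Show the full fairshare tree: shares, raw usage, and the resulting FairShare factor
sshare -a

# Limit the view to one account, with extra columns
sshare -A physics -l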

Step 15: Verify Everything Works

Test job submission with accounting:

# Submit job with account
sbatch --account=physics --job-name=test --wrap="sleep 60"

# Check it's tracked
squeue
sacct -j JOBID

Check database connectivity:

# From controller
sacctmgr show cluster
sacctmgr show account
sacctmgr show assoc

Verify accounting is enforced:

# Try submitting without valid account (should fail if enforce=associations)
sbatch --account=nonexistent --wrap="hostname"
# Expected: error: Unable to allocate resources: Invalid account

Check usage reports:

sreport cluster utilization
sreport user top start=2026-01-01
sreport account top start=2026-01-01

Useful sacctmgr Commands

Command                                      Purpose
sacctmgr show cluster                        List registered clusters
sacctmgr show account                        List all accounts
sacctmgr show account -s                     Show account hierarchy
sacctmgr show user                           List all users
sacctmgr show assoc                          Show all associations (user-account mappings)
sacctmgr add account NAME                    Create new account
sacctmgr add user NAME Account=X             Add user to account
sacctmgr modify account X set MaxCPUs=Y      Set account limits
sacctmgr modify user X set MaxJobs=Y         Set user limits
sacctmgr delete user NAME Account=X          Remove user from account
sacctmgr delete account NAME                 Delete account

Troubleshooting

slurmdbd won’t start

# Check logs
sudo tail -100 /var/log/slurm/slurmdbd.log

# Common issues:
# - Wrong database credentials in slurmdbd.conf
# - MySQL not running
# - Permissions on slurmdbd.conf (must be 600, owned by slurm)
# - Munge not running

slurmctld can’t connect to slurmdbd

# Test connectivity
telnet dbserver.example.com 6819

# Check firewall
sudo ufw status
sudo firewall-cmd --list-all

# Verify slurmdbd is listening
ss -tlnp | grep 6819

Jobs not being tracked

# Verify accounting is enabled
scontrol show config | grep AccountingStorage

# Should show:
# AccountingStorageType = accounting_storage/slurmdbd

# Check association exists for user
sacctmgr show assoc user=jsmith

Database connection errors

# Test MySQL connection from slurmdbd host
mysql -h localhost -u slurm -p slurm_acct_db

# Check MySQL is accepting connections
sudo systemctl status mariadb
sudo tail -100 /var/log/mysql/error.log

My Thoughts

Setting up SLURM accounting properly from the start saves headaches later. Once it’s running, you get automatic tracking of every job, fair scheduling between groups, and the data you need for billing and capacity planning.

Key points to remember:

  • Keep the database separate from the controller in production
  • slurmdbd is the middleman — controller never hits the database directly
  • Compute nodes don’t need database access, they just run jobs
  • Set up your account hierarchy before adding users
  • Use AccountingStorageEnforce to make accounting mandatory
  • Fairshare prevents any single group from hogging the cluster

The database is your audit trail. It tracks everything, so when someone asks “why is my job slow” or “how much did we use last month”, you have the answers.

Nick Tailor Notes… Essential SLURM Diagnostic Commands: Outputs and What They Mean

When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run.

This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re seeing.


Job Information

squeue — View Job Queue

The squeue command shows jobs currently in the queue (running and pending).

$ squeue
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]
12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01
12347   batch       analysis    jsmith    PD   0:00       2      (Resources)
12348   long        climate     pjones    PD   0:00       8      (Priority)

Key columns:

  • ST (State) — R=Running, PD=Pending, CG=Completing, F=Failed
  • TIME — How long the job has been running
  • NODELIST(REASON) — Which nodes it’s on, or why it’s pending

Common pending reasons:

  • (Resources) — Waiting for requested resources to become available
  • (Priority) — Other jobs have higher priority
  • (ReqNodeNotAvail) — Requested nodes are down or reserved
  • (QOSMaxJobsPerUserLimit) — User hit their job limit
  • (Dependency) — Waiting for another job to complete
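
To see just the pending jobs together with their reasons, squeue's output format can be customised; a hedged example using standard format codes:

squeue -t PD -o "%.10i %.9P %.12j %.10u %.20r"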

Filter by user or partition:

$ squeue -u jsmith
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]
12347   batch       analysis    jsmith    PD   0:00       2      (Resources)

$ squeue -p gpu
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01

scontrol show job — Detailed Job Information

Use scontrol show job to get comprehensive details about a specific job.

$ scontrol show job 12345
JobId=12345 JobName=simulate
   UserId=jsmith(1001) GroupId=research(100) MCS_label=N/A
   Priority=4294901720 Nice=0 Account=physics QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:34:15 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2026-01-17T08:00:00 EligibleTime=2026-01-17T08:00:05
   AccrueTime=2026-01-17T08:00:05
   StartTime=2026-01-17T08:00:10 EndTime=2026-01-17T20:00:10 Deadline=N/A
   PreemptEligibleTime=2026-01-17T08:00:10 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-17T08:00:05
   Partition=batch AllocNode:Sid=login01:54321
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[001-004]
   BatchHost=node001
   NumNodes=4 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=256G,node=4,billing=32
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=64G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jsmith/jobs/simulate.sh
   WorkDir=/home/jsmith/jobs
   StdErr=/home/jsmith/jobs/simulate_12345.out
   StdIn=/dev/null
   StdOut=/home/jsmith/jobs/simulate_12345.out
   Power=

Key fields to check:

  • JobState — Current state (RUNNING, PENDING, FAILED, COMPLETED, TIMEOUT)
  • Reason — Why job is pending or failed
  • Priority — Job priority (higher = scheduled sooner)
  • RunTime vs TimeLimit — How long it’s run vs maximum allowed
  • NodeList — Which nodes the job is running on
  • ExitCode — Exit status (0:0 = success, non-zero = failure)
  • TRES — Resources allocated (CPUs, memory, GPUs)

sacct — Job Statistics After Completion

Use sacct to view resource usage after a job completes. This is essential for understanding why jobs failed or ran slowly.

$ sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,CPUTime,ExitCode
JobID           JobName    Partition  State      Elapsed     MaxRSS   MaxVMSize    CPUTime  ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
12345          simulate      batch  COMPLETED   02:45:30                         88:00:00      0:0
12345.batch       batch             COMPLETED   02:45:30   52428800K  62914560K  02:45:30      0:0
12345.0        simulate             COMPLETED   02:45:30   48576512K  58720256K  85:14:30      0:0

Key columns:

  • State — COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED
  • Elapsed — Actual wall time used
  • MaxRSS — Peak memory usage (resident set size)
  • CPUTime — Total CPU time consumed (cores × wall time)
  • ExitCode — Exit status (0:0 = success)

Common failure states:

  • FAILED — Job exited with non-zero exit code
  • TIMEOUT — Job exceeded time limit
  • OUT_OF_MEMORY — Job exceeded memory limit (exit code 0:137)
  • CANCELLED — Job was cancelled by user or admin

Check all jobs since midnight:

$ sacct -a --starttime=midnight --format=JobID,User,Partition,State,Elapsed,ExitCode
JobID             User  Partition      State    Elapsed  ExitCode
------------ --------- ---------- ---------- ---------- --------
12340           jsmith      batch  COMPLETED   01:23:45      0:0
12341          kwilson        gpu  COMPLETED   04:56:12      0:0
12342           pjones       long    TIMEOUT   24:00:00      0:1
12343           jsmith      batch     FAILED   00:05:23      1:0
12344          kwilson        gpu OUT_OF_ME+   00:12:34      0:137
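
To narrow that down to just the problem jobs, filter by state:

sacct -a --starttime=midnight --state=FAILED,TIMEOUT,OUT_OF_MEMORY \
      --format=JobID,User,Partition,State,Elapsed,ExitCode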

squeue --start — Estimated Start Time

For pending jobs, squeue --start shows when SLURM expects the job to start.

$ squeue -j 12348 --start
JOBID   PARTITION   NAME        USER      ST   START_TIME           NODES  NODELIST(REASON)
12348   long        climate     pjones    PD   2026-01-17T22:00:00  8      (Priority)

If START_TIME shows “N/A” or a date far in the future, the job may be blocked by resource constraints or priority issues.


Node Information

sinfo — Partition and Node Overview

The sinfo command provides a quick overview of cluster partitions and node states.

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*      up     1-00:00:00    85  idle   node[005-089]
batch*      up     1-00:00:00    10  mix    node[090-099]
batch*      up     1-00:00:00     4  alloc  node[001-004]
batch*      up     1-00:00:00     1  down   node100
gpu         up     1-00:00:00    12  idle   gpu[05-16]
gpu         up     1-00:00:00     4  alloc  gpu[01-04]
highmem     up     2-00:00:00     8  idle   mem[01-08]
debug       up     00:30:00       4  idle   node[001-004]

Node states:

  • idle — Available, no jobs running
  • alloc — Fully allocated to jobs
  • mix — Partially allocated (some CPUs free)
  • down — Unavailable (hardware issue, admin action)
  • drain — Completing current jobs, accepting no new ones
  • drng — Draining with jobs still running
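
For a quick count of how many nodes are in each state, the summary view helps:

# One line per partition with allocated/idle/other/total node counts
sinfo -s

# Or group nodes by partition and state
sinfo -o "%.10P %.6t %.6D"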

Detailed node list:

$ sinfo -N -l
NODELIST    NODES  PARTITION  STATE       CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gpu01           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none
gpu02           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none
node001         1  batch      allocated     32  2:8:2   256000         0       1  (null)    none
node100         1  batch      down*         32  2:8:2   256000         0       1  (null)    Node unresponsive

scontrol show node — Detailed Node Information

Use scontrol show node for comprehensive details about a specific node.

$ scontrol show node node001
NodeName=node001 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=32 CPUEfctv=32 CPUTot=32 CPULoad=31.45
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.15.0-91-generic #101-Ubuntu SMP
   RealMemory=256000 AllocMem=256000 FreeMem=12450 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch,debug
   BootTime=2026-01-10T06:00:00 SlurmdStartTime=2026-01-10T06:01:30
   LastBusyTime=2026-01-17T10:34:15
   CfgTRES=cpu=32,mem=256000M,billing=32
   AllocTRES=cpu=32,mem=256000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Key fields:

  • State — Current node state
  • CPUAlloc/CPUTot — CPUs in use vs total available
  • CPULoad — Current CPU load (should roughly match CPUAlloc)
  • RealMemory/AllocMem/FreeMem — Memory status in MB
  • Gres — Generic resources (GPUs, etc.)
  • Reason — Why node is down/drained (if applicable)

Check why a node is down:

$ scontrol show node node100 | grep -i "state\|reason"
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Reason=Node unresponsive [slurm@2026-01-17T09:15:00]
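
Once the underlying problem is fixed, the node can be drained or returned to service with scontrol; a hedged example using the node names above:

# Take a node out of service gracefully (running jobs finish, no new ones start)
sudo scontrol update NodeName=node055 State=DRAIN Reason="memory DIMM replacement"

# Return a repaired node to service
sudo scontrol update NodeName=node100 State=RESUME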

sinfo -R — Nodes With Problems

Quickly list all nodes that have issues and their reasons.

$ sinfo -R
REASON                              USER        TIMESTAMP           NODELIST
Node unresponsive                   slurm       2026-01-17T09:15:00 node100
Hardware failure - memory           admin       2026-01-16T14:30:00 node055
Scheduled maintenance               admin       2026-01-17T06:00:00 node[080-085]
GPU errors detected                 slurm       2026-01-17T08:45:00 gpu07

List only drained and down nodes:

$ sinfo -t drain,down
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*      up     1-00:00:00     1  down   node100
batch*      up     1-00:00:00     6  drain  node[055,080-085]
gpu         up     1-00:00:00     1  drain  gpu07

Cluster Health

sdiag — Scheduler Statistics

The sdiag command shows scheduler performance metrics and can reveal bottlenecks.

$ sdiag
*******************************************************
sdiag output at 2026-01-17T10:45:00
Data since      2026-01-17T06:00:00
*******************************************************
Server thread count: 10
Agent queue size:    0

Jobs submitted: 1,245
Jobs started:   1,198
Jobs completed: 1,156
Jobs failed:    23
Jobs cancelled: 19

Main schedule statistics (microseconds):
    Last cycle:   125,432
    Max cycle:    892,156
    Total cycles: 4,521
    Mean cycle:   145,678
    Mean depth cycle:  1,245
    Cycles per minute: 15

Backfilling stats
    Total backfilled jobs (since last slurm start): 892
    Total backfilled jobs (since last stats cycle start): 156
    Total backfilled heterogeneous job components: 0
    Total cycles: 4,521
    Last cycle when: 2026-01-17T10:44:55
    Last cycle: 234,567
    Max cycle:  1,456,789
    Last depth cycle: 1,892
    Last depth cycle (try sched): 245
    Depth Mean: 1,456
    Depth Mean (try depth): 198
    Last queue length: 89
    Queue length Mean: 76

Key metrics:

  • Jobs failed — High number indicates systemic issues
  • Mean cycle — Scheduler cycle time (high values = slow scheduling)
  • Max cycle — Worst-case scheduler delay
  • Agent queue size — Should be near 0 (backlog indicator)
  • Total backfilled jobs — Shows backfill scheduler effectiveness

sprio — Job Priority Breakdown

Understand why jobs are scheduled in a particular order.

$ sprio -l
JOBID     USER      PRIORITY    AGE       FAIRSHARE  JOBSIZE   PARTITION  QOS
12345     jsmith    100250      1000      50000      250       49000      0
12346     kwilson   98500       500       48000      500       49500      0
12347     jsmith    95000       100       45000      100       49800      0
12348     pjones    85000       2000      33000      1000      49000      0

Priority components:

  • AGE — How long job has been waiting (prevents starvation)
  • FAIRSHARE — Based on historical usage (heavy users get lower priority)
  • JOBSIZE — Smaller jobs may get priority boost
  • PARTITION — Partition-specific priority modifier
  • QOS — Quality of Service priority adjustment

sreport — Usage Reports

Cluster utilisation:

$ sreport cluster utilization
--------------------------------------------------------------------------------
Cluster Utilization 2026-01-01T00:00:00 - 2026-01-17T10:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Allocated          Down     PLND Down          Idle    Reserved       Total 
--------- ------------- ------------- ------------- ------------- ------------- -------------
  mycluster     18,456,789       234,567             0     2,345,678             0    21,037,034

Top users by usage:

$ sreport user top start=2026-01-01 end=2026-01-17 -t percent
--------------------------------------------------------------------------------
Top 10 Users 2026-01-01T00:00:00 - 2026-01-16T23:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name       Account   Used   Energy
--------- --------- --------------- ------------- ------ --------
mycluster    jsmith    John Smith         physics  24.5%        0
mycluster   kwilson    Kate Wilson             ai  18.2%        0
mycluster    pjones    Paul Jones        climate  15.8%        0
mycluster    agarcia   Ana Garcia          chem   12.1%        0
mycluster    blee      Brian Lee        biology    9.4%        0

Troubleshooting

scontrol ping — Controller Status

$ scontrol ping
Slurmctld(primary) at slurmctl01 is UP
Slurmctld(backup) at slurmctl02 is UP

If the controller is down, no jobs can be scheduled and commands will hang or fail.


systemctl status — Daemon Status

Controller daemon:

$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
     Active: active (running) since Wed 2026-01-10 06:00:15 UTC; 1 week 0 days ago
   Main PID: 1234 (slurmctld)
      Tasks: 15
     Memory: 2.4G
        CPU: 4h 32min
     CGroup: /system.slice/slurmctld.service
             └─1234 /usr/sbin/slurmctld -D -s

Jan 17 10:44:55 slurmctl01 slurmctld[1234]: sched: Allocate JobId=12350 NodeList=node[010-012]

Compute node daemon:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
     Active: active (running) since Wed 2026-01-10 06:01:30 UTC; 1 week 0 days ago
   Main PID: 5678 (slurmd)
      Tasks: 3
     Memory: 45.2M
        CPU: 12min
     CGroup: /system.slice/slurmd.service
             └─5678 /usr/sbin/slurmd -D -s

Jan 17 10:34:15 node001 slurmd[5678]: launch task StepId=12345.0 request from UID 1001

What to look for:

  • Active: active (running) — Daemon is healthy
  • Active: failed — Daemon has crashed, check logs
  • Memory — Controller memory usage (high values may indicate issues)
  • Recent log entries — Look for errors or warnings

scontrol show config — Running Configuration

Dump the active SLURM configuration to verify settings.

$ scontrol show config | head -30
Configuration data as of 2026-01-17T10:45:00
AccountingStorageBackupHost = (null)
AccountingStorageEnforce    = associations,limits,qos,safe
AccountingStorageHost       = slurmdb01
AccountingStorageParameters = (null)
AccountingStoragePort       = 6819
AccountingStorageType       = accounting_storage/slurmdbd
AccountingStorageUser       = slurm
...

Check specific settings:

$ scontrol show config | grep -i preempt
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00

$ scontrol show config | grep -i sched
SchedulerParameters     = bf_continue,bf_max_job_test=1000,default_queue_depth=1000
SchedulerTimeSlice      = 30
SchedulerType           = sched/backfill

Quick Reference

Command                        Purpose
squeue                         View job queue
squeue -u user                 Jobs for specific user
squeue -j jobid --start        Estimated start time
scontrol show job jobid        Detailed job info
sacct -j jobid                 Job stats after completion
sinfo                          Partition and node overview
sinfo -R                       Nodes with problems
sinfo -t drain,down            List problem nodes only
scontrol show node nodename    Detailed node info
sdiag                          Scheduler statistics
sprio -l                       Job priority breakdown
sreport cluster utilization    Cluster usage stats
sreport user top               Top users by usage
scontrol ping                  Check controller status
scontrol show config           Running configuration
systemctl status slurmctld     Controller daemon status
systemctl status slurmd        Compute node daemon status

My Thoughts

Effective SLURM diagnostics comes down to knowing which command gives you what information and being able to interpret the output quickly. When something goes wrong:

  • Start with squeue and sinfo for the big picture
  • Drill down with scontrol show job or scontrol show node
  • Check sacct for jobs that already completed or failed
  • Use sinfo -R to find problem nodes fast
  • Monitor sdiag for scheduler health

Most issues become obvious once you know where to look. The outputs tell you exactly what’s happening — you just need to know how to read them.

SLURM Production Partitions: A Practical Guide to Job Scheduling

When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times.

This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type.


What is a SLURM Partition?

A partition in SLURM is a logical grouping of compute nodes with shared attributes and scheduling policies. Think of partitions as queues: users submit jobs to a partition, and SLURM schedules them according to that partition’s rules.

Partitions allow you to:

  • Separate hardware types (GPU nodes, high-memory nodes, standard compute)
  • Set different time limits and priorities
  • Control access for different user groups
  • Apply different preemption and scheduling policies
  • Track usage for billing and chargeback
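
For example, restricting who can use a partition is just another parameter on its PartitionName line; a hedged snippet (account names are illustrative):

PartitionName=gpu  Nodes=gpu[01-16]  MaxTime=24:00:00  AllowAccounts=ai,physics  State=UP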

Typical Production Partition Layout

A typical production cluster uses partitions structured by resource type and job priority:

# slurm.conf partition configuration

PartitionName=batch    Nodes=node[001-100]  Default=YES  MaxTime=24:00:00  State=UP
PartitionName=short    Nodes=node[001-100]  MaxTime=1:00:00   Priority=100  State=UP
PartitionName=long     Nodes=node[001-100]  MaxTime=7-00:00:00  Priority=10  State=UP
PartitionName=gpu      Nodes=gpu[01-16]     MaxTime=24:00:00  State=UP
PartitionName=highmem  Nodes=mem[01-08]     MaxTime=24:00:00  State=UP
PartitionName=debug    Nodes=node[001-004]  MaxTime=00:30:00  Priority=200  State=UP
PartitionName=preempt  Nodes=node[001-100]  MaxTime=24:00:00  PreemptMode=REQUEUE  State=UP

Partition Definitions

batch

The batch partition is the default queue where most standard compute jobs land. It provides a balance between time limits and priority, suitable for the majority of production workloads. If a user submits a job without specifying a partition, it goes here.

short

The short partition is for quick jobs that need fast turnaround. Higher priority ensures these jobs start quickly, but strict time limits (typically 1 hour or less) prevent users from abusing it for long-running work. Ideal for pre-processing, quick analyses, and iterative development.

long

The long partition accommodates multi-day jobs such as climate simulations, molecular dynamics, or large-scale training runs. Lower priority prevents these jobs from blocking shorter work, but they get scheduled during quieter periods or through backfill.

gpu

The gpu partition contains nodes equipped with GPUs (NVIDIA A100s, H100s, etc.). Separating GPU resources ensures expensive accelerators aren’t wasted on CPU-only workloads and allows for GPU-specific scheduling policies and billing.

highmem

The highmem partition groups high-memory nodes (typically 1TB+ RAM) for memory-intensive workloads like genome assembly, large-scale data analysis, or in-memory databases. These nodes are expensive, so isolating them prevents standard jobs from occupying them unnecessarily.

debug

The debug partition provides rapid access for testing and development. Highest priority and very short time limits (15-30 minutes) ensure users can quickly validate their scripts before submitting large production jobs. Usually limited to a small subset of nodes.

preempt

The preempt partition offers opportunistic access to idle resources. Jobs here can be killed and requeued when higher-priority work arrives. Ideal for fault-tolerant workloads that checkpoint regularly. Users get free cycles in exchange for accepting interruption.
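
For the preempt partition to actually be preempted, the cluster-level preemption settings and the partition priority tiers need to line up in slurm.conf. A hedged sketch (PriorityTier values are illustrative):

# Cluster-wide: preempt based on partition priority, requeue the victim job
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Higher PriorityTier partitions can preempt jobs in lower ones
PartitionName=batch    Nodes=node[001-100]  Default=YES  MaxTime=24:00:00  PriorityTier=100  State=UP
PartitionName=preempt  Nodes=node[001-100]  MaxTime=24:00:00  PriorityTier=1  PreemptMode=REQUEUE  State=UP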


Job Script Templates

Below are production-ready job script templates for each partition type. Adjust resource requests to match your specific workload requirements.

Standard Batch Job

Use the batch partition for typical compute workloads with moderate runtime requirements.

#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --output=%x_%j.out

module load openmpi/4.1.4
mpirun ./simulate --input data.in

Debug Job

Use the debug partition to quickly test job scripts before submitting large production runs. Keep it short — this partition is for validation, not real work.

#!/bin/bash
#SBATCH --job-name=test_run
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:15:00
#SBATCH --output=%x_%j.out

# Quick sanity check before submitting big job
./app --test-mode

GPU Training Job

Use the gpu partition for machine learning training, rendering, or any GPU-accelerated workload. Request specific GPU counts and ensure CUDA environments are loaded.

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=4
#SBATCH --mem=128G
#SBATCH --time=24:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.2 python/3.11

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --epochs 100

High Memory Job

Use the highmem partition for memory-intensive workloads that exceed standard node capacity. Common use cases include genome assembly, large graph processing, and in-memory analytics.

#!/bin/bash
#SBATCH --job-name=genome_assembly
#SBATCH --partition=highmem
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=1T
#SBATCH --time=48:00:00
#SBATCH --output=%x_%j.out

module load assembler/2.1

assembler --threads 32 --memory 900G --input reads.fastq

Long Running Job

Use the long partition for multi-day simulations. Always enable email notifications for job completion or failure, and implement checkpointing for fault tolerance.

#!/bin/bash
#SBATCH --job-name=climate_sim
#SBATCH --partition=long
#SBATCH --nodes=8
#SBATCH --ntasks=256
#SBATCH --time=7-00:00:00
#SBATCH --output=%x_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@company.com

module load openmpi netcdf

mpirun ./climate_model --checkpoint-interval 6h

Preemptible Backfill Job

Use the preempt partition for opportunistic workloads that can tolerate interruption. The --requeue flag ensures the job restarts if preempted. Your application must support checkpointing and resumption.

#!/bin/bash
#SBATCH --job-name=backfill_work
#SBATCH --partition=preempt
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --time=24:00:00
#SBATCH --requeue
#SBATCH --output=%x_%j.out

# Must handle being killed and restarted
./app --checkpoint-dir=/scratch/checkpoints --resume

SBATCH Directive Reference

Common SBATCH directives used across job scripts:

Directive         Purpose                                          Example
--job-name        Job identifier in queue and logs                 --job-name=my_simulation
--partition       Target partition/queue                           --partition=gpu
--nodes           Number of nodes required                         --nodes=4
--ntasks          Total number of tasks (MPI ranks)                --ntasks=64
--cpus-per-task   CPU cores per task (for threading)               --cpus-per-task=8
--mem             Memory per node                                  --mem=128G
--gpus            Number of GPUs required                          --gpus=4
--time            Maximum wall time (D-HH:MM:SS)                   --time=24:00:00
--output          Standard output file (%x=job name, %j=job ID)    --output=%x_%j.out
--mail-type       Email notification triggers                      --mail-type=END,FAIL
--requeue         Requeue job if preempted or failed               --requeue

Partition Selection Guide

Partition   Typical Use Case                            Time Limit   Priority
debug       Testing scripts before production runs      15-30 min    Highest
short       Quick jobs, preprocessing, iteration        1 hour       High
batch       Standard compute workloads                  24 hours     Normal
gpu         ML training, rendering, GPU compute         24 hours     Normal
highmem     Genomics, large datasets, in-memory work    48 hours     Normal
long        Multi-day simulations                       7 days       Low
preempt     Opportunistic, fault-tolerant workloads     24 hours     Lowest

My Thoughts

A well-structured partition layout is the foundation of effective HPC cluster management. By separating resources by type and priority, you ensure:

  • Users get appropriate resources for their workloads
  • Expensive hardware (GPUs, high-memory nodes) is used efficiently
  • Short jobs don’t get stuck behind long-running simulations
  • Testing and development has fast turnaround
  • Usage can be tracked and billed accurately

Start with the templates above and adjust time limits, priorities, and access controls to match your organisation’s requirements. As your cluster grows, you can add specialised partitions for specific hardware or user groups.

Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand

When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour.

This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and identify the real source of performance problems.


Why Diagnostics Matter in HPC Environments

Modern HPC systems are complex. Schedulers manage CPU ownership, operating systems handle memory allocation, applications introduce their own behaviour, and accelerators depend heavily on topology.

Without proper diagnostics, it is easy to misattribute performance problems to applications, when the real issue lies in infrastructure alignment.


Design Goal

The goal is simple:

One reusable script where you update a small set of variables, plug in any workload, and receive a complete diagnostic log.

Here’s how we achieve that.


Reusable HPC Diagnostic Wrapper

Below is a diagnostic wrapper script that can be reused across different workloads. Only the variables at the top need to be changed.

The script is only available to clients who hire me through my limited company at this time.
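
Purely to illustrate the shape such a wrapper takes (a minimal sketch, not the script referenced above): a handful of variables at the top, topology capture, then the workload, all logged to one file.

#!/bin/bash
# Minimal illustrative sketch of a diagnostic wrapper (not the full script above).
# Edit these variables for your workload:
APP_CMD="./app"                                   # workload to run
LOG="diag_$(hostname)_$(date +%Y%m%d_%H%M%S).log" # one log per run

{
  echo "=== HPC DIAGNOSTIC RUN ==="
  echo "Host      : $(hostname)"
  echo "Timestamp : $(date)"
  echo "Command   : $APP_CMD"

  echo "=== NUMA TOPOLOGY ==="
  lscpu | grep -i numa
  numactl --version

  echo "=== GPU TOPOLOGY ==="
  nvidia-smi --query-gpu=index,name,pci.bus_id,memory.total --format=csv

  echo "=== GPU-NUMA AFFINITY ==="
  nvidia-smi topo -m

  echo "=== INFINIBAND STATUS ==="
  ibstat

  echo "=== STARTING APPLICATION ==="
  start=$(date +%s)
  $APP_CMD
  echo "=== COMPLETE ==="
  echo "Runtime (s): $(( $(date +%s) - start ))"
} 2>&1 | tee "$LOG"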

Example Script Output

When you run the diagnostic wrapper on a multi-NUMA HPC node with GPUs and InfiniBand, the complete output looks like this:

=== HPC DIAGNOSTIC RUN ===
Host      : compute-node-42
Timestamp : Sat Jan 17 14:32:01 UTC 2026
Command   : ./app 

=== NUMA TOPOLOGY ===
NUMA node(s):          2
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

numactl 2.0.14

=== GPU TOPOLOGY ===
index, name, pci.bus_id, memory.total [MiB]
0, NVIDIA A100-SXM4-80GB, 00000000:07:00.0, 81920 MiB
1, NVIDIA A100-SXM4-80GB, 00000000:0B:00.0, 81920 MiB
2, NVIDIA A100-SXM4-80GB, 00000000:48:00.0, 81920 MiB
3, NVIDIA A100-SXM4-80GB, 00000000:4C:00.0, 81920 MiB

=== GPU-NUMA AFFINITY ===
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     PIX     0-7             0
GPU1    NV12     X      SYS     SYS     SYS     0-7             0
GPU2    SYS     SYS      X      NV12    SYS     8-15            1
GPU3    SYS     SYS     NV12     X      PIX     8-15            1
mlx5_0  PIX     SYS     SYS     PIX      X

=== INFINIBAND STATUS ===
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.31.1014
    Hardware version: 0
    Node GUID: 0x1070fd0300123456
    System image GUID: 0x1070fd0300123456
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x1070fd0300123456
        Link layer: InfiniBand

=== INFINIBAND LINK ===
Infiniband device 'mlx5_0' port 1 status:
    default gid:    fe80:0000:0000:0000:1070:fd03:0012:3456
    base lid:       0x1
    sm lid:         0x1
    state:          4: ACTIVE
    phys state:     5: LinkUp
    rate:           200 Gb/sec (4X HDR)
    link_layer:     InfiniBand

=== STARTING APPLICATION ===
[compute-node-42:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[compute-node-42:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[compute-node-42:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[compute-node-42:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]

=== NUMA POLICY ===
policy: bind
physcpubind: 8
membind: 1

=== CPU AFFINITY ===
pid 12345's current affinity list: 0
pid 12346's current affinity list: 1
pid 12347's current affinity list: 8
pid 12348's current affinity list: 9
  PID PSR COMMAND
12345   0 app
12346   1 app
12347   8 app
12348   9 app

=== NUMA MEMORY STATS ===
Per-node process memory usage (in MBs) for PID 12345 (app)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       128.00            0.00          128.00
Stack                        0.12            0.00            0.12
Private                   5120.00            0.00         5120.00
----------------  --------------- --------------- ---------------
Total                     5248.12            0.00         5248.12

=== GPU UTILISATION ===
index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB], temperature.gpu
0, 87 %, 45 %, 36864 MiB, 45056 MiB, 62
1, 0 %, 0 %, 0 MiB, 81920 MiB, 34
2, 92 %, 52 %, 42240 MiB, 39680 MiB, 65
3, 0 %, 0 %, 0 MiB, 81920 MiB, 35

=== GPU PROCESS LIST ===
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
12345, app, 00000000:07:00.0, 36864 MiB
12347, app, 00000000:48:00.0, 42240 MiB

=== INFINIBAND COUNTERS ===
# Port extended counters: Lid 1 port 1 (CapMask: 0x5300)
PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678
PortUnicastXmitPkts:..............45678900
PortUnicastRcvPkts:...............43215666
PortMulticastXmitPkts:............12
PortMulticastRcvPkts:.............12

=== COMPLETE ===
Runtime (s): 47

This single log file captures everything you need to verify correct infrastructure alignment across CPU, memory, GPU, and network fabric. The following sections explain how to interpret each part of this output.


Interpreting the Diagnostic Output

Each section of the output tells you something specific about how your workload is interacting with the underlying hardware. Here’s how to read each one.

NUMA Binding (numactl --show)

Good output:

policy: bind
physcpubind: 8
membind: 1

This confirms that the process is pinned to CPU 8 and all memory allocations are restricted to NUMA node 1.

Bad output:

policy: default
physcpubind: 8
membind: 0 1

Memory is being allocated across multiple NUMA nodes, resulting in cross-socket access, higher latency, and unstable performance.
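
The good binding above is what you get when the workload is launched under numactl; a hedged example matching the CPU and node numbers in the sample output:

numactl --physcpubind=8 --membind=1 ./app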


NUMA Memory Locality (numastat -p)

Good output:

Per-node process memory usage (MB)
Node 0:      0
Node 1:  10240

All memory usage is local to the NUMA node where the process is running. This is the expected and optimal behaviour.

Bad output:

Per-node process memory usage (MB)
Node 0:   4096
Node 1:   6144

Memory is split across NUMA nodes. This commonly leads to unpredictable runtimes, MPI slowdowns, and reduced GPU efficiency.


CPU Affinity (ps / taskset)

Good output:

PID   PSR  COMMAND
1234   8   app

pid 1234's current affinity list: 8

The process remains on the intended CPU and does not migrate between cores. Cache locality is preserved.

Bad output:

PID   PSR  COMMAND
1234   3   app

pid 1234's current affinity list: 0-15

The process has migrated to a different CPU. This usually indicates missing or ineffective CPU binding.
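
Binding can be applied at launch with taskset, or by letting SLURM pin each task; hedged examples:

# Pin the process to CPU 8 at launch
taskset -c 8 ./app

# Or let SLURM bind each task to a core
srun --cpu-bind=cores ./app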


GPU-NUMA Affinity (nvidia-smi topo -m)

Good output:

        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NV12    PIX     0-7             0
GPU1    NV12     X      SYS     0-7             0
mlx5_0  PIX     SYS      X

This shows GPU0 and the InfiniBand adapter (mlx5_0) share the same PCIe switch (PIX), meaning GPU-to-network transfers bypass the CPU entirely. Both GPUs are local to NUMA node 0.

Bad output:

        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      SYS     SYS     0-7             0
GPU1    SYS      X      SYS     8-15            1
mlx5_0  SYS     SYS      X

All devices are connected via SYS (system/QPI), meaning every GPU-to-GPU and GPU-to-network transfer must traverse the CPU interconnect. This adds latency and consumes memory bandwidth.

Key topology indicators:

  • NV# — NVLink connection (fastest GPU-to-GPU)
  • PIX — Same PCIe switch (fast, CPU-bypass)
  • PXB — Same PCIe bridge (good)
  • SYS — Crosses CPU/QPI (slowest, avoid for latency-sensitive workloads)

GPU Utilisation (nvidia-smi)

Good output:

index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 95 %, 72000 MiB, 68

GPU is highly utilised, memory is well allocated, and temperature is within operating range. The workload is GPU-bound as expected.

Bad output:

index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 12 %, 8000 MiB, 42

Low GPU utilisation with minimal memory usage suggests the workload is CPU-bound or waiting on I/O. Check for data loading bottlenecks, CPU preprocessing stalls, or incorrect batch sizes.


InfiniBand Status (ibstat / ibstatus)

Good output:

Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 200

The InfiniBand port is active, physically connected, and running at expected speed (200 Gb/s HDR).

Bad output:

Port 1:
    State: Down
    Physical state: Polling
    Rate: 10

The port is not connected or is negotiating at a much lower speed. Check cables, switch configuration, and subnet manager status.

Common link states:

  • Active / LinkUp — Normal operation
  • Init / LinkUp — Waiting for subnet manager
  • Down / Polling — No physical connection or cable fault
  • Armed — Link trained but not yet activated

InfiniBand Counters (perfquery)

Good output:

PortXmitData:.....................124587623456
PortRcvData:......................118745236789
PortXmitPkts:.....................45678912
PortRcvPkts:......................43215678

Data is flowing in both directions with balanced transmit and receive counts.

Bad output:

PortXmitData:.....................124587623456
PortRcvData:......................0
SymbolErrorCounter:...............4521
LinkDownedCounter:................12

Zero receive data with symbol errors and link-down events indicates cable or transceiver problems. Physical layer inspection is required.

Key counters to watch:

  • SymbolErrorCounter — Bit errors on the wire (should be 0)
  • LinkDownedCounter — Link reset events (should be 0 during operation)
  • PortRcvErrors — Malformed packets received
  • PortXmitDiscards — Packets dropped due to congestion
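
A useful habit is to clear the counters, run the workload, and then check whether the error counters moved at all; a hedged example for the local HCA:

# Reset all port counters, then read the extended counters after the run
perfquery -R
perfquery -x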

MPI Rank Binding (--report-bindings)

Good output:

[node:12345] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[node:12346] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[node:12347] MCW rank 2 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[node:12348] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]

Each MPI rank is bound to a specific core, distributed evenly across NUMA nodes. The B indicates where each rank is pinned.

Bad output:

[node:12345] MCW rank 0 is not bound (or bound to all available processors)
[node:12346] MCW rank 1 is not bound (or bound to all available processors)
[node:12347] MCW rank 2 is not bound (or bound to all available processors)
[node:12348] MCW rank 3 is not bound (or bound to all available processors)

MPI ranks are floating across all CPUs. This causes cache thrashing, cross-NUMA memory access, and inconsistent performance. Add --bind-to core to your mpirun command.
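
For Open MPI, the bindings reported above come from flags like these (a hedged example):

mpirun --bind-to core --map-by numa --report-bindings ./app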


Diagnosing the Root Cause

By comparing good and bad outputs, we can narrow down the root cause:

  • Cross-NUMA memory allocation — indicates locality problems, often caused by missing --membind or memory allocated before binding was applied
  • CPU migration — points to missing or overridden affinity, commonly from scheduler interference or missing --physcpubind
  • Low GPU utilisation — suggests CPU bottleneck, data loading stalls, or incorrect CUDA device selection
  • GPU-NUMA mismatch — process running on wrong NUMA node relative to GPU, causing PCIe traffic to cross CPU socket
  • SYS topology between GPU and NIC — GPU-direct RDMA will underperform; consider workload placement or hardware topology changes
  • InfiniBand errors — physical layer problems requiring cable, transceiver, or switch port inspection
  • Unbound MPI ranks — missing binding flags causing rank migration and cache invalidation
  • High runtime variance — usually correlates with topology misalignment and can be confirmed by checking the above metrics across multiple runs

This comparison-driven approach removes guesswork and makes infrastructure-level issues easy to identify and prove.


My Thoughts

When running HPC systems, you need rich diagnostic information to figure out where a problem actually lies.

Collecting CPU placement, NUMA locality, memory allocation, GPU topology, InfiniBand status, and MPI binding together allows you to methodically narrow down the root cause instead of guessing.

When these signals line up, performance is predictable and consistent. When they do not, the logs will usually tell you exactly what is wrong.