Essential SLURM Diagnostic Commands: Outputs and What They Mean

When managing HPC clusters, knowing how to quickly diagnose job issues, spot node problems, and assess overall cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run.

This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re seeing.


Job Information

squeue — View Job Queue

The squeue command shows jobs currently in the queue (running and pending).

$ squeue
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]
12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01
12347   batch       analysis    jsmith    PD   0:00       2      (Resources)
12348   long        climate     pjones    PD   0:00       8      (Priority)

Key columns:

  • ST (State) — R=Running, PD=Pending, CG=Completing, F=Failed
  • TIME — How long the job has been running
  • NODELIST(REASON) — Which nodes it’s on, or why it’s pending

Common pending reasons:

  • (Resources) — Waiting for requested resources to become available
  • (Priority) — Other jobs have higher priority
  • (ReqNodeNotAvail) — Requested nodes are down or reserved
  • (QOSMaxJobsPerUserLimit) — User hit their job limit
  • (Dependency) — Waiting for another job to complete
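
To see only the pending jobs together with their reasons, squeue's output format can be customised (a quick sketch using standard format codes; the column widths are arbitrary and the output below just reflects the pending jobs from the queue above):

$ squeue --states=PENDING -o "%.8i %.10P %.10u %.16r"
   JOBID  PARTITION       USER           REASON
   12347      batch     jsmith        Resources
   12348       long     pjones         Priority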

Filter by user or partition:

$ squeue -u jsmith
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]
12347   batch       analysis    jsmith    PD   0:00       2      (Resources)

$ squeue -p gpu
JOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)
12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01

scontrol show job — Detailed Job Information

Use scontrol show job to get comprehensive details about a specific job.

$ scontrol show job 12345
JobId=12345 JobName=simulate
   UserId=jsmith(1001) GroupId=research(100) MCS_label=N/A
   Priority=100250 Nice=0 Account=physics QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:34:15 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2026-01-17T08:00:00 EligibleTime=2026-01-17T08:00:05
   AccrueTime=2026-01-17T08:00:05
   StartTime=2026-01-17T08:00:10 EndTime=2026-01-17T20:00:10 Deadline=N/A
   PreemptEligibleTime=2026-01-17T08:00:10 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-17T08:00:05
   Partition=batch AllocNode:Sid=login01:54321
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[001-004]
   BatchHost=node001
   NumNodes=4 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=256G,node=4,billing=32
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=64G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jsmith/jobs/simulate.sh
   WorkDir=/home/jsmith/jobs
   StdErr=/home/jsmith/jobs/simulate_12345.out
   StdIn=/dev/null
   StdOut=/home/jsmith/jobs/simulate_12345.out
   Power=

Key fields to check:

  • JobState — Current state (RUNNING, PENDING, FAILED, COMPLETED, TIMEOUT)
  • Reason — Why job is pending or failed
  • Priority — Job priority (higher = scheduled sooner)
  • RunTime vs TimeLimit — How long it’s run vs maximum allowed
  • NodeList — Which nodes the job is running on
  • ExitCode — Exit status (0:0 = success, non-zero = failure)
  • TRES — Resources allocated (CPUs, memory, GPUs)
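
If you only need a handful of these fields, filtering the output with grep keeps things readable (a simple sketch; add or drop patterns as needed — note that a pattern like NodeList also matches ReqNodeList/ExcNodeList):

$ scontrol show job 12345 | grep -E "JobState|RunTime|NodeList|ExitCode"
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:34:15 TimeLimit=12:00:00 TimeMin=N/A
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[001-004]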

sacct — Job Statistics After Completion

Use sacct to view resource usage after a job completes. This is essential for understanding why jobs failed or ran slowly.

$ sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,CPUTime,ExitCode
JobID           JobName    Partition  State      Elapsed     MaxRSS   MaxVMSize    CPUTime  ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
12345          simulate      batch  COMPLETED   02:45:30                         88:16:00      0:0
12345.batch       batch             COMPLETED   02:45:30   52428800K  62914560K  02:45:30      0:0
12345.0        simulate             COMPLETED   02:45:30   48576512K  58720256K  85:14:30      0:0

Key columns:

  • State — COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED
  • Elapsed — Actual wall time used
  • MaxRSS — Peak memory usage (resident set size)
  • CPUTime — Total CPU time consumed (cores × wall time)
  • ExitCode — Exit status (0:0 = success)

Common failure states:

  • FAILED — Job exited with non-zero exit code
  • TIMEOUT — Job exceeded time limit
  • OUT_OF_MEMORY — Job exceeded memory limit (e.g. exit code 0:137, where 137 = 128 + SIGKILL)
  • CANCELLED — Job was cancelled by user or admin

Check all jobs since midnight:

$ sacct -a --starttime=midnight --format=JobID,User,Partition,State,Elapsed,ExitCode
JobID             User  Partition      State    Elapsed  ExitCode
------------ --------- ---------- ---------- ---------- --------
12340           jsmith      batch  COMPLETED   01:23:45      0:0
12341          kwilson        gpu  COMPLETED   04:56:12      0:0
12342           pjones       long    TIMEOUT   24:00:00      0:1
12343           jsmith      batch     FAILED   00:05:23      1:0
12344          kwilson        gpu OUT_OF_ME+   00:12:34      0:137
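
To narrow that down to problem jobs only, sacct can also filter on state (a hedged example; with the jobs above it would surface just the ones that did not complete cleanly):

$ sacct -a --starttime=midnight -s FAILED,TIMEOUT,OUT_OF_MEMORY --format=JobID,User,Partition,State,Elapsed,ExitCode
JobID             User  Partition      State    Elapsed  ExitCode
------------ --------- ---------- ---------- ---------- --------
12342           pjones       long    TIMEOUT   24:00:00      0:1
12343           jsmith      batch     FAILED   00:05:23      1:0
12344          kwilson        gpu OUT_OF_ME+   00:12:34      0:137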

squeue --start — Estimated Start Time

For pending jobs, squeue --start shows when SLURM expects the job to start.

$ squeue -j 12348 --start
JOBID   PARTITION   NAME        USER      ST   START_TIME           NODES  NODELIST(REASON)
12348   long        climate     pjones    PD   2026-01-17T22:00:00  8      (Priority)

If START_TIME shows “N/A” or a date far in the future, the job may be blocked by resource constraints or priority issues.


Node Information

sinfo — Partition and Node Overview

The sinfo command provides a quick overview of cluster partitions and node states.

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*      up     1-00:00:00    85  idle   node[005-089]
batch*      up     1-00:00:00    10  mix    node[090-099]
batch*      up     1-00:00:00     4  alloc  node[001-004]
batch*      up     1-00:00:00     1  down   node100
gpu         up     1-00:00:00    12  idle   gpu[05-16]
gpu         up     1-00:00:00     4  alloc  gpu[01-04]
highmem     up     2-00:00:00     8  idle   mem[01-08]
debug       up     00:30:00       4  alloc  node[001-004]

Node states:

  • idle — Available, no jobs running
  • alloc — Fully allocated to jobs
  • mix — Partially allocated (some CPUs free)
  • down — Unavailable (hardware issue, admin action)
  • drain — Drained: accepting no new jobs and no jobs left running (check the reason)
  • drng — Draining: accepting no new jobs, but existing jobs are still running
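
For a one-line-per-partition summary, sinfo -s collapses the node counts into allocated/idle/other/total (a hedged sketch of what this might look like for the cluster above):

$ sinfo -s
PARTITION   AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
batch*      up     1-00:00:00     14/85/1/100  node[001-100]
gpu         up     1-00:00:00       4/12/0/16  gpu[01-16]
highmem     up     2-00:00:00         0/8/0/8  mem[01-08]
debug       up     00:30:00           4/0/0/4  node[001-004]

Here A/I/O/T is allocated/idle/other/total; mixed nodes generally land in the allocated count, while down and drained nodes show up under "other".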

Detailed node list:

$ sinfo -N -l
NODELIST    NODES  PARTITION  STATE       CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
gpu01           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none
gpu02           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none
node001         1  batch      allocated     32  2:8:2   256000         0       1  (null)    none
node100         1  batch      down*         32  2:8:2   256000         0       1  (null)    Node unresponsive

scontrol show node — Detailed Node Information

Use scontrol show node for comprehensive details about a specific node.

$ scontrol show node node001
NodeName=node001 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=32 CPUEfctv=32 CPUTot=32 CPULoad=31.45
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node001 NodeHostName=node001 Version=23.02.4
   OS=Linux 5.15.0-91-generic #101-Ubuntu SMP
   RealMemory=256000 AllocMem=256000 FreeMem=12450 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch,debug
   BootTime=2026-01-10T06:00:00 SlurmdStartTime=2026-01-10T06:01:30
   LastBusyTime=2026-01-17T10:34:15
   CfgTRES=cpu=32,mem=256000M,billing=32
   AllocTRES=cpu=32,mem=256000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Key fields:

  • State — Current node state
  • CPUAlloc/CPUTot — CPUs in use vs total available
  • CPULoad — Current CPU load (should roughly match CPUAlloc when jobs are busy; a much lower value suggests idle or stalled jobs)
  • RealMemory/AllocMem/FreeMem — Memory status in MB
  • Gres — Generic resources (GPUs, etc.)
  • Reason — Why node is down/drained (if applicable)

Check why a node is down:

$ scontrol show node node100 | grep -i "state\|reason"
   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Reason=Node unresponsive [slurm@2026-01-17T09:15:00]

sinfo -R — Nodes With Problems

Quickly list all nodes that have issues and their reasons.

$ sinfo -R
REASON                              USER        TIMESTAMP           NODELIST
Node unresponsive                   slurm       2026-01-17T09:15:00 node100
Hardware failure - memory           admin       2026-01-16T14:30:00 node055
Scheduled maintenance               admin       2026-01-17T06:00:00 node[080-085]
GPU errors detected                 slurm       2026-01-17T08:45:00 gpu07

List only drained and down nodes:

$ sinfo -t drain,down
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*      up     1-00:00:00     1  down   node100
batch*      up     1-00:00:00     7  drain  node[055,080-085]
gpu         up     1-00:00:00     1  drain  gpu07
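
Not strictly a diagnostic command, but once the underlying fault is fixed the usual follow-up is to return the node to service (run with SLURM admin privileges; node055 here is just the example from above):

$ scontrol update NodeName=node055 State=RESUME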

Cluster Health

sdiag — Scheduler Statistics

The sdiag command shows scheduler performance metrics and can reveal bottlenecks.

$ sdiag
*******************************************************
sdiag output at 2026-01-17T10:45:00
Data since      2026-01-17T06:00:00
*******************************************************
Server thread count: 10
Agent queue size:    0

Jobs submitted: 1,245
Jobs started:   1,198
Jobs completed: 1,156
Jobs failed:    23
Jobs cancelled: 19

Main schedule statistics (microseconds):
    Last cycle:   125,432
    Max cycle:    892,156
    Total cycles: 4,521
    Mean cycle:   145,678
    Mean depth cycle:  1,245
    Cycles per minute: 15

Backfilling stats
    Total backfilled jobs (since last slurm start): 892
    Total backfilled jobs (since last stats cycle start): 156
    Total backfilled heterogeneous job components: 0
    Total cycles: 4,521
    Last cycle when: 2026-01-17T10:44:55
    Last cycle: 234,567
    Max cycle:  1,456,789
    Last depth cycle: 1,892
    Last depth cycle (try sched): 245
    Depth Mean: 1,456
    Depth Mean (try depth): 198
    Last queue length: 89
    Queue length Mean: 76

Key metrics:

  • Jobs failed — High number indicates systemic issues
  • Mean cycle — Scheduler cycle time (high values = slow scheduling)
  • Max cycle — Worst-case scheduler delay
  • Agent queue size — Should be near 0 (backlog indicator)
  • Total backfilled jobs — Shows backfill scheduler effectiveness
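
If scheduling feels sluggish, watching the headline numbers over time can show whether cycle times are creeping up (a simple sketch using standard tools; adjust the grep patterns to match your sdiag output):

$ watch -n 60 'sdiag | grep -E "Agent queue size|Last cycle|Mean cycle"'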

sprio — Job Priority Breakdown

Understand why jobs are scheduled in a particular order.

$ sprio -l
JOBID     USER      PRIORITY    AGE       FAIRSHARE  JOBSIZE   PARTITION  QOS
12345     jsmith    100250      1000      50000      250       49000      0
12346     kwilson   98500       500       48000      500       49500      0
12347     jsmith    95000       100       45000      100       49800      0
12348     pjones    85000       2000      33000      1000      49000      0

Priority components:

  • AGE — How long job has been waiting (prevents starvation)
  • FAIRSHARE — Based on historical usage (heavy users get lower priority)
  • JOBSIZE — Smaller jobs may get priority boost
  • PARTITION — Partition-specific priority modifier
  • QOS — Quality of Service priority adjustment
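
To see how heavily each factor is weighted on your cluster, compare the per-job values above against the configured weights (a hedged pair of checks: sprio -w should print the factor weights, and the PriorityWeight* parameters show the same information from the running config):

$ sprio -w
$ scontrol show config | grep -i PriorityWeight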

sreport — Usage Reports

Cluster utilisation:

$ sreport cluster utilization
--------------------------------------------------------------------------------
Cluster Utilization 2026-01-01T00:00:00 - 2026-01-17T10:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Allocated          Down     PLND Down          Idle    Reserved       Total 
--------- ------------- ------------- ------------- ------------- ------------- -------------
  mycluster     18,456,789       234,567             0     2,345,678             0    21,037,034

Top users by usage:

$ sreport user top start=2026-01-01 end=2026-01-17 -t percent
--------------------------------------------------------------------------------
Top 10 Users 2026-01-01T00:00:00 - 2026-01-16T23:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name       Account   Used   Energy
--------- --------- --------------- ------------- ------ --------
mycluster    jsmith    John Smith         physics  24.5%        0
mycluster   kwilson    Kate Wilson             ai  18.2%        0
mycluster    pjones    Paul Jones         climate  15.8%        0
mycluster   agarcia    Ana Garcia            chem  12.1%        0
mycluster      blee    Brian Lee          biology   9.4%        0
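
To break usage down by account rather than by individual user, sreport also has an account-level report (a hedged example; report names are case-insensitive):

$ sreport cluster AccountUtilizationByUser start=2026-01-01 end=2026-01-17 -t percent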

Troubleshooting

scontrol ping — Controller Status

$ scontrol ping
Slurmctld(primary) at slurmctl01 is UP
Slurmctld(backup) at slurmctl02 is UP

If the controller is down, no jobs can be scheduled and commands will hang or fail.


systemctl status — Daemon Status

Controller daemon:

$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
     Active: active (running) since Sat 2026-01-10 06:00:15 UTC; 1 week 0 days ago
   Main PID: 1234 (slurmctld)
      Tasks: 15
     Memory: 2.4G
        CPU: 4h 32min
     CGroup: /system.slice/slurmctld.service
             └─1234 /usr/sbin/slurmctld -D -s

Jan 17 10:44:55 slurmctl01 slurmctld[1234]: sched: Allocate JobId=12350 NodeList=node[010-012]

Compute node daemon:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
     Active: active (running) since Sat 2026-01-10 06:01:30 UTC; 1 week 0 days ago
   Main PID: 5678 (slurmd)
      Tasks: 3
     Memory: 45.2M
        CPU: 12min
     CGroup: /system.slice/slurmd.service
             └─5678 /usr/sbin/slurmd -D -s

Jan 17 10:34:15 node001 slurmd[5678]: launch task StepId=12345.0 request from UID 1001

What to look for:

  • Active: active (running) — Daemon is healthy
  • Active: failed — Daemon has crashed, check logs
  • Memory — Controller memory usage (high values may indicate issues)
  • Recent log entries — Look for errors or warnings
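
When a daemon is failing or flapping, the systemd journal usually has the detail behind the status summary (assuming the hosts use journald; the unit names match the services above):

$ journalctl -u slurmctld --since "1 hour ago" | grep -iE "error|fail"
$ journalctl -u slurmd --since "1 hour ago" | grep -iE "error|fail"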

scontrol show config — Running Configuration

Dump the active SLURM configuration to verify settings.

$ scontrol show config | head -30
Configuration data as of 2026-01-17T10:45:00
AccountingStorageBackupHost = (null)
AccountingStorageEnforce    = associations,limits,qos,safe
AccountingStorageHost       = slurmdb01
AccountingStorageParameters = (null)
AccountingStoragePort       = 6819
AccountingStorageType       = accounting_storage/slurmdbd
AccountingStorageUser       = slurm
...

Check specific settings:

$ scontrol show config | grep -i preempt
PreemptMode             = REQUEUE
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00

$ scontrol show config | grep -i sched
SchedulerParameters     = bf_continue,bf_max_job_test=1000,default_queue_depth=1000
SchedulerTimeSlice      = 30
SchedulerType           = sched/backfill

Quick Reference

Command                          Purpose
squeue                           View job queue
squeue -u user                   Jobs for specific user
squeue -j jobid --start          Estimated start time
scontrol show job jobid          Detailed job info
sacct -j jobid                   Job stats after completion
sinfo                            Partition and node overview
sinfo -R                         Nodes with problems
sinfo -t drain,down              List problem nodes only
scontrol show node nodename      Detailed node info
sdiag                            Scheduler statistics
sprio -l                         Job priority breakdown
sreport cluster utilization      Cluster usage stats
sreport user top                 Top users by usage
scontrol ping                    Check controller status
scontrol show config             Running configuration
systemctl status slurmctld       Controller daemon status
systemctl status slurmd          Compute node daemon status

My Thoughts

Effective SLURM diagnostics comes down to knowing which command gives you what information and being able to interpret the output quickly. When something goes wrong:

  • Start with squeue and sinfo for the big picture
  • Drill down with scontrol show job or scontrol show node
  • Check sacct for jobs that already completed or failed
  • Use sinfo -R to find problem nodes fast
  • Monitor sdiag for scheduler health

Most issues become obvious once you know where to look. The outputs tell you exactly what’s happening — you just need to know how to read them.
