When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run.
This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re seeing.
Job Information
squeue — View Job Queue
The squeue command shows jobs currently in the queue (running and pending).
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 batch simulate jsmith R 2:34:15 4 node[001-004]
12346 gpu train_ml kwilson R 0:45:22 1 gpu01
12347 batch analysis jsmith PD 0:00 2 (Resources)
12348 long climate pjones PD 0:00 8 (Priority)
Key columns:
- ST (State) — R=Running, PD=Pending, CG=Completing, F=Failed
- TIME — How long the job has been running
- NODELIST(REASON) — Which nodes it’s on, or why it’s pending
Common pending reasons:
- (Resources) — Waiting for requested resources to become available
- (Priority) — Other jobs have higher priority
- (ReqNodeNotAvail) — Requested nodes are down or reserved
- (QOSMaxJobsPerUserLimit) — User hit their job limit
- (Dependency) — Waiting for another job to complete
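To see just the pending jobs together with their reasons in one place, you can override squeue's output format; %r prints the reason column on its own. A quick sketch using the queue above (the column widths are an arbitrary choice):
$ squeue -t PD -o "%.8i %.9P %.10j %.8u %.20r"
   JOBID PARTITION       NAME     USER               REASON
   12347     batch   analysis   jsmith            Resources
   12348      long    climate   pjones             Priority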
Filter by user or partition:
$ squeue -u jsmith
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 batch simulate jsmith R 2:34:15 4 node[001-004]
12347 batch analysis jsmith PD 0:00 2 (Resources)
$ squeue -p gpu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12346 gpu train_ml kwilson R 0:45:22 1 gpu01
scontrol show job — Detailed Job Information
Use scontrol show job to get comprehensive details about a specific job.
$ scontrol show job 12345
JobId=12345 JobName=simulate
UserId=jsmith(1001) GroupId=research(100) MCS_label=N/A
Priority=4294901720 Nice=0 Account=physics QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=02:34:15 TimeLimit=12:00:00 TimeMin=N/A
SubmitTime=2026-01-17T08:00:00 EligibleTime=2026-01-17T08:00:05
AccrueTime=2026-01-17T08:00:05
StartTime=2026-01-17T08:00:10 EndTime=2026-01-17T20:00:10 Deadline=N/A
PreemptEligibleTime=2026-01-17T08:00:10 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-17T08:00:05
Partition=batch AllocNode:Sid=login01:54321
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[001-004]
BatchHost=node001
NumNodes=4 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=256G,node=4,billing=32
Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryNode=64G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/jsmith/jobs/simulate.sh
WorkDir=/home/jsmith/jobs
StdErr=/home/jsmith/jobs/simulate_12345.out
StdIn=/dev/null
StdOut=/home/jsmith/jobs/simulate_12345.out
Power=
Key fields to check:
- JobState — Current state (RUNNING, PENDING, FAILED, COMPLETED, TIMEOUT)
- Reason — Why job is pending or failed
- Priority — Job priority (higher = scheduled sooner)
- RunTime vs TimeLimit — How long it’s run vs maximum allowed
- NodeList — Which nodes the job is running on
- ExitCode — Exit status (0:0 = success, non-zero = failure)
- TRES — Resources allocated (CPUs, memory, GPUs)
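When several of a user's jobs are stuck, a small loop over squeue's job IDs keeps the output manageable. A rough sketch using jsmith from the examples above (-h drops the header, -o %i prints bare job IDs):
$ for j in $(squeue -u jsmith -t PD -h -o %i); do
    scontrol show job "$j" | grep -E 'JobId=|JobState|Priority|TimeLimit'
  done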
sacct — Job Statistics After Completion
Use sacct to view resource usage after a job completes. This is essential for
understanding why jobs failed or ran slowly.
$ sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,CPUTime,ExitCode
JobID JobName Partition State Elapsed MaxRSS MaxVMSize CPUTime ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------
12345 simulate batch COMPLETED 02:45:30 88:00:00 0:0
12345.batch batch COMPLETED 02:45:30 52428800K 62914560K 02:45:30 0:0
12345.0 simulate COMPLETED 02:45:30 48576512K 58720256K 85:14:30 0:0
Key columns:
- State — COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED
- Elapsed — Actual wall time used
- MaxRSS — Peak memory usage (resident set size)
- CPUTime — Total CPU time consumed (cores × wall time)
- ExitCode — Exit status (0:0 = success)
Common failure states:
- FAILED — Job exited with non-zero exit code
- TIMEOUT — Job exceeded time limit
- OUT_OF_MEMORY — Job exceeded memory limit (exit code 0:137)
- CANCELLED — Job was cancelled by user or admin
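For a suspected memory kill, requesting ReqMem alongside MaxRSS in the same sacct call usually settles the question. A hedged one-liner (ReqMem, Timelimit and MaxRSS are standard sacct format fields; 12345 is the example job from above):
$ sacct -j 12345 --format=JobID,State,ReqMem,MaxRSS,Elapsed,Timelimit,ExitCode
If MaxRSS is close to or above ReqMem, or Elapsed matches Timelimit, the failure mode is usually obvious.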
Check all jobs since midnight:
$ sacct -a --starttime=midnight --format=JobID,User,Partition,State,Elapsed,ExitCode
JobID User Partition State Elapsed ExitCode
------------ --------- ---------- ---------- ---------- --------
12340 jsmith batch COMPLETED 01:23:45 0:0
12341 kwilson gpu COMPLETED 04:56:12 0:0
12342 pjones long TIMEOUT 24:00:00 0:1
12343 jsmith batch FAILED 00:05:23 1:0
12344 kwilson gpu OUT_OF_ME+ 00:12:34 0:137
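To see which users the day's failures are concentrated on, the same query can be restricted by state and aggregated. A rough sketch (-X limits output to job allocations so each job is counted once):
$ sacct -a -X --starttime=midnight --state=FAILED,TIMEOUT,OUT_OF_MEMORY \
        --format=User --noheader | sort | uniq -c | sort -rn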
squeue --start — Estimated Start Time
For pending jobs, squeue --start shows when SLURM expects the job to start.
$ squeue -j 12348 --start
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
12348 long climate pjones PD 2026-01-17T22:00:00 8 (Priority)
If START_TIME shows “N/A” or a date far in the future, the job may be blocked by resource constraints or priority issues.
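The flag accepts the usual filters, so you can check the estimates for everything a user has pending in one go:
$ squeue --start -u jsmith -t PD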
Node Information
sinfo — Partition and Node Overview
The sinfo command provides a quick overview of cluster partitions and node states.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 1-00:00:00 85 idle node[005-089]
batch* up 1-00:00:00 10 mix node[090-099]
batch* up 1-00:00:00 4 alloc node[001-004]
batch* up 1-00:00:00 1 down node100
gpu up 1-00:00:00 12 idle gpu[05-16]
gpu up 1-00:00:00 4 alloc gpu[01-04]
highmem up 2-00:00:00 8 idle mem[01-08]
debug up 00:30:00 4 alloc node[001-004]
Node states:
- idle — Available, no jobs running
- alloc — Fully allocated to jobs
- mix — Partially allocated (some CPUs free)
- down — Unavailable (hardware issue, admin action)
- drain — Drained: not accepting new jobs and no longer running any (check the reason)
- drng — Draining: finishing its current jobs but accepting no new ones
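For a quick tally rather than a node-by-node scan, the per-node state column can be counted. A small sketch (note that a node belonging to multiple partitions is listed once per partition):
$ sinfo -N -h -o %T | sort | uniq -c   # nodes per state (allocated, idle, mixed, down, ...)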
Detailed node list:
$ sinfo -N -l
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
gpu01 1 gpu allocated 64 2:16:2 512000 0 1 gpu,a100 none
gpu02 1 gpu allocated 64 2:16:2 512000 0 1 gpu,a100 none
node001 1 batch allocated 32 2:8:2 256000 0 1 (null) none
node100 1 batch down* 32 2:8:2 256000 0 1 (null) Node unresponsive
scontrol show node — Detailed Node Information
Use scontrol show node for comprehensive details about a specific node.
$ scontrol show node node001
NodeName=node001 Arch=x86_64 CoresPerSocket=16
CPUAlloc=32 CPUEfctv=32 CPUTot=32 CPULoad=31.45
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node001 NodeHostName=node001 Version=23.02.4
OS=Linux 5.15.0-91-generic #101-Ubuntu SMP
RealMemory=256000 AllocMem=256000 FreeMem=12450 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch,debug
BootTime=2026-01-10T06:00:00 SlurmdStartTime=2026-01-10T06:01:30
LastBusyTime=2026-01-17T10:34:15
CfgTRES=cpu=32,mem=256000M,billing=32
AllocTRES=cpu=32,mem=256000M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Key fields:
- State — Current node state
- CPUAlloc/CPUTot — CPUs in use vs total available
- CPULoad — Current CPU load (should roughly match CPUAlloc)
- RealMemory/AllocMem/FreeMem — Memory status in MB
- Gres — Generic resources (GPUs, etc.)
- Reason — Why node is down/drained (if applicable)
Check why a node is down:
$ scontrol show node node100 | grep -i "state\|reason"
State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Reason=Node unresponsive [slurm@2026-01-17T09:15:00]
sinfo -R — Nodes With Problems
Quickly list all nodes that have issues and their reasons.
$ sinfo -R
REASON USER TIMESTAMP NODELIST
Node unresponsive slurm 2026-01-17T09:15:00 node100
Hardware failure - memory admin 2026-01-16T14:30:00 node055
Scheduled maintenance admin 2026-01-17T06:00:00 node[080-085]
GPU errors detected slurm 2026-01-17T08:45:00 gpu07
List only drained and down nodes:
$ sinfo -t drain,down
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 1-00:00:00 1 down node100
batch* up 1-00:00:00 7 drain node[055,080-085]
gpu up 1-00:00:00 1 drain gpu07
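Once the underlying issue on a drained node is resolved, it is normally returned to service with scontrol. This is an admin-only action; node055 here is the drained node from the sinfo -R example above:
$ scontrol update NodeName=node055 State=RESUME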
Cluster Health
sdiag — Scheduler Statistics
The sdiag command shows scheduler performance metrics and can reveal bottlenecks.
$ sdiag
*******************************************************
sdiag output at 2026-01-17T10:45:00
Data since 2026-01-17T06:00:00
*******************************************************
Server thread count: 10
Agent queue size: 0
Jobs submitted: 1,245
Jobs started: 1,198
Jobs completed: 1,156
Jobs failed: 23
Jobs cancelled: 19
Main schedule statistics (microseconds):
Last cycle: 125,432
Max cycle: 892,156
Total cycles: 4,521
Mean cycle: 145,678
Mean depth cycle: 1,245
Cycles per minute: 15
Backfilling stats
Total backfilled jobs (since last slurm start): 892
Total backfilled jobs (since last stats cycle start): 156
Total backfilled heterogeneous job components: 0
Total cycles: 4,521
Last cycle when: 2026-01-17T10:44:55
Last cycle: 234,567
Max cycle: 1,456,789
Last depth cycle: 1,892
Last depth cycle (try sched): 245
Depth Mean: 1,456
Depth Mean (try depth): 198
Last queue length: 89
Queue length Mean: 76
Key metrics:
- Jobs failed — High number indicates systemic issues
- Mean cycle — Scheduler cycle time (high values = slow scheduling)
- Max cycle — Worst-case scheduler delay
- Agent queue size — Should be near 0 (backlog indicator)
- Total backfilled jobs — Shows backfill scheduler effectiveness
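When chasing a slow scheduler it helps to watch a handful of these counters over time; the grep pattern below simply picks lines out of the sdiag output shown above:
$ watch -n 30 'sdiag | grep -E "Agent queue size|Mean cycle|Queue length"'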
sprio — Job Priority Breakdown
Understand why jobs are scheduled in a particular order.
$ sprio -l
JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
12345 jsmith 100250 1000 50000 250 49000 0
12346 kwilson 98500 500 48000 500 49500 0
12347 jsmith 95000 100 45000 100 49800 0
12348 pjones 85000 2000 33000 1000 49000 0
Priority components:
- AGE — How long job has been waiting (prevents starvation)
- FAIRSHARE — Based on historical usage (heavy users get lower priority)
- JOBSIZE — Smaller jobs may get priority boost
- PARTITION — Partition-specific priority modifier
- QOS — Quality of Service priority adjustment
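How much weight each component carries is site-specific and set by the PriorityWeight* parameters, which you can read from the running configuration:
$ scontrol show config | grep -i priorityweight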
sreport — Usage Reports
Cluster utilisation:
$ sreport cluster utilization
--------------------------------------------------------------------------------
Cluster Utilization 2026-01-01T00:00:00 - 2026-01-17T10:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
Cluster Allocated Down PLND Down Idle Reserved Total
--------- ------------- ------------- ------------- ------------- ------------- -------------
mycluster 18,456,789 234,567 0 2,345,678 0 21,037,034
Top users by usage:
$ sreport user top start=2026-01-01 end=2026-01-17 -t percent
--------------------------------------------------------------------------------
Top 10 Users 2026-01-01T00:00:00 - 2026-01-16T23:59:59
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
Cluster Login Proper Name Account Used Energy
--------- --------- --------------- ------------- ------ --------
mycluster jsmith John Smith physics 24.5% 0
mycluster kwilson Kate Wilson ai 18.2% 0
mycluster pjones Paul Jones climate 15.8% 0
mycluster agarcia Ana Garcia chem 12.1% 0
mycluster blee Brian Lee biology 9.4% 0
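To slice the same usage by account rather than by individual user, sreport's AccountUtilizationByUser report is the usual follow-up (same date syntax as above):
$ sreport cluster AccountUtilizationByUser start=2026-01-01 end=2026-01-17 -t percent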
Troubleshooting
scontrol ping — Controller Status
$ scontrol ping
Slurmctld(primary) at slurmctl01 is UP
Slurmctld(backup) at slurmctl02 is UP
If the controller is down, no jobs can be scheduled and commands will hang or fail.
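A minimal health-check sketch that strings these commands together (the specific checks and wording are arbitrary, and it assumes the SLURM client tools are on PATH):
#!/bin/bash
# Quick cluster snapshot: controller reachability, problem nodes, queue depth.
set -u

if ! scontrol ping | grep -q UP; then
    echo "WARNING: no slurmctld reports UP" >&2
fi

echo "Nodes with problems:"
sinfo -R

echo "Pending jobs: $(squeue -t PD -h | wc -l)"
echo "Running jobs: $(squeue -t R -h | wc -l)"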
systemctl status — Daemon Status
Controller daemon:
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
Active: active (running) since Wed 2026-01-10 06:00:15 UTC; 1 week 0 days ago
Main PID: 1234 (slurmctld)
Tasks: 15
Memory: 2.4G
CPU: 4h 32min
CGroup: /system.slice/slurmctld.service
└─1234 /usr/sbin/slurmctld -D -s
Jan 17 10:44:55 slurmctl01 slurmctld[1234]: sched: Allocate JobId=12350 NodeList=node[010-012]
Compute node daemon:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
Active: active (running) since Wed 2026-01-10 06:01:30 UTC; 1 week 0 days ago
Main PID: 5678 (slurmd)
Tasks: 3
Memory: 45.2M
CPU: 12min
CGroup: /system.slice/slurmd.service
└─5678 /usr/sbin/slurmd -D -s
Jan 17 10:34:15 node001 slurmd[5678]: launch task StepId=12345.0 request from UID 1001
What to look for:
- Active: active (running) — Daemon is healthy
- Active: failed — Daemon has crashed, check logs
- Memory — Controller memory usage (high values may indicate issues)
- Recent log entries — Look for errors or warnings
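If a daemon shows failed, journalctl on the same unit usually reveals why; filtering the recent entries for errors is a sensible first pass:
$ journalctl -u slurmctld --since "1 hour ago" -p err
$ journalctl -u slurmd --since "1 hour ago" | grep -iE "error|fail"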
scontrol show config — Running Configuration
Dump the active SLURM configuration to verify settings.
$ scontrol show config | head -30
Configuration data as of 2026-01-17T10:45:00
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = slurmdb01
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = slurm
...
Check specific settings:
$ scontrol show config | grep -i preempt
PreemptMode = REQUEUE
PreemptType = preempt/partition_prio
PreemptExemptTime = 00:00:00
$ scontrol show config | grep -i sched
SchedulerParameters = bf_continue,bf_max_job_test=1000,default_queue_depth=1000
SchedulerTimeSlice = 30
SchedulerType = sched/backfill
Quick Reference
| Command | Purpose |
|---|---|
| squeue | View job queue |
| squeue -u user | Jobs for a specific user |
| squeue -j jobid --start | Estimated start time |
| scontrol show job jobid | Detailed job info |
| sacct -j jobid | Job stats after completion |
| sinfo | Partition and node overview |
| sinfo -R | Nodes with problems |
| sinfo -t drain,down | List problem nodes only |
| scontrol show node nodename | Detailed node info |
| sdiag | Scheduler statistics |
| sprio -l | Job priority breakdown |
| sreport cluster utilization | Cluster usage stats |
| sreport user top | Top users by usage |
| scontrol ping | Check controller status |
| scontrol show config | Running configuration |
| systemctl status slurmctld | Controller daemon status |
| systemctl status slurmd | Compute node daemon status |
My Thoughts
Effective SLURM diagnostics comes down to knowing which command gives you what information and being able to interpret the output quickly. When something goes wrong:
- Start with squeue and sinfo for the big picture
- Drill down with scontrol show job or scontrol show node
- Check sacct for jobs that already completed or failed
- Use sinfo -R to find problem nodes fast
- Monitor sdiag for scheduler health
Most issues become obvious once you know where to look. The outputs tell you exactly what’s happening — you just need to know how to read them.
