{"id":2205,"date":"2026-01-17T09:01:02","date_gmt":"2026-01-17T09:01:02","guid":{"rendered":"https:\/\/nicktailor.com\/tech-blog\/?p=2205"},"modified":"2026-01-17T09:01:03","modified_gmt":"2026-01-17T09:01:03","slug":"nick-tailor-notes-essential-slurm-diagnostic-commands-outputs-and-what-they-mean","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/nick-tailor-notes-essential-slurm-diagnostic-commands-outputs-and-what-they-mean\/","title":{"rendered":"Nick Tailor Notes&#8230;Essential SLURM Diagnostic Commands: Outputs and What They Mean"},"content":{"rendered":"\n<article>\n<p>\nWhen managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and\ncluster health is essential. SLURM provides a comprehensive set of commands for this purpose,\nbut understanding the output is just as important as knowing which command to run.\n<\/p>\n\n<p>\nThis post covers the most common SLURM diagnostic commands, their expected outputs, and how\nto interpret what you&#8217;re seeing.\n<\/p>\n\n<hr \/>\n\n<h2>Job Information<\/h2>\n\n<h3>squeue \u2014 View Job Queue<\/h3>\n\n<p>\nThe <code>squeue<\/code> command shows jobs currently in the queue (running and pending).\n<\/p>\n\n<pre><code>$ squeue\nJOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)\n12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]\n12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01\n12347   batch       analysis    jsmith    PD   0:00       2      (Resources)\n12348   long        climate     pjones    PD   0:00       8      (Priority)\n<\/code><\/pre>\n\n<p><strong>Key columns:<\/strong><\/p>\n<ul>\n    <li><strong>ST (State)<\/strong> \u2014 R=Running, PD=Pending, CG=Completing, F=Failed<\/li>\n    <li><strong>TIME<\/strong> \u2014 How long the job has been running<\/li>\n    <li><strong>NODELIST(REASON)<\/strong> \u2014 Which nodes it&#8217;s on, or why it&#8217;s 
pending<\/li>\n<\/ul>\n\n<p><strong>Common pending reasons:<\/strong><\/p>\n<ul>\n    <li><strong>(Resources)<\/strong> \u2014 Waiting for requested resources to become available<\/li>\n    <li><strong>(Priority)<\/strong> \u2014 Other jobs have higher priority<\/li>\n    <li><strong>(ReqNodeNotAvail)<\/strong> \u2014 Requested nodes are down or reserved<\/li>\n    <li><strong>(QOSMaxJobsPerUserLimit)<\/strong> \u2014 User hit their job limit<\/li>\n    <li><strong>(Dependency)<\/strong> \u2014 Waiting for another job to complete<\/li>\n<\/ul>\n\n<h4>Filter by user or partition:<\/h4>\n\n<pre><code>$ squeue -u jsmith\nJOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)\n12345   batch       simulate    jsmith    R    2:34:15    4      node[001-004]\n12347   batch       analysis    jsmith    PD   0:00       2      (Resources)\n\n$ squeue -p gpu\nJOBID   PARTITION   NAME        USER      ST   TIME       NODES  NODELIST(REASON)\n12346   gpu         train_ml    kwilson   R    0:45:22    1      gpu01\n<\/code><\/pre>\n\n<hr \/>\n\n<h3>scontrol show job \u2014 Detailed Job Information<\/h3>\n\n<p>\nUse <code>scontrol show job<\/code> to get comprehensive details about a specific job.\n<\/p>\n\n<pre><code>$ scontrol show job 12345\nJobId=12345 JobName=simulate\n   UserId=jsmith(1001) GroupId=research(100) MCS_label=N\/A\n   Priority=4294901720 Nice=0 Account=physics QOS=normal\n   JobState=RUNNING Reason=None Dependency=(null)\n   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n   RunTime=02:34:15 TimeLimit=12:00:00 TimeMin=N\/A\n   SubmitTime=2026-01-17T08:00:00 EligibleTime=2026-01-17T08:00:05\n   AccrueTime=2026-01-17T08:00:05\n   StartTime=2026-01-17T08:00:10 EndTime=2026-01-17T20:00:10 Deadline=N\/A\n   PreemptEligibleTime=2026-01-17T08:00:10 PreemptTime=None\n   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-01-17T08:00:05\n   Partition=batch AllocNode:Sid=login01:54321\n   ReqNodeList=(null) ExcNodeList=(null)\n   
NodeList=node[001-004]\n   BatchHost=node001\n   NumNodes=4 NumCPUs=32 NumTasks=32 CPUs\/Task=1 ReqB:S:C:T=0:0:*:*\n   TRES=cpu=32,mem=256G,node=4,billing=32\n   Socks\/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*\n   MinCPUsNode=8 MinMemoryNode=64G MinTmpDiskNode=0\n   Features=(null) DelayBoot=00:00:00\n   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)\n   Command=\/home\/jsmith\/jobs\/simulate.sh\n   WorkDir=\/home\/jsmith\/jobs\n   StdErr=\/home\/jsmith\/jobs\/simulate_12345.out\n   StdIn=\/dev\/null\n   StdOut=\/home\/jsmith\/jobs\/simulate_12345.out\n   Power=\n<\/code><\/pre>\n\n<p><strong>Key fields to check:<\/strong><\/p>\n<ul>\n    <li><strong>JobState<\/strong> \u2014 Current state (RUNNING, PENDING, FAILED, COMPLETED, TIMEOUT)<\/li>\n    <li><strong>Reason<\/strong> \u2014 Why job is pending or failed<\/li>\n    <li><strong>Priority<\/strong> \u2014 Job priority (higher = scheduled sooner)<\/li>\n    <li><strong>RunTime vs TimeLimit<\/strong> \u2014 How long it&#8217;s run vs maximum allowed<\/li>\n    <li><strong>NodeList<\/strong> \u2014 Which nodes the job is running on<\/li>\n    <li><strong>ExitCode<\/strong> \u2014 Exit status (0:0 = success, non-zero = failure)<\/li>\n    <li><strong>TRES<\/strong> \u2014 Resources allocated (CPUs, memory, GPUs)<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>sacct \u2014 Job Statistics After Completion<\/h3>\n\n<p>\nUse <code>sacct<\/code> to view resource usage after a job completes. 
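<\/p>

<p>
Accounting data is also easy to script against. As a minimal sketch (a captured sample stands in for a live cluster here), the pipe-delimited output of <code>sacct --parsable2<\/code> can be summarised with awk to count jobs per state; the field list passed to <code>--format<\/code> is an assumption to adapt for your site:
<\/p>

```shell
# Sketch only: the sample variable stands in for live sacct output.
# On a real cluster you would generate it with something like:
#   sacct -a --starttime=midnight --noheader --parsable2 --format=JobID,State,ExitCode
sample='12340|COMPLETED|0:0
12341|COMPLETED|0:0
12342|TIMEOUT|0:1
12343|FAILED|1:0
12344|OUT_OF_MEMORY|0:137'

# Tally jobs per state, most common first; a spike in FAILED or
# OUT_OF_MEMORY usually points at a systemic problem.
printf '%s\n' "$sample" | awk -F'|' '{count[$2]++} END {for (s in count) print count[s], s}' | sort -rn
```

<p>
The job IDs and states above mirror the sample <code>sacct<\/code> output later in this section.
<\/p>

<p>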
This is essential for\nunderstanding why jobs failed or ran slowly.\n<\/p>\n\n<pre><code>$ sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,CPUTime,ExitCode\nJobID           JobName    Partition  State      Elapsed     MaxRSS   MaxVMSize    CPUTime  ExitCode\n------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------\n12345          simulate      batch  COMPLETED   02:45:30                         88:00:00      0:0\n12345.batch       batch             COMPLETED   02:45:30   52428800K  62914560K  02:45:30      0:0\n12345.0        simulate             COMPLETED   02:45:30   48576512K  58720256K  85:14:30      0:0\n<\/code><\/pre>\n\n<p><strong>Key columns:<\/strong><\/p>\n<ul>\n    <li><strong>State<\/strong> \u2014 COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED<\/li>\n    <li><strong>Elapsed<\/strong> \u2014 Actual wall time used<\/li>\n    <li><strong>MaxRSS<\/strong> \u2014 Peak memory usage (resident set size)<\/li>\n    <li><strong>CPUTime<\/strong> \u2014 Total CPU time consumed (cores \u00d7 wall time)<\/li>\n    <li><strong>ExitCode<\/strong> \u2014 Exit status (0:0 = success)<\/li>\n<\/ul>\n\n<p><strong>Common failure states:<\/strong><\/p>\n<ul>\n    <li><strong>FAILED<\/strong> \u2014 Job exited with non-zero exit code<\/li>\n    <li><strong>TIMEOUT<\/strong> \u2014 Job exceeded time limit<\/li>\n    <li><strong>OUT_OF_MEMORY<\/strong> \u2014 Job exceeded memory limit (exit code 0:137)<\/li>\n    <li><strong>CANCELLED<\/strong> \u2014 Job was cancelled by user or admin<\/li>\n<\/ul>\n\n<h4>Check all jobs since midnight:<\/h4>\n\n<pre><code>$ sacct -a --starttime=midnight --format=JobID,User,Partition,State,Elapsed,ExitCode\nJobID             User  Partition      State    Elapsed  ExitCode\n------------ --------- ---------- ---------- ---------- --------\n12340           jsmith      batch  COMPLETED   01:23:45      0:0\n12341          kwilson        gpu  COMPLETED   04:56:12      
0:0\n12342           pjones       long    TIMEOUT   24:00:00      0:1\n12343           jsmith      batch     FAILED   00:05:23      1:0\n12344          kwilson        gpu OUT_OF_ME+   00:12:34      0:137\n<\/code><\/pre>\n\n<hr \/>\n\n<h3>squeue --start \u2014 Estimated Start Time<\/h3>\n\n<p>\nFor pending jobs, <code>squeue --start<\/code> shows when SLURM expects the job to start.\n<\/p>\n\n<pre><code>$ squeue -j 12348 --start\nJOBID   PARTITION   NAME        USER      ST   START_TIME           NODES  NODELIST(REASON)\n12348   long        climate     pjones    PD   2026-01-17T22:00:00  8      (Priority)\n<\/code><\/pre>\n\n<p>\nIf START_TIME shows &#8220;N\/A&#8221; or a date far in the future, the job may be blocked by resource\nconstraints or priority issues.\n<\/p>\n\n<hr \/>\n\n<h2>Node Information<\/h2>\n\n<h3>sinfo \u2014 Partition and Node Overview<\/h3>\n\n<p>\nThe <code>sinfo<\/code> command provides a quick overview of cluster partitions and node states.\n<\/p>\n\n<pre><code>$ sinfo\nPARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST\nbatch*      up     1-00:00:00    85  idle   node[005-089]\nbatch*      up     1-00:00:00    10  mix    node[090-099]\nbatch*      up     1-00:00:00     4  alloc  node[001-004]\nbatch*      up     1-00:00:00     1  down   node100\ngpu         up     1-00:00:00    12  idle   gpu[05-16]\ngpu         up     1-00:00:00     4  alloc  gpu[01-04]\nhighmem     up     2-00:00:00     8  idle   mem[01-08]\ndebug       up     00:30:00       4  idle   node[001-004]\n<\/code><\/pre>\n\n<p><strong>Node states:<\/strong><\/p>\n<ul>\n    <li><strong>idle<\/strong> \u2014 Available, no jobs running<\/li>\n    <li><strong>alloc<\/strong> \u2014 Fully allocated to jobs<\/li>\n    <li><strong>mix<\/strong> \u2014 Partially allocated (some CPUs free)<\/li>\n    <li><strong>down<\/strong> \u2014 Unavailable (hardware issue, admin action)<\/li>\n    <li><strong>drain<\/strong> \u2014 Completing current jobs, accepting no new ones<\/li>\n    
<li><strong>drng<\/strong> \u2014 Draining with jobs still running<\/li>\n<\/ul>\n\n<h4>Detailed node list:<\/h4>\n\n<pre><code>$ sinfo -N -l\nNODELIST    NODES  PARTITION  STATE       CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON\ngpu01           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none\ngpu02           1  gpu        allocated     64  2:16:2  512000         0       1  gpu,a100  none\nnode001         1  batch      allocated     32  2:8:2   256000         0       1  (null)    none\nnode100         1  batch      down*         32  2:8:2   256000         0       1  (null)    Node unresponsive\n<\/code><\/pre>\n\n<hr \/>\n\n<h3>scontrol show node \u2014 Detailed Node Information<\/h3>\n\n<p>\nUse <code>scontrol show node<\/code> for comprehensive details about a specific node.\n<\/p>\n\n<pre><code>$ scontrol show node node001\nNodeName=node001 Arch=x86_64 CoresPerSocket=16\n   CPUAlloc=32 CPUEfctv=32 CPUTot=32 CPULoad=31.45\n   AvailableFeatures=(null)\n   ActiveFeatures=(null)\n   Gres=(null)\n   NodeAddr=node001 NodeHostName=node001 Version=23.02.4\n   OS=Linux 5.15.0-91-generic #101-Ubuntu SMP\n   RealMemory=256000 AllocMem=256000 FreeMem=12450 Sockets=2 Boards=1\n   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N\/A MCS_label=N\/A\n   Partitions=batch,debug\n   BootTime=2026-01-10T06:00:00 SlurmdStartTime=2026-01-10T06:01:30\n   LastBusyTime=2026-01-17T10:34:15\n   CfgTRES=cpu=32,mem=256000M,billing=32\n   AllocTRES=cpu=32,mem=256000M\n   CapWatts=n\/a\n   CurrentWatts=0 AveWatts=0\n   ExtSensorsJoules=n\/s ExtSensorsWatts=0 ExtSensorsTemp=n\/s\n<\/code><\/pre>\n\n<p><strong>Key fields:<\/strong><\/p>\n<ul>\n    <li><strong>State<\/strong> \u2014 Current node state<\/li>\n    <li><strong>CPUAlloc\/CPUTot<\/strong> \u2014 CPUs in use vs total available<\/li>\n    <li><strong>CPULoad<\/strong> \u2014 Current CPU load (should roughly match CPUAlloc)<\/li>\n    
<li><strong>RealMemory\/AllocMem\/FreeMem<\/strong> \u2014 Memory status in MB<\/li>\n    <li><strong>Gres<\/strong> \u2014 Generic resources (GPUs, etc.)<\/li>\n    <li><strong>Reason<\/strong> \u2014 Why node is down\/drained (if applicable)<\/li>\n<\/ul>\n\n<h4>Check why a node is down:<\/h4>\n\n<pre><code>$ scontrol show node node100 | grep -i \"state\\|reason\"\n   State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N\/A MCS_label=N\/A\n   Reason=Node unresponsive [slurm@2026-01-17T09:15:00]\n<\/code><\/pre>\n\n<hr \/>\n\n<h3>sinfo -R \u2014 Nodes With Problems<\/h3>\n\n<p>\nQuickly list all nodes that have issues and their reasons.\n<\/p>\n\n<pre><code>$ sinfo -R\nREASON                              USER        TIMESTAMP           NODELIST\nNode unresponsive                   slurm       2026-01-17T09:15:00 node100\nHardware failure - memory           admin       2026-01-16T14:30:00 node055\nScheduled maintenance               admin       2026-01-17T06:00:00 node[080-085]\nGPU errors detected                 slurm       2026-01-17T08:45:00 gpu07\n<\/code><\/pre>\n\n<h4>List only drained and down nodes:<\/h4>\n\n<pre><code>$ sinfo -t drain,down\nPARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST\nbatch*      up     1-00:00:00     1  down   node100\nbatch*      up     1-00:00:00     7  drain  node[055,080-085]\ngpu         up     1-00:00:00     1  drain  gpu07\n<\/code><\/pre>\n\n<hr \/>\n\n<h2>Cluster Health<\/h2>\n\n<h3>sdiag \u2014 Scheduler Statistics<\/h3>\n\n<p>\nThe <code>sdiag<\/code> command shows scheduler performance metrics and can reveal bottlenecks.\n<\/p>\n\n<pre><code>$ sdiag\n*******************************************************\nsdiag output at 2026-01-17T10:45:00\nData since      2026-01-17T06:00:00\n*******************************************************\nServer thread count: 10\nAgent queue size:    0\n\nJobs submitted: 1,245\nJobs started:   1,198\nJobs completed: 1,156\nJobs failed:    23\nJobs cancelled: 19\n\nMain 
schedule statistics (microseconds):\n    Last cycle:   125,432\n    Max cycle:    892,156\n    Total cycles: 4,521\n    Mean cycle:   145,678\n    Mean depth cycle:  1,245\n    Cycles per minute: 15\n\nBackfilling stats\n    Total backfilled jobs (since last slurm start): 892\n    Total backfilled jobs (since last stats cycle start): 156\n    Total backfilled heterogeneous job components: 0\n    Total cycles: 4,521\n    Last cycle when: 2026-01-17T10:44:55\n    Last cycle: 234,567\n    Max cycle:  1,456,789\n    Last depth cycle: 1,892\n    Last depth cycle (try sched): 245\n    Depth Mean: 1,456\n    Depth Mean (try depth): 198\n    Last queue length: 89\n    Queue length Mean: 76\n<\/code><\/pre>\n\n<p><strong>Key metrics:<\/strong><\/p>\n<ul>\n    <li><strong>Jobs failed<\/strong> \u2014 High number indicates systemic issues<\/li>\n    <li><strong>Mean cycle<\/strong> \u2014 Scheduler cycle time (high values = slow scheduling)<\/li>\n    <li><strong>Max cycle<\/strong> \u2014 Worst-case scheduler delay<\/li>\n    <li><strong>Agent queue size<\/strong> \u2014 Should be near 0 (backlog indicator)<\/li>\n    <li><strong>Total backfilled jobs<\/strong> \u2014 Shows backfill scheduler effectiveness<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>sprio \u2014 Job Priority Breakdown<\/h3>\n\n<p>\nUnderstand why jobs are scheduled in a particular order.\n<\/p>\n\n<pre><code>$ sprio -l\nJOBID     USER      PRIORITY    AGE       FAIRSHARE  JOBSIZE   PARTITION  QOS\n12345     jsmith    100250      1000      50000      250       49000      0\n12346     kwilson   98500       500       48000      500       49500      0\n12347     jsmith    95000       100       45000      100       49800      0\n12348     pjones    85000       2000      33000      1000      49000      0\n<\/code><\/pre>\n\n<p><strong>Priority components:<\/strong><\/p>\n<ul>\n    <li><strong>AGE<\/strong> \u2014 How long job has been waiting (prevents starvation)<\/li>\n    <li><strong>FAIRSHARE<\/strong> \u2014 Based on 
historical usage (heavy users get lower priority)<\/li>\n    <li><strong>JOBSIZE<\/strong> \u2014 Smaller jobs may get priority boost<\/li>\n    <li><strong>PARTITION<\/strong> \u2014 Partition-specific priority modifier<\/li>\n    <li><strong>QOS<\/strong> \u2014 Quality of Service priority adjustment<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>sreport \u2014 Usage Reports<\/h3>\n\n<h4>Cluster utilisation:<\/h4>\n\n<pre><code>$ sreport cluster utilization\n--------------------------------------------------------------------------------\nCluster Utilization 2026-01-01T00:00:00 - 2026-01-17T10:59:59\nUsage reported in CPU Minutes\n--------------------------------------------------------------------------------\n  Cluster     Allocated          Down     PLND Down          Idle    Reserved       Total \n--------- ------------- ------------- ------------- ------------- ------------- -------------\n  mycluster     18,456,789       234,567             0     2,345,678             0    21,037,034\n<\/code><\/pre>\n\n<h4>Top users by usage:<\/h4>\n\n<pre><code>$ sreport user top start=2026-01-01 end=2026-01-17 -t percent\n--------------------------------------------------------------------------------\nTop 10 Users 2026-01-01T00:00:00 - 2026-01-16T23:59:59\nUsage reported in CPU Minutes\n--------------------------------------------------------------------------------\n  Cluster     Login     Proper Name       Account   Used   Energy\n--------- --------- --------------- ------------- ------ --------\nmycluster    jsmith    John Smith         physics  24.5%        0\nmycluster   kwilson    Kate Wilson             ai  18.2%        0\nmycluster    pjones    Paul Jones        climate  15.8%        0\nmycluster    agarcia   Ana Garcia          chem   12.1%        0\nmycluster    blee      Brian Lee        biology    9.4%        0\n<\/code><\/pre>\n\n<hr \/>\n\n<h2>Troubleshooting<\/h2>\n\n<h3>scontrol ping \u2014 Controller Status<\/h3>\n\n<pre><code>$ scontrol ping\nSlurmctld(primary) at 
slurmctl01 is UP\nSlurmctld(backup) at slurmctl02 is UP\n<\/code><\/pre>\n\n<p>\nIf the controller is down, no jobs can be scheduled and commands will hang or fail.\n<\/p>\n\n<hr \/>\n\n<h3>systemctl status \u2014 Daemon Status<\/h3>\n\n<h4>Controller daemon:<\/h4>\n\n<pre><code>$ systemctl status slurmctld\n\u25cf slurmctld.service - Slurm controller daemon\n     Loaded: loaded (\/lib\/systemd\/system\/slurmctld.service; enabled)\n     Active: active (running) since Sat 2026-01-10 06:00:15 UTC; 1 week 0 days ago\n   Main PID: 1234 (slurmctld)\n      Tasks: 15\n     Memory: 2.4G\n        CPU: 4h 32min\n     CGroup: \/system.slice\/slurmctld.service\n             \u2514\u25001234 \/usr\/sbin\/slurmctld -D -s\n\nJan 17 10:44:55 slurmctl01 slurmctld[1234]: sched: Allocate JobId=12350 NodeList=node[010-012]\n<\/code><\/pre>\n\n<h4>Compute node daemon:<\/h4>\n\n<pre><code>$ systemctl status slurmd\n\u25cf slurmd.service - Slurm node daemon\n     Loaded: loaded (\/lib\/systemd\/system\/slurmd.service; enabled)\n     Active: active (running) since Sat 2026-01-10 06:01:30 UTC; 1 week 0 days ago\n   Main PID: 5678 (slurmd)\n      Tasks: 3\n     Memory: 45.2M\n        CPU: 12min\n     CGroup: \/system.slice\/slurmd.service\n             \u2514\u25005678 \/usr\/sbin\/slurmd -D -s\n\nJan 17 10:34:15 node001 slurmd[5678]: launch task StepId=12345.0 request from UID 1001\n<\/code><\/pre>\n\n<p><strong>What to look for:<\/strong><\/p>\n<ul>\n    <li><strong>Active: active (running)<\/strong> \u2014 Daemon is healthy<\/li>\n    <li><strong>Active: failed<\/strong> \u2014 Daemon has crashed, check logs<\/li>\n    <li><strong>Memory<\/strong> \u2014 Controller memory usage (high values may indicate issues)<\/li>\n    <li><strong>Recent log entries<\/strong> \u2014 Look for errors or warnings<\/li>\n<\/ul>\n\n<hr \/>\n\n<h3>scontrol show config \u2014 Running Configuration<\/h3>\n\n<p>\nDump the active SLURM configuration to verify settings.\n<\/p>\n\n<pre><code>$ scontrol show 
config | head -30\nConfiguration data as of 2026-01-17T10:45:00\nAccountingStorageBackupHost = (null)\nAccountingStorageEnforce    = associations,limits,qos,safe\nAccountingStorageHost       = slurmdb01\nAccountingStorageParameters = (null)\nAccountingStoragePort       = 6819\nAccountingStorageType       = accounting_storage\/slurmdbd\nAccountingStorageUser       = slurm\n...\n<\/code><\/pre>\n\n<h4>Check specific settings:<\/h4>\n\n<pre><code>$ scontrol show config | grep -i preempt\nPreemptMode             = REQUEUE\nPreemptType             = preempt\/partition_prio\nPreemptExemptTime       = 00:00:00\n\n$ scontrol show config | grep -i sched\nSchedulerParameters     = bf_continue,bf_max_job_test=1000,default_queue_depth=1000\nSchedulerTimeSlice      = 30\nSchedulerType           = sched\/backfill\n<\/code><\/pre>\n\n<hr \/>\n\n<h2>Quick Reference<\/h2>\n\n<table>\n    <thead>\n        <tr>\n            <th>Command<\/th>\n            <th>Purpose<\/th>\n        <\/tr>\n    <\/thead>\n    <tbody>\n        <tr>\n            <td><code>squeue<\/code><\/td>\n            <td>View job queue<\/td>\n        <\/tr>\n        <tr>\n            <td><code>squeue -u user<\/code><\/td>\n            <td>Jobs for specific user<\/td>\n        <\/tr>\n        <tr>\n            <td><code>squeue -j jobid --start<\/code><\/td>\n            <td>Estimated start time<\/td>\n        <\/tr>\n        <tr>\n            <td><code>scontrol show job jobid<\/code><\/td>\n            <td>Detailed job info<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sacct -j jobid<\/code><\/td>\n            <td>Job stats after completion<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sinfo<\/code><\/td>\n            <td>Partition and node overview<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sinfo -R<\/code><\/td>\n            <td>Nodes with problems<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sinfo -t drain,down<\/code><\/td>\n            <td>List problem 
nodes only<\/td>\n        <\/tr>\n        <tr>\n            <td><code>scontrol show node nodename<\/code><\/td>\n            <td>Detailed node info<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sdiag<\/code><\/td>\n            <td>Scheduler statistics<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sprio -l<\/code><\/td>\n            <td>Job priority breakdown<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sreport cluster utilization<\/code><\/td>\n            <td>Cluster usage stats<\/td>\n        <\/tr>\n        <tr>\n            <td><code>sreport user top<\/code><\/td>\n            <td>Top users by usage<\/td>\n        <\/tr>\n        <tr>\n            <td><code>scontrol ping<\/code><\/td>\n            <td>Check controller status<\/td>\n        <\/tr>\n        <tr>\n            <td><code>scontrol show config<\/code><\/td>\n            <td>Running configuration<\/td>\n        <\/tr>\n        <tr>\n            <td><code>systemctl status slurmctld<\/code><\/td>\n            <td>Controller daemon status<\/td>\n        <\/tr>\n        <tr>\n            <td><code>systemctl status slurmd<\/code><\/td>\n            <td>Compute node daemon status<\/td>\n        <\/tr>\n    <\/tbody>\n<\/table>\n\n<hr \/>\n\n<h2>My Thoughts<\/h2>\n\n<p>\nEffective SLURM diagnostics comes down to knowing which command gives you what information\nand being able to interpret the output quickly. When something goes wrong:\n<\/p>\n\n<ul>\n    <li>Start with <code>squeue<\/code> and <code>sinfo<\/code> for the big picture<\/li>\n    <li>Drill down with <code>scontrol show job<\/code> or <code>scontrol show node<\/code><\/li>\n    <li>Check <code>sacct<\/code> for jobs that already completed or failed<\/li>\n    <li>Use <code>sinfo -R<\/code> to find problem nodes fast<\/li>\n    <li>Monitor <code>sdiag<\/code> for scheduler health<\/li>\n<\/ul>\n\n<p>\nMost issues become obvious once you know where to look. 
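<\/p>

<p>
The first of those steps can even be collapsed into a one-liner. As a rough sketch (run here against a captured sample rather than a live cluster), tallying the pending-reason column from <code>squeue<\/code> makes the current bottleneck stand out immediately:
<\/p>

```shell
# Sketch only: the variable below stands in for live squeue output.
# On a real cluster you would generate it with: squeue --noheader -t PD -o '%r'
pending_reasons='Resources
Priority
Priority
QOSMaxJobsPerUserLimit'

# Count pending jobs per reason, most common first.
printf '%s\n' "$pending_reasons" | sort | uniq -c | sort -rn
```

<p>
Here most pending jobs are simply queued behind higher-priority work, which is normal; a pile-up behind something like <code>ReqNodeNotAvail<\/code> would instead point at down or reserved nodes.
<\/p>

<p>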
The outputs tell you exactly what&#8217;s\nhappening \u2014 you just need to know how to read them.\n<\/p>\n\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run. This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you&#8217;re<a href=\"https:\/\/nicktailor.com\/tech-blog\/nick-tailor-notes-essential-slurm-diagnostic-commands-outputs-and-what-they-mean\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-2205","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2205","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=2205"}],"version-history":[{"count":1,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2205\/revisions"}],"predecessor-version":[{"id":2206,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2205\/revisions\/2206"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/media?parent=2205"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-
blog\/wp-json\/wp\/v2\/categories?post=2205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=2205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}