SLURM Production Partitions: A Practical Guide to Job Scheduling

When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures that the right jobs land on the right hardware, that scheduling stays fair across user groups, and that turnaround times remain predictable. This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type.


What is a SLURM Partition?

A partition in SLURM is a logical grouping of compute nodes with shared attributes and scheduling policies. Think of partitions as queues: users submit jobs to a partition, and SLURM schedules them according to that partition’s rules.

Partitions allow you to:

  • Separate hardware types (GPU nodes, high-memory nodes, standard compute)
  • Set different time limits and priorities
  • Control access for different user groups
  • Apply different preemption and scheduling policies
  • Track usage for billing and chargeback
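
You can inspect a cluster's partitions, their limits, and node states straight from the command line; the commands below are standard SLURM tooling, though the exact output varies by site:

# Summarise all partitions: availability, time limit, node counts
sinfo

# Dump the full configuration of one partition
scontrol show partition batch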

Typical Production Partition Layout

A typical production cluster uses partitions structured by resource type and job priority:

# slurm.conf partition configuration

PartitionName=batch    Nodes=node[001-100]  Default=YES  MaxTime=24:00:00  State=UP
PartitionName=short    Nodes=node[001-100]  MaxTime=1:00:00   Priority=100  State=UP
PartitionName=long     Nodes=node[001-100]  MaxTime=7-00:00:00  Priority=10  State=UP
PartitionName=gpu      Nodes=gpu[01-16]     MaxTime=24:00:00  State=UP
PartitionName=highmem  Nodes=mem[01-08]     MaxTime=48:00:00  State=UP
PartitionName=debug    Nodes=node[001-004]  MaxTime=00:30:00  Priority=200  State=UP
PartitionName=preempt  Nodes=node[001-100]  MaxTime=24:00:00  PreemptMode=REQUEUE  State=UP
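
After editing slurm.conf, the controller has to re-read it before the new layout takes effect. A typical sequence, assuming admin rights and a slurm.conf synchronised across nodes:

# Re-read slurm.conf on the controller and notify the slurmd daemons
scontrol reconfigure

# Confirm a partition picked up the new settings
scontrol show partition highmem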

Partition Definitions

batch

The batch partition is the default queue where most standard compute jobs land. It provides a balance between time limits and priority, suitable for the majority of production workloads. If a user submits a job without specifying a partition, it goes here.

short

The short partition is for quick jobs that need fast turnaround. Higher priority ensures these jobs start quickly, but strict time limits (typically 1 hour or less) prevent users from abusing it for long-running work. Ideal for pre-processing, quick analyses, and iterative development.

long

The long partition accommodates multi-day jobs such as climate simulations, molecular dynamics, or large-scale training runs. Lower priority prevents these jobs from blocking shorter work, but they get scheduled during quieter periods or through backfill.

gpu

The gpu partition contains nodes equipped with GPUs (NVIDIA A100s, H100s, etc.). Separating GPU resources ensures expensive accelerators aren’t wasted on CPU-only workloads and allows for GPU-specific scheduling policies and billing.
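
For SLURM to treat GPUs as schedulable resources, the GPU nodes also need GRES definitions. A minimal sketch; the a100 type, the four-GPU count, and the device paths are assumptions to adapt to your hardware:

# slurm.conf: declare GPUs as a generic resource (GRES)
GresTypes=gpu
NodeName=gpu[01-16] Gres=gpu:a100:4

# gres.conf on each GPU node: map the GRES entries to device files
Name=gpu Type=a100 File=/dev/nvidia[0-3]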

highmem

The highmem partition groups high-memory nodes (typically 1TB+ RAM) for memory-intensive workloads like genome assembly, large-scale data analysis, or in-memory databases. These nodes are expensive, so isolating them prevents standard jobs from occupying them unnecessarily.

debug

The debug partition provides rapid access for testing and development. Highest priority and very short time limits (15-30 minutes) ensure users can quickly validate their scripts before submitting large production jobs. Usually limited to a small subset of nodes.

preempt

The preempt partition offers opportunistic access to idle resources. Jobs here can be killed and requeued when higher-priority work arrives. Ideal for fault-tolerant workloads that checkpoint regularly. Users get free cycles in exchange for accepting interruption.
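
Requeue-style preemption also depends on cluster-wide settings. A minimal sketch of the relevant slurm.conf lines; the grace period value is an assumption:

# Let higher-priority partitions preempt lower-priority ones
PreemptType=preempt/partition_prio

# Give preempted jobs a grace period (in seconds) to checkpoint before requeue
PartitionName=preempt Nodes=node[001-100] MaxTime=24:00:00 PreemptMode=REQUEUE GraceTime=120 State=UP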


Job Script Templates

Below are production-ready job script templates for each partition type. Adjust resource requests to match your specific workload requirements.

Standard Batch Job

Use the batch partition for typical compute workloads with moderate runtime requirements.

#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --output=%x_%j.out

module load openmpi/4.1.4
mpirun ./simulate --input data.in
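
Submission and monitoring follow the usual pattern; assuming the script above is saved as batch_job.sh:

sbatch batch_job.sh           # prints "Submitted batch job <jobid>"
squeue -u $USER               # queue position and job state
scontrol show job <jobid>     # full details for a single job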

Debug Job

Use the debug partition to quickly test job scripts before submitting large production runs. Keep it short — this partition is for validation, not real work.

#!/bin/bash
#SBATCH --job-name=test_run
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:15:00
#SBATCH --output=%x_%j.out

# Quick sanity check before submitting big job
./app --test-mode
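
For hands-on debugging, srun can allocate resources from the debug partition and drop you into a shell on a compute node:

# Interactive session within the debug partition's 15-minute window
srun --partition=debug --ntasks=4 --time=00:15:00 --pty bash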

GPU Training Job

Use the gpu partition for machine learning training, rendering, or any GPU-accelerated workload. Request specific GPU counts and ensure CUDA environments are loaded.

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=4
#SBATCH --mem=128G
#SBATCH --time=24:00:00
#SBATCH --output=%x_%j.out

module load cuda/12.2 python/3.11

# SLURM sets CUDA_VISIBLE_DEVICES to the allocated GPUs; don't override it manually
python train.py --epochs 100
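
It is worth logging the GPUs the job actually received at the start of the script; a quick check, assuming nvidia-smi is installed on the GPU nodes:

echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,name,memory.total --format=csv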

High Memory Job

Use the highmem partition for memory-intensive workloads that exceed standard node capacity. Common use cases include genome assembly, large graph processing, and in-memory analytics.

#!/bin/bash
#SBATCH --job-name=genome_assembly
#SBATCH --partition=highmem
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=1T
#SBATCH --time=48:00:00
#SBATCH --output=%x_%j.out

module load assembler/2.1

assembler --threads 32 --memory 900G --input reads.fastq
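
After the job completes, sacct can confirm whether the memory request was sized sensibly; replace <jobid> with the real job ID:

# Peak resident memory and elapsed time for a finished job
sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State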

Long Running Job

Use the long partition for multi-day simulations. Always enable email notifications for job completion or failure, and implement checkpointing for fault tolerance.

#!/bin/bash
#SBATCH --job-name=climate_sim
#SBATCH --partition=long
#SBATCH --nodes=8
#SBATCH --ntasks=256
#SBATCH --time=7-00:00:00
#SBATCH --output=%x_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@company.com

module load openmpi netcdf

mpirun ./climate_model --checkpoint-interval 6h
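
For workloads that outlast even the 7-day limit, a common pattern is chaining restarts with job dependencies; a minimal sketch, assuming the script above is saved as climate_sim.sh and resumes from its latest checkpoint:

# Submit the first leg, then queue a follow-up that starts only if it succeeds
jobid=$(sbatch --parsable climate_sim.sh)
sbatch --dependency=afterok:$jobid climate_sim.sh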

Preemptible Backfill Job

Use the preempt partition for opportunistic workloads that can tolerate interruption. The --requeue directive puts the job back in the queue if it is preempted, rather than cancelling it outright. Your application must support checkpointing and resumption; a signal-handling sketch follows the script.

#!/bin/bash
#SBATCH --job-name=backfill_work
#SBATCH --partition=preempt
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --time=24:00:00
#SBATCH --requeue
#SBATCH --output=%x_%j.out

# Must handle being killed and restarted
./app --checkpoint-dir=/scratch/checkpoints --resume
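
On preemption, SLURM sends SIGTERM to the job (subject to the partition's GraceTime) before requeueing it, so the script can trap the signal and flush a checkpoint. A minimal sketch; the idea that the application checkpoints on SIGUSR1 is a hypothetical assumption:

# Assumed: the app writes a checkpoint when it receives SIGUSR1
checkpoint_and_exit() {
    echo "Preempted: asking app to checkpoint" >&2
    kill -USR1 "$APP_PID"
    wait "$APP_PID"
    exit 0
}
trap checkpoint_and_exit TERM

# Run the app in the background so the shell can react to SIGTERM promptly
./app --checkpoint-dir=/scratch/checkpoints --resume &
APP_PID=$!
wait "$APP_PID"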

SBATCH Directive Reference

Common SBATCH directives used across job scripts:

Directive        Purpose                                        Example
--job-name       Job identifier in queue and logs               --job-name=my_simulation
--partition      Target partition/queue                         --partition=gpu
--nodes          Number of nodes required                       --nodes=4
--ntasks         Total number of tasks (MPI ranks)              --ntasks=64
--cpus-per-task  CPU cores per task (for threading)             --cpus-per-task=8
--mem            Memory per node                                --mem=128G
--gpus           Number of GPUs required                        --gpus=4
--time           Maximum wall time (D-HH:MM:SS)                 --time=24:00:00
--output         Standard output file (%x=job name, %j=job ID)  --output=%x_%j.out
--mail-type      Email notification triggers                    --mail-type=END,FAIL
--requeue        Requeue job if preempted or node fails         --requeue

Partition Selection Guide

Partition  Typical Use Case                          Time Limit  Priority
debug      Testing scripts before production runs    15-30 min   Highest
short      Quick jobs, preprocessing, iteration      1 hour      High
batch      Standard compute workloads                24 hours    Normal
gpu        ML training, rendering, GPU compute       24 hours    Normal
highmem    Genomics, large datasets, in-memory work  48 hours    Normal
long       Multi-day simulations                     7 days      Low
preempt    Opportunistic, fault-tolerant workloads   24 hours    Lowest
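
Much of this table can be reproduced live from the scheduler; sinfo's format string pulls the partition name, time limit, and node count:

# Partition name, maximum wall time, and node count
sinfo -o "%P %l %D"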

My Thoughts

A well-structured partition layout is the foundation of effective HPC cluster management. By separating resources by type and priority, you ensure:

  • Users get appropriate resources for their workloads
  • Expensive hardware (GPUs, high-memory nodes) is used efficiently
  • Short jobs don’t get stuck behind long-running simulations
  • Testing and development get fast turnaround
  • Usage can be tracked and billed accurately

Start with the templates above and adjust time limits, priorities, and access controls to match your organisation’s requirements. As your cluster grows, you can add specialised partitions for specific hardware or user groups.
