What is a SLURM Partition?
A partition in SLURM is a logical grouping of compute nodes that share attributes and scheduling policies. Think of partitions as queues: users submit jobs to a partition, and SLURM schedules them according to that partition’s rules.
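For example, users can list the available partitions with sinfo and target one explicitly at submission time (train.sh below is just a placeholder script name):
# Show each partition's state, time limit, and node count
sinfo
# Submit a job to a specific partition; -p is the short form of --partition
sbatch --partition=gpu train.sh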
Partitions allow you to:
- Separate hardware types (GPU nodes, high-memory nodes, standard compute)
- Set different time limits and priorities
- Control access for different user groups
- Apply different preemption and scheduling policies
- Track usage for billing and chargeback
Typical Production Partition Layout
A typical production cluster uses partitions structured by resource type and job priority:
# slurm.conf partition configuration
PartitionName=batch Nodes=node[001-100] Default=YES MaxTime=24:00:00 State=UP
PartitionName=short Nodes=node[001-100] MaxTime=1:00:00 Priority=100 State=UP
PartitionName=long Nodes=node[001-100] MaxTime=7-00:00:00 Priority=10 State=UP
PartitionName=gpu Nodes=gpu[01-16] MaxTime=24:00:00 State=UP
PartitionName=highmem Nodes=mem[01-08] MaxTime=2-00:00:00 State=UP
PartitionName=debug Nodes=node[001-004] MaxTime=00:30:00 Priority=200 State=UP
PartitionName=preempt Nodes=node[001-100] MaxTime=24:00:00 PreemptMode=REQUEUE State=UP
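One way to apply and sanity-check a layout like this, assuming you have administrative access on the controller:
# Ask the Slurm daemons to re-read slurm.conf (some changes still require a slurmctld restart)
scontrol reconfigure
# Inspect the resulting partition definition
scontrol show partition batch
# Summarised view of all partitions and their node states
sinfo -s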
Partition Definitions
batch
The batch partition is the default queue where most standard compute jobs land. It provides a balance between time limits and priority, suitable for the majority of production workloads. If a user submits a job without specifying a partition, it goes here.
short
The short partition is for quick jobs that need fast turnaround. Higher priority ensures these jobs start quickly, but strict time limits (typically 1 hour or less) prevent users from abusing it for long-running work. Ideal for pre-processing, quick analyses, and iterative development.
long
The long partition accommodates multi-day jobs such as climate simulations, molecular dynamics, or large-scale training runs. Lower priority prevents these jobs from blocking shorter work, but they get scheduled during quieter periods or through backfill.
gpu
The gpu partition contains nodes equipped with GPUs (NVIDIA A100s, H100s, etc.). Separating GPU resources ensures expensive accelerators aren’t wasted on CPU-only workloads and allows for GPU-specific scheduling policies and billing.
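Before any of this works, the GPU nodes themselves must expose their devices as generic resources (GRES). A minimal sketch, assuming four A100s per node and omitting the other node attributes:
# slurm.conf: enable the GPU GRES plugin and declare the devices on each node
GresTypes=gpu
NodeName=gpu[01-16] Gres=gpu:a100:4 State=UNKNOWN
# gres.conf on each GPU node: map the GRES entries to device files
Name=gpu Type=a100 File=/dev/nvidia[0-3]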
highmem
The highmem partition groups high-memory nodes (typically 1TB+ RAM) for memory-intensive workloads like genome assembly, large-scale data analysis, or in-memory databases. These nodes are expensive, so isolating them prevents standard jobs from occupying them unnecessarily.
debug
The debug partition provides rapid access for testing and development. Highest priority and very short time limits (15-30 minutes) ensure users can quickly validate their scripts before submitting large production jobs. Usually limited to a small subset of nodes.
preempt
The preempt partition offers opportunistic access to idle resources. Jobs here can be killed and requeued when higher-priority work arrives. Ideal for fault-tolerant workloads that checkpoint regularly. Users get free cycles in exchange for accepting interruption.
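A preempt partition only behaves this way if cluster-wide preemption is configured. A minimal sketch in slurm.conf, assuming priority-tier-based preemption (values are illustrative and overlap with the layout above):
# Preempt jobs based on partition priority tiers
PreemptType=preempt/partition_prio
# Default action for preempted jobs; GraceTime gives them time to exit cleanly
PreemptMode=REQUEUE
# Jobs in higher-tier partitions can preempt jobs in lower-tier ones
PartitionName=batch   Nodes=node[001-100] PriorityTier=2 State=UP
PartitionName=preempt Nodes=node[001-100] PriorityTier=1 PreemptMode=REQUEUE GraceTime=120 State=UP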
Job Script Templates
Below are production-ready job script templates for each partition type. Adjust resource requests to match your specific workload requirements.
Standard Batch Job
Use the batch partition for typical compute workloads with moderate runtime requirements.
#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --output=%x_%j.out
module load openmpi/4.1.4
mpirun ./simulate --input data.in
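Submission and monitoring work the same way for every partition; for example (batch_job.sh is a placeholder name for the script above):
# Submit the script; SLURM prints the assigned job ID
sbatch batch_job.sh
# Check the job's state in the queue
squeue -u $USER
# After completion, review accounting information
sacct -j <jobid> --format=JobID,Partition,Elapsed,State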
Debug Job
Use the debug partition to quickly test job scripts before submitting large production runs. Keep it short — this partition is for validation, not real work.
#!/bin/bash
#SBATCH --job-name=test_run
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:15:00
#SBATCH --output=%x_%j.out
# Quick sanity check before submitting big job
./app --test-mode
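For interactive debugging on the same partition, an srun session is often more convenient than a batch script:
# Request an interactive shell on a debug node for up to 15 minutes
srun --partition=debug --nodes=1 --ntasks=4 --time=00:15:00 --pty bash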
GPU Training Job
Use the gpu partition for machine learning training, rendering, or any GPU-accelerated workload. Request specific GPU counts and ensure CUDA environments are loaded.
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=4
#SBATCH --mem=128G
#SBATCH --time=24:00:00
#SBATCH --output=%x_%j.out
module load cuda/12.2 python/3.11
# SLURM normally exports CUDA_VISIBLE_DEVICES for the GPUs it allocated when
# they are requested with --gpus, so avoid hard-coding device IDs here
python train.py --epochs 100
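A quick way to confirm the allocation before training starts is to print the visible devices, for example by adding these lines ahead of the training command:
# Verify which GPUs SLURM handed to the job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
nvidia-smi --query-gpu=index,name,memory.total --format=csv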
High Memory Job
Use the highmem partition for memory-intensive workloads that exceed standard node capacity. Common use cases include genome assembly, large graph processing, and in-memory analytics.
#!/bin/bash
#SBATCH --job-name=genome_assembly
#SBATCH --partition=highmem
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=1T
#SBATCH --time=48:00:00
#SBATCH --output=%x_%j.out
module load assembler/2.1
assembler --threads 32 --memory 900G --input reads.fastq
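After the run, check how much memory the job actually used so the next request can be sized accordingly:
# Peak resident memory, requested memory, and runtime for a finished job
sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,Elapsed,State
# If the seff utility is installed on your cluster, it gives a similar efficiency summary
seff <jobid>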
Long Running Job
Use the long partition for multi-day simulations. Always enable email notifications for job completion or failure, and implement checkpointing for fault tolerance.
#!/bin/bash
#SBATCH --job-name=climate_sim
#SBATCH --partition=long
#SBATCH --nodes=8
#SBATCH --ntasks=256
#SBATCH --time=7-00:00:00
#SBATCH --output=%x_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@company.com
module load openmpi netcdf
mpirun ./climate_model --checkpoint-interval 6h
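The restart logic itself depends on the application. As a rough sketch (the --restart-from flag is hypothetical), the script can look for the most recent checkpoint and resume from it:
CKPT_DIR=/scratch/$USER/climate_ckpt
mkdir -p "$CKPT_DIR"
# Pick the newest checkpoint file, if any exist yet
LATEST=$(ls -t "$CKPT_DIR"/ckpt_* 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    mpirun ./climate_model --checkpoint-interval 6h --restart-from "$LATEST"
else
    mpirun ./climate_model --checkpoint-interval 6h
fi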
Preemptible Backfill Job
Use the preempt partition for opportunistic workloads that can tolerate interruption. The --requeue flag ensures the job restarts if preempted. Your application must support checkpointing and resumption.
#!/bin/bash
#SBATCH --job-name=backfill_work
#SBATCH --partition=preempt
#SBATCH --nodes=4
#SBATCH --ntasks=64
#SBATCH --time=24:00:00
#SBATCH --requeue
#SBATCH --output=%x_%j.out
# Must handle being killed and restarted
./app --checkpoint-dir=/scratch/checkpoints --resume
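When a job here is preempted, SLURM signals it with SIGCONT and SIGTERM and then, after the partition's GraceTime (if one is configured), SIGKILL. Trapping SIGTERM gives the application a chance to write a final checkpoint; a minimal sketch of the job body, with the checkpoint behaviour left to the application:
# Run the application in the background so the shell can react to signals
./app --checkpoint-dir=/scratch/checkpoints --resume &
APP_PID=$!
# On SIGTERM (sent at preemption), forward it so the app can checkpoint and exit
trap 'kill -TERM "$APP_PID"; wait "$APP_PID"' TERM
wait "$APP_PID"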
SBATCH Directive Reference
Common SBATCH directives used across job scripts:
| Directive | Purpose | Example |
|---|---|---|
| --job-name | Job identifier in queue and logs | --job-name=my_simulation |
| --partition | Target partition/queue | --partition=gpu |
| --nodes | Number of nodes required | --nodes=4 |
| --ntasks | Total number of tasks (MPI ranks) | --ntasks=64 |
| --cpus-per-task | CPU cores per task (for threading) | --cpus-per-task=8 |
| --mem | Memory per node | --mem=128G |
| --gpus | Number of GPUs required | --gpus=4 |
| --time | Maximum wall time (D-HH:MM:SS) | --time=24:00:00 |
| --output | Standard output file (%x=job name, %j=job ID) | --output=%x_%j.out |
| --mail-type | Email notification triggers | --mail-type=END,FAIL |
| --requeue | Requeue job if preempted or failed | --requeue |
Partition Selection Guide
| Partition | Typical Use Case | Time Limit | Priority |
|---|---|---|---|
| debug | Testing scripts before production runs | 15-30 min | Highest |
| short | Quick jobs, preprocessing, iteration | 1 hour | High |
| batch | Standard compute workloads | 24 hours | Normal |
| gpu | ML training, rendering, GPU compute | 24 hours | Normal |
| highmem | Genomics, large datasets, in-memory work | 48 hours | Normal |
| long | Multi-day simulations | 7 days | Low |
| preempt | Opportunistic, fault-tolerant workloads | 24 hours | Lowest |
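When in doubt, SLURM can estimate where a job would start soonest without actually running it (job.sh is again a placeholder):
# Validate the script and report an estimated start time without submitting
sbatch --test-only --partition=short job.sh
# Expected start times for jobs already pending in the queue
squeue --start -u $USER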
My Thoughts
A well-structured partition layout is the foundation of effective HPC cluster management. By separating resources by type and priority, you ensure:
- Users get appropriate resources for their workloads
- Expensive hardware (GPUs, high-memory nodes) is used efficiently
- Short jobs don’t get stuck behind long-running simulations
- Testing and development get fast turnaround
- Usage can be tracked and billed accurately
Start with the templates above and adjust time limits, priorities, and access controls to match your organisation’s requirements. As your cluster grows, you can add specialised partitions for specific hardware or user groups.
