{"id":2203,"date":"2026-01-17T08:55:26","date_gmt":"2026-01-17T08:55:26","guid":{"rendered":"https:\/\/nicktailor.com\/tech-blog\/?p=2203"},"modified":"2026-01-18T03:10:22","modified_gmt":"2026-01-18T03:10:22","slug":"slurm-production-partitions-a-practical-guide-to-job-scheduling","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/slurm-production-partitions-a-practical-guide-to-job-scheduling\/","title":{"rendered":"SLURM Production Partitions: A Practical Guide to Job Scheduling"},"content":{"rendered":"<article>When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times.This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type.<\/p>\n<hr \/>\n<h2>What is a SLURM Partition?<\/h2>\n<p>A <strong>partition<\/strong> in SLURM is a logical grouping of compute nodes with shared attributes and scheduling policies. 
Think of partitions as queues: users submit jobs to a partition, and SLURM schedules them according to that partition&#8217;s rules.<\/p>\n<p>Partitions allow you to:<\/p>\n<ul>\n<li>Separate hardware types (GPU nodes, high-memory nodes, standard compute)<\/li>\n<li>Set different time limits and priorities<\/li>\n<li>Control access for different user groups<\/li>\n<li>Apply different preemption and scheduling policies<\/li>\n<li>Track usage for billing and chargeback<\/li>\n<\/ul>\n<hr \/>\n<h2>Typical Production Partition Layout<\/h2>\n<p>A typical production cluster uses partitions structured by resource type and job priority:<\/p>\n<pre><code># slurm.conf partition configuration\n\nPartitionName=batch    Nodes=node[001-100]  Default=YES  MaxTime=24:00:00  State=UP\nPartitionName=short    Nodes=node[001-100]  MaxTime=1:00:00   Priority=100  State=UP\nPartitionName=long     Nodes=node[001-100]  MaxTime=7-00:00:00  Priority=10  State=UP\nPartitionName=gpu      Nodes=gpu[01-16]     MaxTime=24:00:00  State=UP\nPartitionName=highmem  Nodes=mem[01-08]     MaxTime=24:00:00  State=UP\nPartitionName=debug    Nodes=node[001-004]  MaxTime=00:30:00  Priority=200  State=UP\nPartitionName=preempt  Nodes=node[001-100]  MaxTime=24:00:00  PreemptMode=REQUEUE  State=UP\n<\/code><\/pre>\n<hr \/>\n<h2>Partition Definitions<\/h2>\n<h3>batch<\/h3>\n<p>The <strong>batch<\/strong> partition is the default queue where most standard compute jobs land. It provides a balance between time limits and priority, suitable for the majority of production workloads. If a user submits a job without specifying a partition, it goes here.<\/p>\n<h3>short<\/h3>\n<p>The <strong>short<\/strong> partition is for quick jobs that need fast turnaround. Higher priority ensures these jobs start quickly, but strict time limits (typically 1 hour or less) prevent users from abusing it for long-running work. 
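<\/p>\n<p>Jobs can be steered here either with an <code>#SBATCH --partition=short<\/code> directive in the script or by overriding directives on the command line; the script name and job ID below are illustrative:<\/p>\n<pre><code>$ sbatch --partition=short --time=00:45:00 preprocess.sh\nSubmitted batch job 104829\n<\/code><\/pre>\n<p>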
Ideal for pre-processing, quick analyses, and iterative development.<\/p>\n<h3>long<\/h3>\n<p>The <strong>long<\/strong> partition accommodates multi-day jobs such as climate simulations, molecular dynamics, or large-scale training runs. Lower priority prevents these jobs from blocking shorter work, but they get scheduled during quieter periods or through backfill.<\/p>\n<h3>gpu<\/h3>\n<p>The <strong>gpu<\/strong> partition contains nodes equipped with GPUs (NVIDIA A100s, H100s, etc.). Separating GPU resources ensures expensive accelerators aren&#8217;t wasted on CPU-only workloads and allows for GPU-specific scheduling policies and billing.<\/p>\n<h3>highmem<\/h3>\n<p>The <strong>highmem<\/strong> partition groups high-memory nodes (typically 1TB+ RAM) for memory-intensive workloads like genome assembly, large-scale data analysis, or in-memory databases. These nodes are expensive, so isolating them prevents standard jobs from occupying them unnecessarily.<\/p>\n<h3>debug<\/h3>\n<p>The <strong>debug<\/strong> partition provides rapid access for testing and development. Highest priority and very short time limits (15-30 minutes) ensure users can quickly validate their scripts before submitting large production jobs. Usually limited to a small subset of nodes.<\/p>\n<h3>preempt<\/h3>\n<p>The <strong>preempt<\/strong> partition offers opportunistic access to idle resources. Jobs here can be killed and requeued when higher-priority work arrives. Ideal for fault-tolerant workloads that checkpoint regularly. Users get free cycles in exchange for accepting interruption.<\/p>\n<hr \/>\n<h2>Job Script Templates<\/h2>\n<p>Below are production-ready job script templates for each partition type. 
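<\/p>\n<p>Once a script is submitted with <code>sbatch<\/code>, you can watch its state with <code>squeue<\/code> (the <code>-u<\/code> flag filters to one user; the job shown below is illustrative):<\/p>\n<pre><code>$ squeue -u $USER\n  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n 104829     batch simulati   nickt  R      12:34      2 node[004-005]\n<\/code><\/pre>\n<p>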
Adjust resource requests to match your specific workload requirements.<\/p>\n<h3>Standard Batch Job<\/h3>\n<p>Use the batch partition for typical compute workloads with moderate runtime requirements.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=simulation\n#SBATCH --partition=batch\n#SBATCH --nodes=2\n#SBATCH --ntasks=32\n#SBATCH --cpus-per-task=1\n#SBATCH --mem=64G\n#SBATCH --time=12:00:00\n#SBATCH --output=%x_%j.out\n\nmodule load openmpi\/4.1.4\nmpirun .\/simulate --input data.in\n<\/code><\/pre>\n<hr \/>\n<h3>Debug Job<\/h3>\n<p>Use the debug partition to quickly test job scripts before submitting large production runs. Keep it short \u2014 this partition is for validation, not real work.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=test_run\n#SBATCH --partition=debug\n#SBATCH --nodes=1\n#SBATCH --ntasks=4\n#SBATCH --time=00:15:00\n#SBATCH --output=%x_%j.out\n\n# Quick sanity check before submitting big job\n.\/app --test-mode\n<\/code><\/pre>\n<hr \/>\n<h3>GPU Training Job<\/h3>\n<p>Use the gpu partition for machine learning training, rendering, or any GPU-accelerated workload. Request specific GPU counts and ensure CUDA environments are loaded.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=train_model\n#SBATCH --partition=gpu\n#SBATCH --nodes=1\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=8\n#SBATCH --gpus=4\n#SBATCH --mem=128G\n#SBATCH --time=24:00:00\n#SBATCH --output=%x_%j.out\n\nmodule load cuda\/12.2 python\/3.11\n\nCUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --epochs 100\n<\/code><\/pre>\n<hr \/>\n<h3>High Memory Job<\/h3>\n<p>Use the highmem partition for memory-intensive workloads that exceed standard node capacity. 
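<\/p>\n<p>It is worth confirming how much memory the highmem nodes actually expose before sizing a request; <code>sinfo<\/code> can print per-node memory (reported in MB) via a format string. The node names and figures below are illustrative:<\/p>\n<pre><code>$ sinfo -p highmem -N -o \"%N %m\"\nNODELIST MEMORY\nmem01 1031891\nmem02 1031891\n<\/code><\/pre>\n<p>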
Common use cases include genome assembly, large graph processing, and in-memory analytics.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=genome_assembly\n#SBATCH --partition=highmem\n#SBATCH --nodes=1\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=32\n#SBATCH --mem=1T\n#SBATCH --time=48:00:00\n#SBATCH --output=%x_%j.out\n\nmodule load assembler\/2.1\n\nassembler --threads 32 --memory 900G --input reads.fastq\n<\/code><\/pre>\n<hr \/>\n<h3>Long Running Job<\/h3>\n<p>Use the long partition for multi-day simulations. Always enable email notifications for job completion or failure, and implement checkpointing for fault tolerance.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=climate_sim\n#SBATCH --partition=long\n#SBATCH --nodes=8\n#SBATCH --ntasks=256\n#SBATCH --time=7-00:00:00\n#SBATCH --output=%x_%j.out\n#SBATCH --mail-type=END,FAIL\n#SBATCH --mail-user=user@company.com\n\nmodule load openmpi netcdf\n\nmpirun .\/climate_model --checkpoint-interval 6h\n<\/code><\/pre>\n<hr \/>\n<h3>Preemptible Backfill Job<\/h3>\n<p>Use the preempt partition for opportunistic workloads that can tolerate interruption. The <code>--requeue<\/code> flag ensures the job restarts if preempted. 
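<\/p>\n<p>On preemption SLURM delivers SIGTERM to the job and, after the grace period set by <code>GraceTime<\/code>, SIGKILL, so a trap handler gives the application one last chance to flush state. A minimal bash sketch with a hypothetical checkpoint path:<\/p>\n<pre><code># flush a final checkpoint when SLURM signals preemption\nsave_checkpoint() {\n    echo \"step=$CURRENT_STEP\" > \/scratch\/checkpoints\/state\n}\ntrap 'save_checkpoint; exit 143' TERM\n<\/code><\/pre>\n<p>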
Your application must support checkpointing and resumption.<\/p>\n<pre><code>#!\/bin\/bash\n#SBATCH --job-name=backfill_work\n#SBATCH --partition=preempt\n#SBATCH --nodes=4\n#SBATCH --ntasks=64\n#SBATCH --time=24:00:00\n#SBATCH --requeue\n#SBATCH --output=%x_%j.out\n\n# Must handle being killed and restarted\n.\/app --checkpoint-dir=\/scratch\/checkpoints --resume\n<\/code><\/pre>\n<hr \/>\n<h2>SBATCH Directive Reference<\/h2>\n<p>Common SBATCH directives used across job scripts:<\/p>\n<table>\n<thead>\n<tr>\n<th>Directive<\/th>\n<th>Purpose<\/th>\n<th>Example<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>--job-name<\/code><\/td>\n<td>Job identifier in queue and logs<\/td>\n<td><code>--job-name=my_simulation<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--partition<\/code><\/td>\n<td>Target partition\/queue<\/td>\n<td><code>--partition=gpu<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--nodes<\/code><\/td>\n<td>Number of nodes required<\/td>\n<td><code>--nodes=4<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--ntasks<\/code><\/td>\n<td>Total number of tasks (MPI ranks)<\/td>\n<td><code>--ntasks=64<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--cpus-per-task<\/code><\/td>\n<td>CPU cores per task (for threading)<\/td>\n<td><code>--cpus-per-task=8<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--mem<\/code><\/td>\n<td>Memory per node<\/td>\n<td><code>--mem=128G<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--gpus<\/code><\/td>\n<td>Number of GPUs required<\/td>\n<td><code>--gpus=4<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--time<\/code><\/td>\n<td>Maximum wall time (D-HH:MM:SS)<\/td>\n<td><code>--time=24:00:00<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--output<\/code><\/td>\n<td>Standard output file (%x=job name, %j=job ID)<\/td>\n<td><code>--output=%x_%j.out<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--mail-type<\/code><\/td>\n<td>Email notification triggers<\/td>\n<td><code>--mail-type=END,FAIL<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>--requeue<\/code><\/td>\n<td>Requeue job if preempted or 
failed<\/td>\n<td><code>--requeue<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n<h2>Partition Selection Guide<\/h2>\n<table>\n<thead>\n<tr>\n<th>Partition<\/th>\n<th>Typical Use Case<\/th>\n<th>Time Limit<\/th>\n<th>Priority<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>debug<\/strong><\/td>\n<td>Testing scripts before production runs<\/td>\n<td>15-30 min<\/td>\n<td>Highest<\/td>\n<\/tr>\n<tr>\n<td><strong>short<\/strong><\/td>\n<td>Quick jobs, preprocessing, iteration<\/td>\n<td>1 hour<\/td>\n<td>High<\/td>\n<\/tr>\n<tr>\n<td><strong>batch<\/strong><\/td>\n<td>Standard compute workloads<\/td>\n<td>24 hours<\/td>\n<td>Normal<\/td>\n<\/tr>\n<tr>\n<td><strong>gpu<\/strong><\/td>\n<td>ML training, rendering, GPU compute<\/td>\n<td>24 hours<\/td>\n<td>Normal<\/td>\n<\/tr>\n<tr>\n<td><strong>highmem<\/strong><\/td>\n<td>Genomics, large datasets, in-memory work<\/td>\n<td>48 hours<\/td>\n<td>Normal<\/td>\n<\/tr>\n<tr>\n<td><strong>long<\/strong><\/td>\n<td>Multi-day simulations<\/td>\n<td>7 days<\/td>\n<td>Low<\/td>\n<\/tr>\n<tr>\n<td><strong>preempt<\/strong><\/td>\n<td>Opportunistic, fault-tolerant workloads<\/td>\n<td>24 hours<\/td>\n<td>Lowest<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n<h2>My Thoughts<\/h2>\n<p>A well-structured partition layout is the foundation of effective HPC cluster management. By separating resources by type and priority, you ensure:<\/p>\n<ul>\n<li>Users get appropriate resources for their workloads<\/li>\n<li>Expensive hardware (GPUs, high-memory nodes) is used efficiently<\/li>\n<li>Short jobs don&#8217;t get stuck behind long-running simulations<\/li>\n<li>Testing and development has fast turnaround<\/li>\n<li>Usage can be tracked and billed accurately<\/li>\n<\/ul>\n<p>Start with the templates above and adjust time limits, priorities, and access controls to match your organisation&#8217;s requirements. 
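<\/p>\n<p>When adjusting partition settings, <code>scontrol show partition<\/code> displays the live configuration so you can verify that changes took effect; the output below is abbreviated and illustrative:<\/p>\n<pre><code>$ scontrol show partition batch\nPartitionName=batch\n   AllowGroups=ALL Default=YES QoS=N\/A\n   MaxTime=1-00:00:00 State=UP Nodes=node[001-100]\n<\/code><\/pre>\n<p>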
As your cluster grows, you can add specialised partitions for specific hardware or user groups.<\/p>\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times.This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type. What<a href=\"https:\/\/nicktailor.com\/tech-blog\/slurm-production-partitions-a-practical-guide-to-job-scheduling\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-2203","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2203","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=2203"}],"version-history":[{"count":2,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2203\/revisions"}],"predecessor-version":[{"id":2209,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2203\/revisions\/2209"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/media?parent=2203"}],"wp:term":[{"taxonomy":"category","embeddable":true,"hr
ef":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/categories?post=2203"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=2203"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}