SLURM Accounting Setup: my personal notes

SLURM accounting tracks every job that runs on your cluster — who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing.

This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server separate from the controller.


Architecture Overview

In production, you separate the database from the controller for performance and reliability:

Controller Node        Database Node          Compute Nodes
───────────────        ─────────────          ─────────────
slurmctld              slurmdbd               slurmd
                       MariaDB/MySQL          slurmd
                                              slurmd
                                              ...

How it works:

  • slurmctld (scheduler) sends job data to slurmdbd
  • slurmdbd (database daemon) writes to MariaDB/MySQL
  • Compute nodes (slurmd) just run jobs — no database access

The controller never talks directly to the database. slurmdbd is the middleman that handles connection pooling, batches writes, and queues data if the database is temporarily unavailable.
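
The default ports are worth keeping in mind for the firewall steps later:

# Default ports (all configurable in slurm.conf / slurmdbd.conf)
# slurmctld : 6817  (controller)
# slurmd    : 6818  (compute nodes)
# slurmdbd  : 6819  (database node)
# MariaDB   : 3306  (database node, local to slurmdbd in this setup)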


Prerequisites

Before starting, ensure you have:

  • Working SLURM cluster (slurmctld on controller, slurmd on compute nodes)
  • Dedicated database server (can be VM or physical)
  • Network connectivity between controller and database server
  • Consistent SLURM user/group (UID/GID must match across all nodes)
  • Munge authentication working across all nodes (quick checks for the last two items follow below)
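
A quick way to verify the last two items, assuming SSH access from the controller (hostnames are placeholders):

# UID/GID of the slurm user must be identical everywhere
for host in controller dbserver node01; do
    ssh "$host" 'id slurm'
done

# Munge round trip: encode locally, decode on a remote node
munge -n | ssh node01 unmunge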

Step 1: Install MariaDB on Database Server

On your dedicated database server:

# Install MariaDB
sudo apt update
sudo apt install mariadb-server mariadb-client -y

# Start and enable
sudo systemctl start mariadb
sudo systemctl enable mariadb

# Secure installation
sudo mysql_secure_installation

During secure installation:

  • Set root password
  • Remove anonymous users — Yes
  • Disallow root login remotely — Yes
  • Remove test database — Yes
  • Reload privilege tables — Yes

Step 2: Create SLURM Database and User

Log into MariaDB and create the database:

sudo mysql -u root -p

-- Create database
CREATE DATABASE slurm_acct_db;

-- Create slurm user with access from controller node
CREATE USER 'slurm'@'controller.example.com' IDENTIFIED BY 'your_secure_password';

-- Grant privileges
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'controller.example.com';

-- If slurmdbd runs on the database server itself (alternative setup)
-- CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'your_secure_password';
-- GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';

FLUSH PRIVILEGES;
EXIT;

Step 3: Configure MariaDB for Remote Access

Edit MariaDB configuration to allow connections from the controller:

sudo nano /etc/mysql/mariadb.conf.d/50-server.cnf

Find and modify the bind-address:

# Change from
bind-address = 127.0.0.1

# To (listen on all interfaces)
bind-address = 0.0.0.0

# Or specific IP
bind-address = 192.168.1.10

Add performance settings tuned for the SLURM workload:

[mysqld]
bind-address = 0.0.0.0
innodb_buffer_pool_size = 1G
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
max_connections = 200

Restart MariaDB:

sudo systemctl restart mariadb
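
Confirm the new settings took effect:

sudo mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
# Expect 1073741824 (1G in bytes)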

Open firewall if needed:

# UFW
sudo ufw allow from 192.168.1.0/24 to any port 3306

# Or firewalld
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload
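
With the bind address and firewall sorted, confirm the controller can actually reach the database as the slurm user (assumes the mysql client is installed on the controller):

# From the controller node
mysql -h dbserver.example.com -u slurm -p -e 'SHOW DATABASES;'
# slurm_acct_db should appear in the output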

Step 4: Install slurmdbd on Database Server

You can run slurmdbd on the database server or the controller. Running it on the database server keeps database traffic local.

# On database server
sudo apt install slurmdbd -y

Step 5: Configure slurmdbd

Create the slurmdbd configuration file:

sudo nano /etc/slurm/slurmdbd.conf

# slurmdbd.conf - SLURM Database Daemon Configuration

# Daemon settings
# (munge auth is the SLURM default; stated explicitly here)
AuthType=auth/munge
DbdHost=dbserver.example.com
DbdPort=6819
SlurmUser=slurm

# Logging
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid
DebugLevel=info

# Database connection
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=your_secure_password
StorageLoc=slurm_acct_db

# Archive settings (optional)
#ArchiveEvents=yes
#ArchiveJobs=yes
#ArchiveResvs=yes
#ArchiveSteps=no
#ArchiveSuspend=no
#ArchiveTXN=no
#ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive

# Purge old data (optional - keep 12 months)
#PurgeEventAfter=12months
#PurgeJobAfter=12months
#PurgeResvAfter=12months
#PurgeStepAfter=12months
#PurgeSuspendAfter=12months
#PurgeTXNAfter=12months
#PurgeUsageAfter=12months

Set proper permissions:

# slurmdbd.conf must be readable only by SlurmUser (contains password)
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf

# Create log directory
sudo mkdir -p /var/log/slurm
sudo chown slurm:slurm /var/log/slurm
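
Quick check that the permissions are right:

ls -l /etc/slurm/slurmdbd.conf
# Expect: -rw------- 1 slurm slurm ...

sudo -u slurm head -n 1 /etc/slurm/slurmdbd.conf   # should print the first line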

Step 6: Start slurmdbd

Start the daemon and verify it connects to the database:

# Start slurmdbd
sudo systemctl start slurmdbd
sudo systemctl enable slurmdbd

# Check status
sudo systemctl status slurmdbd

# Check logs for errors
sudo tail -f /var/log/slurm/slurmdbd.log

Successful startup looks like:

slurmdbd: debug:  slurmdbd version 23.02.4 started
slurmdbd: debug:  Listening on 0.0.0.0:6819
slurmdbd: info:   Registering cluster(s) with database

Step 7: Configure slurmctld to Use Accounting

On your controller node, edit slurm.conf:

sudo nano /etc/slurm/slurm.conf

Add accounting configuration:

# Accounting settings
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbserver.example.com
AccountingStoragePort=6819
AccountingStorageEnforce=associations,limits,qos,safe

# Job completion logging
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# Process tracking (required for accurate accounting)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

AccountingStorageEnforce options:

  • associations — Users must have valid account association to submit jobs
  • limits — Enforce resource limits set on accounts/users
  • qos — Enforce Quality of Service settings
  • safe — Only allow jobs that can run within limits
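
One gotcha: slurm.conf must be identical on every node in the cluster, so push the updated file to the compute nodes before restarting anything (host list is a placeholder):

# Copy the updated slurm.conf to all nodes
for host in node01 node02 node03; do
    scp /etc/slurm/slurm.conf "$host":/etc/slurm/slurm.conf
done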

Step 8: Open Firewall for slurmdbd

On the database server, allow connections from the controller:

# UFW
sudo ufw allow from 192.168.1.0/24 to any port 6819

# Or firewalld
sudo firewall-cmd --permanent --add-port=6819/tcp
sudo firewall-cmd --reload

Step 9: Restart slurmctld

On the controller:

sudo systemctl restart slurmctld

# Check it connected to slurmdbd
sudo tail -f /var/log/slurm/slurmctld.log

Look for:

slurmctld: accounting_storage/slurmdbd: init: AccountingStorageHost=dbserver.example.com:6819
slurmctld: accounting_storage/slurmdbd: init: Database connection established

Step 10: Create Cluster in Database

Register your cluster with the accounting database:

sudo sacctmgr add cluster mycluster
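
The name you register here must match ClusterName in slurm.conf on the controller:

grep -i '^ClusterName' /etc/slurm/slurm.conf
# Should print: ClusterName=mycluster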

Verify:

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
 mycluster  controller.ex.         6817  9728         1                                                                                           normal

Step 11: Create Accounts

Create your account hierarchy:

# Create parent account (organisation)
sudo sacctmgr add account science Description="Science Division" Organization="MyOrg"

# Create department accounts under science
sudo sacctmgr add account physics Description="Physics Department" Organization="MyOrg" Parent=science
sudo sacctmgr add account chemistry Description="Chemistry Department" Organization="MyOrg" Parent=science
sudo sacctmgr add account biology Description="Biology Department" Organization="MyOrg" Parent=science

# Create standalone accounts
sudo sacctmgr add account ai Description="AI Research" Organization="MyOrg"
sudo sacctmgr add account engineering Description="Engineering" Organization="MyOrg"

View account hierarchy:

sacctmgr show account -s
   Account                Descr                  Org
---------- -------------------- --------------------
   science       Science Division                MyOrg
    physics    Physics Department                MyOrg
  chemistry  Chemistry Department                MyOrg
    biology    Biology Department                MyOrg
        ai          AI Research                MyOrg
engineering          Engineering                MyOrg

Step 12: Add Users to Accounts

# Add users to accounts
sudo sacctmgr add user jsmith Account=physics
sudo sacctmgr add user kwilson Account=ai
sudo sacctmgr add user pjones Account=chemistry

# User can belong to multiple accounts
sudo sacctmgr add user jsmith Account=ai

# Set default account for user
sudo sacctmgr modify user jsmith set DefaultAccount=physics
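
For more than a handful of users, a loop over a file saves typing. A sketch, assuming a hypothetical users.txt with one "username account" pair per line:

# -i (immediate) skips sacctmgr's commit prompt
while read -r user account; do
    sudo sacctmgr -i add user "$user" Account="$account"
done < users.txt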

View user associations:

sacctmgr show assoc format=Cluster,Account,User,Partition,Share,MaxJobs,MaxCPUs
   Cluster    Account       User  Partition     Share  MaxJobs  MaxCPUs
---------- ---------- ---------- ---------- --------- -------- --------
 mycluster    physics     jsmith                    1
 mycluster         ai     jsmith                    1
 mycluster         ai    kwilson                    1
 mycluster  chemistry     pjones                    1

Step 13: Set Resource Limits

Apply limits at account or user level:

# Cap each job in the physics account at 500 CPUs, allow 50 running jobs at once
sudo sacctmgr modify account physics set MaxCPUs=500 MaxJobs=50
# (MaxCPUs is a per-job cap; use GrpTRES=cpu=500 to cap the account's total)

# Limit specific user
sudo sacctmgr modify user jsmith set MaxCPUs=100 MaxJobs=10

# Limit by partition (this modifies an existing partition association;
# create one first with: sacctmgr add user jsmith Account=physics Partition=gpu)
sudo sacctmgr modify user where name=jsmith partition=gpu set MaxCPUs=32 MaxJobs=2

View limits:

sacctmgr show assoc format=Cluster,Account,User,Partition,MaxJobs,MaxCPUs,MaxNodes
   Cluster    Account       User  Partition  MaxJobs  MaxCPUs MaxNodes
---------- ---------- ---------- ---------- -------- -------- --------
 mycluster    physics                              50      500
 mycluster    physics     jsmith                   10      100
 mycluster    physics     jsmith        gpu         2       32

Step 14: Configure Fairshare

Fairshare adjusts job priority based on historical usage. Heavy users get lower priority.

# Set shares (relative weight) for accounts
sudo sacctmgr modify account physics set Fairshare=100
sudo sacctmgr modify account chemistry set Fairshare=100
sudo sacctmgr modify account ai set Fairshare=200  # AI gets double weight

Enable fairshare in slurm.conf on the controller:

# Priority settings
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightPartition=1000
PriorityWeightJobSize=500
PriorityDecayHalfLife=7-0
PriorityUsageResetPeriod=MONTHLY

Restart slurmctld after changes:

sudo systemctl restart slurmctld
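
Once jobs start flowing, sshare shows the fairshare tree with each association's shares and effective usage:

# -a includes all users; -l adds raw usage columns
sshare -a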

Step 15: Verify Everything Works

Test job submission with accounting:

# Submit job with account
sbatch --account=physics --job-name=test --wrap="sleep 60"

# Check it's tracked
squeue
sacct -j JOBID

Check database connectivity:

# From controller
sacctmgr show cluster
sacctmgr show account
sacctmgr show assoc

Verify accounting is enforced:

# Try submitting without valid account (should fail if enforce=associations)
sbatch --account=nonexistent --wrap="hostname"
# Expected: error: Unable to allocate resources: Invalid account

Check usage reports:

sreport cluster utilization
sreport user topusage start=2026-01-01
sreport cluster accountutilizationbyuser start=2026-01-01

Useful sacctmgr Commands

Command                                     Purpose
------------------------------------------  ---------------------------------------------
sacctmgr show cluster                       List registered clusters
sacctmgr show account                       List all accounts
sacctmgr show account -s                    Show account hierarchy
sacctmgr show user                          List all users
sacctmgr show assoc                         Show all associations (user-account mappings)
sacctmgr add account NAME                   Create new account
sacctmgr add user NAME Account=X            Add user to account
sacctmgr modify account X set MaxCPUs=Y     Set account limits
sacctmgr modify user X set MaxJobs=Y        Set user limits
sacctmgr delete user NAME Account=X         Remove user from account
sacctmgr delete account NAME                Delete account

Troubleshooting

slurmdbd won’t start

# Check logs
sudo tail -100 /var/log/slurm/slurmdbd.log

# Common issues:
# - Wrong database credentials in slurmdbd.conf
# - MySQL not running
# - Permissions on slurmdbd.conf (must be 600, owned by slurm)
# - Munge not running

slurmctld can’t connect to slurmdbd

# Test connectivity
telnet dbserver.example.com 6819

# Check firewall
sudo ufw status
sudo firewall-cmd --list-all

# Verify slurmdbd is listening
ss -tlnp | grep 6819

Jobs not being tracked

# Verify accounting is enabled
scontrol show config | grep AccountingStorage

# Should show:
# AccountingStorageType = accounting_storage/slurmdbd

# Check association exists for user
sacctmgr show assoc user=jsmith

Database connection errors

# Test MySQL connection from slurmdbd host
mysql -h localhost -u slurm -p slurm_acct_db

# Check MySQL is accepting connections
sudo systemctl status mariadb
sudo tail -100 /var/log/mysql/error.log

My Thoughts

Setting up SLURM accounting properly from the start saves headaches later. Once it’s running, you get automatic tracking of every job, fair scheduling between groups, and the data you need for billing and capacity planning.

Key points to remember:

  • Keep the database separate from the controller in production
  • slurmdbd is the middleman — controller never hits the database directly
  • Compute nodes don’t need database access; they just run jobs
  • Set up your account hierarchy before adding users
  • Use AccountingStorageEnforce to make accounting mandatory
  • Fairshare prevents any single group from hogging the cluster

The database is your audit trail. It tracks everything, so when someone asks “why is my job slow” or “how much did we use last month”, you have the answers.
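
For the "how much did we use last month" question specifically, a one-liner using GNU date:

# First of last month through first of this month
sreport cluster utilization \
    start=$(date -d "$(date +%Y-%m-01) -1 month" +%Y-%m-01) \
    end=$(date +%Y-%m-01)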
