How to configure Slurm Controller Node on Ubuntu 22.04
How to Set Up an HPC Slurm Controller Node
Refer to Key Components for HPC Cluster Setup to see which pieces you need to set up.
This guide provides step-by-step instructions for setting up the Slurm controller daemon (`slurmctld`) on Ubuntu 22.04. It also includes common errors encountered during the setup process and how to resolve them.
Step 1: Install Prerequisites
To begin, install the required dependencies for Slurm and its components:
sudo apt update && sudo apt upgrade -y
sudo apt install -y munge libmunge-dev libmunge2 build-essential man-db mariadb-server mariadb-client libmariadb-dev python3 python3-pip chrony
Step 2: Configure Munge (Authentication for Slurm)
Munge is required for authentication within the Slurm cluster.
1. Generate a Munge key on the controller node:
sudo create-munge-key
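2. Copy the key to all compute nodes. The key is normally readable only by root and the `munge` user, so the copy needs sufficient privileges on both ends, and the file should stay owned by `munge` with mode 400 on each node: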
scp /etc/munge/munge.key user@node:/etc/munge/
3. Start the Munge service:
sudo systemctl enable --now munge
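Before starting Munge on the other nodes, it can help to confirm that the key file has restrictive permissions, since `munged` rejects a key that group or other users can read. The following is a minimal sketch and not part of the original setup; the function name `check_munge_key` is made up for illustration, and it assumes GNU `stat`.

```shell
# check_munge_key PATH
# Succeeds only if PATH exists, is non-empty, and has mode 400 or 600.
# (Illustrative helper, not a Munge command.)
check_munge_key() {
    key="$1"
    [ -s "$key" ] || { echo "missing or empty: $key"; return 1; }
    mode=$(stat -c '%a' "$key") || return 1
    case "$mode" in
        400|600) echo "ok: $key has mode $mode" ;;
        *) echo "bad mode $mode on $key (expected 400 or 600)"; return 1 ;;
    esac
}
```

Run it as root against `/etc/munge/munge.key` on each node. Once the service is up, `munge -n | unmunge` checks credentials locally, and `munge -n | ssh user@node unmunge` checks that a compute node accepts credentials created on the controller.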
Step 3: Install Slurm
1. Download and compile Slurm:
wget https://download.schedmd.com/slurm/slurm-23.02.4.tar.bz2
tar -xvjf slurm-23.02.4.tar.bz2
cd slurm-23.02.4
./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install
2. Add the Slurm user:
sudo useradd -m slurm
3. Create the necessary directories and set permissions (the `slurm` user must exist before `chown` runs):
sudo mkdir -p /etc/slurm /var/spool/slurm /var/log/slurm
sudo chown slurm: /var/spool/slurm /var/log/slurm
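Because the build installs under `/usr/local/slurm`, commands such as `sinfo` and `scontrol` are not on the default `PATH`. Here is a small sketch of one way to handle that in a shell profile, assuming the `--prefix` used in this guide; the variable name `SLURM_PREFIX` is only illustrative.

```shell
# Assumes the --prefix from the configure step in this guide.
SLURM_PREFIX=/usr/local/slurm

# Prepend Slurm's sbin and bin directories, skipping any already on PATH.
for dir in "$SLURM_PREFIX/sbin" "$SLURM_PREFIX/bin"; do
    case ":$PATH:" in
        *":$dir:"*) ;;
        *) PATH="$dir:$PATH" ;;
    esac
done
export PATH
```

A source build also may not register systemd units on its own. After `./configure`, the build tree typically contains generated unit files such as `etc/slurmctld.service` that can be copied to `/etc/systemd/system/`, but check your own tree before relying on that.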
Step 4: Configure Slurm (for more complex configurations, contact Nick Tailor)
1. Generate a basic `slurm.conf` using the configurator tool at
https://slurm.schedmd.com/configurator.html. Save the configuration to `/etc/slurm/slurm.conf`.
# Basic Slurm Configuration
ClusterName=my_cluster
ControlMachine=slurmctld # Replace with your control node’s hostname
# BackupController=backup-slurmctld # Uncomment and replace if you have a backup controller
# Authentication
AuthType=auth/munge
CryptoType=crypto/munge
# Logging
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=info
SlurmdDebug=info
# Slurm User
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
# Scheduler
SchedulerType=sched/backfill
SchedulerParameters=bf_continue
# Accounting
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/linux
# Compute Nodes
NodeName=node[1-2] CPUs=4 RealMemory=8192 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
2. Distribute `slurm.conf` to all compute nodes:
scp /etc/slurm/slurm.conf user@node:/etc/slurm/
3. Restart the Slurm services (`slurmctld` on the controller, `slurmd` on each compute node):
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
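A quick way to catch hostname or path mismatches is to read values back out of `slurm.conf` and compare them with what the machine reports. The helper below is a hypothetical sketch (`slurm_conf_get` is not a Slurm command). It strips `#` comments and matches keys case-insensitively, and it does not handle `Include` files or line continuations.

```shell
# slurm_conf_get KEY FILE
# Prints the value of the first KEY=value pair found in FILE.
# (Illustrative helper; ignores Include files and line continuations.)
slurm_conf_get() {
    awk -v key="$1" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        {
            for (i = 1; i <= NF; i++) {
                n = index($i, "=")
                if (n && tolower(substr($i, 1, n - 1)) == tolower(key)) {
                    print substr($i, n + 1)
                    exit
                }
            }
        }
    ' "$2"
}
```

For example, `[ "$(slurm_conf_get ControlMachine /etc/slurm/slurm.conf)" = "$(hostname -s)" ]` succeeds on the controller only when the two names agree. On a compute node, `slurmd -C` prints a `NodeName=` line with the detected CPU and memory values, which is useful for checking the `NodeName` entry.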
Troubleshooting Common Errors
root@slrmcltd:~# tail /var/log/slurm/slurmctld.log
[2024-12-06T11:57:25.428] error: High latency for 1000 calls to gettimeofday(): 20012 microseconds
[2024-12-06T11:57:25.431] fatal: mkdir(/var/spool/slurm): Permission denied
[2024-12-06T11:58:34.862] error: High latency for 1000 calls to gettimeofday(): 20029 microseconds
[2024-12-06T11:58:34.864] fatal: mkdir(/var/spool/slurm): Permission denied
[2024-12-06T11:59:38.843] error: High latency for 1000 calls to gettimeofday(): 18842 microseconds
[2024-12-06T11:59:38.847] fatal: mkdir(/var/spool/slurm): Permission denied
Error: Permission Denied for /var/spool/slurm
This error occurs when the `slurm` user (the `SlurmUser` that `slurmctld` runs as) cannot create or write to `/var/spool/slurm`, the `StateSaveLocation` set in `slurm.conf`.
Fix:
sudo mkdir -p /var/spool/slurm
sudo chown -R slurm: /var/spool/slurm
sudo chmod -R 755 /var/spool/slurm
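To confirm the fix took effect, the ownership of the directory can be compared with the `slurm` user's UID. This is a minimal sketch assuming GNU `stat`; `owned_by_uid` is an illustrative name.

```shell
# owned_by_uid DIR UID
# Succeeds only if DIR is a directory owned by the numeric UID.
# (Illustrative helper.)
owned_by_uid() {
    dir="$1"
    want="$2"
    [ -d "$dir" ] || { echo "not a directory: $dir"; return 1; }
    have=$(stat -c '%u' "$dir") || return 1
    if [ "$have" = "$want" ]; then
        echo "ok: $dir is owned by uid $want"
    else
        echo "mismatch: $dir is owned by uid $have, expected $want"
        return 1
    fi
}
```

A typical call would be `owned_by_uid /var/spool/slurm "$(id -u slurm)"`.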
Error: Temporary Failure in Name Resolution
Slurm could not resolve the hostname `slurmctld`. This can be fixed by updating `/etc/hosts`:
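1. Edit `/etc/hosts` and add the controller's entries. The loopback line is only appropriate on the controller itself; on compute nodes, use only the controller's LAN address (192.168.20.8 in this example):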
127.0.0.1 slurmctld
192.168.20.8 slurmctld
2. Verify the hostname matches `ControlMachine` in `/etc/slurm/slurm.conf`.
3. Restart networking and test hostname resolution:
sudo systemctl restart systemd-networkd
ping slurmctld
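If a name appears on more than one line in `/etc/hosts`, lookups may return the loopback address rather than the LAN address. The sketch below lists every address a hosts file maps a name to, so duplicates stand out; `hosts_addrs` is an illustrative helper, not a system tool.

```shell
# hosts_addrs NAME FILE
# Prints each address that FILE (in /etc/hosts format) assigns to NAME.
# (Illustrative helper.)
hosts_addrs() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "$2"
}
```

Run it as `hosts_addrs slurmctld /etc/hosts`. To see what the resolver actually returns for the name, use `getent hosts slurmctld`.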
Error: High Latency for gettimeofday()
Dec 06 11:57:25 slrmcltd.home systemd[1]: Started Slurm controller daemon.
Dec 06 11:57:25 slrmcltd.home slurmctld[2619]: slurmctld: error: High latency for 1000 calls to gettimeofday(): 20012 microseconds
Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
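Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Failed with result 'exit-code'.
This message reports slow clock reads, which usually points to timing or clocksource issues on the system (common on virtual machines). It is logged as an error but is not fatal on its own; in the `slurmctld.log` excerpt shown earlier, the `fatal:` line that stops the daemon is the `mkdir` permission failure.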
Fixes:
1. Install and configure `chrony` for time synchronization:
sudo apt install chrony
sudo systemctl enable --now chrony
chronyc tracking
timedatectl
2. For virtualized environments, optimize the clocksource:
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
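3. Disable high-precision timing in `slurm.conf` (optional). Confirm that `HighPrecisionTimer` appears in the `slurm.conf` man page for your Slurm version before adding it, since `slurmctld` will not start if the file contains an unrecognized parameter: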
HighPrecisionTimer=NO
sudo systemctl restart slurmctld
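Only clocksources listed in `available_clocksource` can be selected, so it is worth checking that list before writing to `current_clocksource`. In this minimal sketch, `clocksource_available` is an illustrative name, and the optional second argument exists only so the function can be pointed at a different directory.

```shell
# clocksource_available NAME [DIR]
# Succeeds if NAME is listed in DIR/available_clocksource.
# DIR defaults to the standard sysfs location. (Illustrative helper.)
clocksource_available() {
    want="$1"
    dir="${2:-/sys/devices/system/clocksource/clocksource0}"
    [ -r "$dir/available_clocksource" ] || return 1
    for cs in $(cat "$dir/available_clocksource"); do
        [ "$cs" = "$want" ] && return 0
    done
    return 1
}
```

One way to use it: `clocksource_available tsc && echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource`. The current value can be read back with `cat /sys/devices/system/clocksource/clocksource0/current_clocksource`.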
Step 5: Verify and Test the Setup
1. Validate the configuration:
scontrol reconfigure
If the command returns without errors, the configuration loaded successfully. If it fails, check the connection between the nodes and make sure `/etc/hosts` on every machine lists all hosts in the cluster.
2. Check node and partition status:
sinfo
root@slrmcltd:/etc/slurm# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* node1
The `*` after `debug` marks the default partition. A `*` after the node state (as in `idle*`) means the node is not currently responding to the controller, so check that `slurmd` is running on it and that it can reach the controller.
3. Monitor logs for errors:
sudo tail -f /var/log/slurm/slurmctld.log
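When the log is noisy, it can be easier to reduce it to the distinct `fatal:` messages, since those are the ones that stop the daemon. This small sketch uses `sed` and `sort`; `slurm_fatal_summary` is an illustrative name, and the pattern assumes the bracketed timestamp format shown in the log excerpts in this guide.

```shell
# slurm_fatal_summary FILE
# Prints each distinct fatal message in FILE once, without its timestamp.
# (Illustrative helper; assumes lines start with "[timestamp] ".)
slurm_fatal_summary() {
    sed -n 's/^\[[^]]*] fatal: //p' "$1" | sort -u
}
```

Run it as `sudo sh` on the controller, or as any user who can read the log: `slurm_fatal_summary /var/log/slurm/slurmctld.log`. For the excerpt in the troubleshooting section, it prints the single line `mkdir(/var/spool/slurm): Permission denied`.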
Written By: Nick Tailor