{"id":1924,"date":"2024-12-06T12:59:00","date_gmt":"2024-12-06T12:59:00","guid":{"rendered":"https:\/\/www.nicktailor.com\/?p=1924"},"modified":"2024-12-06T13:46:26","modified_gmt":"2024-12-06T13:46:26","slug":"setting-up-slurmctld-on-ubuntu-22-04-with-troubleshooting","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/setting-up-slurmctld-on-ubuntu-22-04-with-troubleshooting\/","title":{"rendered":"How to Set Up an HPC-Slurm Controller Node"},"content":{"rendered":"<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">Refer to <strong><span style=\"color: #00ccff;\"><a style=\"color: #00ccff;\" href=\"https:\/\/www.nicktailor.com\/?p=1937\">Key Components for HPC Cluster Setup<\/a><\/span> for the pieces you need to set up.<\/strong><\/p>\n<p>This guide provides step-by-step instructions for setting up the Slurm controller daemon (`slurmctld`) on <strong>Ubuntu 22.04.<\/strong> It also includes common errors encountered during the setup process and how to resolve them.<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Step 1: Install Prerequisites<\/h2>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">To begin, install the required dependencies for Slurm and its components:<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">sudo apt update &amp;&amp; sudo apt upgrade -y<br \/>\nsudo apt install -y munge libmunge-dev libmunge2 build-essential man-db mariadb-server mariadb-client libmariadb-dev python3 python3-pip chrony<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Step 2: Configure Munge (Authentication for Slurm)<\/h2>\n<p style=\"margin-top: 
0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">Munge is required for authentication within the Slurm cluster.<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">1. Generate a Munge key on the controller node:<br \/>\nsudo create-munge-key<\/p>\n<p>2. Copy the key to all compute nodes:<br \/>\nscp \/etc\/munge\/munge.key user@node:\/etc\/munge\/<\/p>\n<p>3. Start the Munge service:<br \/>\nsudo systemctl enable --now munge<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Step 3: Install Slurm<\/h2>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">1. Download and compile Slurm:<br \/>\nwget https:\/\/download.schedmd.com\/slurm\/slurm-23.02.4.tar.bz2<br \/>\ntar -xvjf slurm-23.02.4.tar.bz2<br \/>\ncd slurm-23.02.4<br \/>\n.\/configure --prefix=\/usr\/local\/slurm --sysconfdir=\/etc\/slurm<br \/>\nmake -j$(nproc)<br \/>\nsudo make install<\/p>\n<p>2. Add the Slurm user (the `chown` in the next step requires it to exist):<br \/>\nsudo useradd -m slurm<\/p>\n<p>3. Create necessary directories and set permissions:<br \/>\nsudo mkdir -p \/etc\/slurm \/var\/spool\/slurm \/var\/log\/slurm<br \/>\nsudo chown slurm: \/var\/spool\/slurm \/var\/log\/slurm<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Step 4: Configure Slurm (for more complex configurations, contact Nick Tailor)<\/h2>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">1. Generate a basic `slurm.conf` using the configurator tool at<br \/>\nhttps:\/\/slurm.schedmd.com\/configurator.html. 
Save the configuration to `\/etc\/slurm\/slurm.conf`.<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Basic Slurm Configuration<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">ClusterName=my_cluster<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">ControlMachine=slurmctld # Replace with your control node&#8217;s hostname<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># BackupController=backup-slurmctld # Uncomment and replace if you have a backup controller<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Authentication<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">AuthType=auth\/munge<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">CryptoType=crypto\/munge<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Logging<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmdLogFile=\/var\/log\/slurm\/slurmd.log<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmctldLogFile=\/var\/log\/slurm\/slurmctld.log<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; 
margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmctldDebug=info<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmdDebug=info<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Slurm User<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmUser=slurm<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">StateSaveLocation=\/var\/spool\/slurm<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SlurmdSpoolDir=\/var\/spool\/slurmd<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Scheduler<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SchedulerType=sched\/backfill<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">SchedulerParameters=bf_continue<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Accounting<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 
1.2;\">AccountingStorageType=accounting_storage\/none<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">JobAcctGatherType=jobacct_gather\/linux<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\"># Compute Nodes<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">NodeName=node[1-2] CPUs=4 RealMemory=8192 State=UNKNOWN<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP<\/p>\n<p>2. Distribute `slurm.conf` to all compute nodes:<br \/>\nscp \/etc\/slurm\/slurm.conf user@node:\/etc\/slurm\/<\/p>\n<p>3. 
Restart Slurm services (`slurmctld` on the controller, `slurmd` on each compute node):<br \/>\nsudo systemctl restart slurmctld<br \/>\nsudo systemctl restart slurmd<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Troubleshooting Common Errors<\/h2>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">root@slrmcltd:~# tail \/var\/log\/slurm\/slurmctld.log<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:57:25.428] error: High latency for 1000 calls to gettimeofday(): 20012 microseconds<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:57:25.431] fatal: mkdir(\/var\/spool\/slurm): Permission denied<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:58:34.862] error: High latency for 1000 calls to gettimeofday(): 20029 microseconds<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:58:34.864] fatal: mkdir(\/var\/spool\/slurm): Permission denied<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:59:38.843] error: High latency for 1000 calls to gettimeofday(): 18842 microseconds<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">[2024-12-06T11:59:38.847] fatal: mkdir(\/var\/spool\/slurm): Permission denied<\/p>\n<h3 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: 
#4f81bd;\">Error: Permission Denied for \/var\/spool\/slurm<\/h3>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">This error occurs when the `slurm` user does not have the correct permissions to access the directory.<\/p>\n<p><span style=\"color: #00b050;\">Fix:<\/span><br \/>\n<span style=\"color: #00b050;\">sudo<\/span> <span style=\"color: #00b050;\">mkdir<\/span><span style=\"color: #00b050;\"> -p \/var\/spool\/<\/span><span style=\"color: #00b050;\">slurm<\/span><br \/>\n<span style=\"color: #00b050;\">sudo<\/span> <span style=\"color: #00b050;\">chown<\/span><span style=\"color: #00b050;\"> -R <\/span><span style=\"color: #00b050;\">slurm<\/span><span style=\"color: #00b050;\">: \/var\/spool\/<\/span><span style=\"color: #00b050;\">slurm<\/span><br \/>\n<span style=\"color: #00b050;\">sudo<\/span> <span style=\"color: #00b050;\">chmod<\/span><span style=\"color: #00b050;\"> -R 755 \/var\/spool\/<\/span><span style=\"color: #00b050;\">slurm<\/span><\/p>\n<h3 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd;\">Error: Temporary Failure in Name Resolution<\/h3>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">Slurm could not resolve the hostname `slurmctld`. This can be fixed by updating `\/etc\/hosts`:<\/p>\n<p><span style=\"color: #00b050; background-color: #ffffff;\">1. Edit `\/etc\/hosts` and add the following:<\/span><span style=\"color: #00b050; background-color: #ffffff;\"><br \/>\n127.0.0.1 <\/span><span style=\"color: #00b050; background-color: #ffffff;\">slurmctld<\/span><span style=\"color: #00b050; background-color: #ffffff;\"><br \/>\n192.168.20.8 <\/span><span style=\"color: #00b050; background-color: #ffffff;\">slurmctld<\/span><\/p>\n<p>2. Verify the hostname matches `ControlMachine` in `\/etc\/slurm\/slurm.conf`.<\/p>\n<p>3. 
Restart networking and test hostname resolution:<br \/>\nsudo systemctl restart systemd-networkd<br \/>\nping slurmctld<\/p>\n<h3 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd;\">Error: High Latency for gettimeofday()<\/h3>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">Dec 06 11:57:25 slrmcltd.home systemd[1]: Started Slurm controller daemon.<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">Dec 06 11:57:25 slrmcltd.home slurmctld[2619]: slurmctld: error: High latency for 1000 calls to gettimeofday(): 20012 microseconds<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Main process exited, code=exited, status=1\/FAILURE<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.2;\">Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Failed with result 'exit-code'.<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">Slurm logs this message as an error, but it is not what stopped the daemon: in the `slurmctld.log` output above, the fatal entry at the same timestamp is the `mkdir` permission failure. The latency message itself typically indicates slow clock reads, which are common on virtual machines.<\/p>\n<p><span style=\"color: #00b050;\">Fixes:<\/span><span style=\"color: #00b050;\"><br \/>\n1. 
Install and configure `<\/span><span style=\"color: #00b050;\">chrony<\/span><span style=\"color: #00b050;\">` for time synchronization:<\/span><span style=\"color: #00b050;\"><br \/>\n<\/span><span style=\"color: #00b050;\">sudo<\/span><span style=\"color: #00b050;\"> apt install <\/span><span style=\"color: #00b050;\">chrony<\/span><span style=\"color: #00b050;\"><br \/>\n<\/span><span style=\"color: #00b050;\">sudo<\/span> <span style=\"color: #00b050;\">systemctl<\/span><span style=\"color: #00b050;\"> enable --now <\/span><span style=\"color: #00b050;\">chrony<\/span><br \/>\n<span style=\"color: #00b050;\">chronyc<\/span><span style=\"color: #00b050;\"> tracking<\/span><span style=\"color: #00b050;\"><br \/>\n<\/span><span style=\"color: #00b050;\">timedatectl<\/span><br \/>\n2. For virtualized environments, optimize the clocksource (only if `tsc` appears in `available_clocksource` in the same directory). The redirect must run as root, so pipe through `sudo tee` rather than using `sudo echo`:<br \/>\necho tsc | sudo tee \/sys\/devices\/system\/clocksource\/clocksource0\/current_clocksource<\/p>\n<p>3. Disable high-precision timing in `slurm.conf` (optional; first confirm that `man slurm.conf` for your Slurm version documents this parameter, since an unrecognized key is a configuration error):<br \/>\nHighPrecisionTimer=NO<br \/>\nsudo systemctl restart slurmctld<\/p>\n<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Step 5: Verify and Test the Setup<\/h2>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">1. Validate the configuration:<br \/>\nscontrol reconfigure<br \/>\nIf this returns no errors, the configuration is working. If it fails, check the connection between nodes and update `\/etc\/hosts` on every machine so that all hosts are listed.<\/p>\n<p>2. 
Check node and partition status:<br \/>\nsinfo<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">root@slrmcltd:\/etc\/slurm# sinfo<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">PARTITION AVAIL TIMELIMIT NODES STATE NODELIST<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">debug* up infinite 1 idle* node1<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">The `*` after `idle` means the node is currently not responding to the controller; it clears once `slurmd` on that node is reachable.<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">3. Monitor logs for errors:<br \/>\nsudo tail -f \/var\/log\/slurm\/slurmctld.log<\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\"><span style=\"display: inline-block; height: 1em;\"><span style=\"display: none;\">.<\/span><\/span><\/p>\n<p style=\"margin-top: 0pt; padding-top: 0; margin-bottom: 10pt; padding-bottom: 0; line-height: 1.38;\">Written By: Nick Tailor<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Refer to Key Components for HPC Cluster Setup for the pieces you need to set up. This guide provides step-by-step instructions for setting up the Slurm controller daemon (`slurmctld`) on Ubuntu 22.04. It also includes common errors encountered during the setup process and how to resolve them. 
Step 1: Install Prerequisites To begin, install the required dependencies for Slurm and its<a href=\"https:\/\/nicktailor.com\/tech-blog\/setting-up-slurmctld-on-ubuntu-22-04-with-troubleshooting\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-1924","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1924","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=1924"}],"version-history":[{"count":9,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1924\/revisions"}],"predecessor-version":[{"id":1941,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1924\/revisions\/1941"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/media?parent=1924"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/categories?post=1924"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=1924"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}