{"id":1937,"date":"2024-12-06T13:43:55","date_gmt":"2024-12-06T13:43:55","guid":{"rendered":"https:\/\/www.nicktailor.com\/?p=1937"},"modified":"2024-12-06T13:43:56","modified_gmt":"2024-12-06T13:43:56","slug":"key-components-for-setting-up-an-hpc-cluster","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/key-components-for-setting-up-an-hpc-cluster\/","title":{"rendered":"Key Components for Setting Up an HPC Cluster"},"content":{"rendered":"<h2 style=\"margin-top: 10pt; padding-top: 0; margin-bottom: 0pt; padding-bottom: 0; line-height: 1.38; font-weight: bold; color: #4f81bd; font-size: 13pt;\">Head Node (Controller)<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Manages job scheduling and resource allocation.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Runs Slurm Controller Daemon (`slurmctld`).<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Compute Nodes<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"3\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div 
style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Execute computational tasks.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Run Slurm Node Daemon (`slurmd`).<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Configured for CPUs, GPUs, or specialized hardware.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Networking<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"6\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>High-speed interconnect such as InfiniBand or Ethernet.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div 
style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Ensures fast communication between nodes.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Storage<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"8\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Centralized storage like NFS, Lustre, or BeeGFS.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Provides shared file access for all nodes.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Authentication<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"10\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span 
style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Use Munge for secure communication between Slurm components.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Scheduler<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"11\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Slurm for job scheduling and resource management.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Configured with partitions and node definitions.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Resource Management<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"13\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Use cgroups to control CPU, memory, and GPU 
usage.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Optional: set ProctrackType=proctrack\/cgroup in slurm.conf.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Parallel File System (Optional)<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"15\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>High-performance shared storage for parallel workloads.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Examples: Lustre, GPFS.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Interconnect Libraries<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"17\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div 
style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>MPI (Message Passing Interface) for distributed computing.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Install libraries like OpenMPI or MPICH.<\/div>\n<\/div>\n<\/div>\n<h2 style=\"margin-top:10pt;padding-top:0;margin-bottom:0pt;padding-bottom:0;line-height:1.38;font-weight:bold;color:#4F81BD;font-size:13pt;\">Monitoring and Debugging Tools<\/h2>\n<div class=\"ul\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\" start=\"19\">\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-top:0pt;padding-top:0;margin-bottom:0;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Tools like Prometheus, Grafana, or Ganglia for resource monitoring.<\/div>\n<\/div>\n<div class=\"li\" style=\"margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;\">\n<div style=\"margin-bottom:10pt;padding-bottom:0;line-height:1.38;margin-left:18pt;\"><span style=\"display:inline-block;position:relative;text-indent:-18pt;\"><span style=\"position:absolute;top:-0.34em;left:0;font-size:2em;\">\u2022<\/span>&nbsp;<\/span>Enable verbose logging in Slurm for debugging.<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Head Node (Controller) 
\u2022&nbsp;Manages job scheduling and resource allocation. \u2022&nbsp;Runs Slurm Controller Daemon (`slurmctld`). Compute Nodes \u2022&nbsp;Execute computational tasks. \u2022&nbsp;Run Slurm Node Daemon (`slurmd`). \u2022&nbsp;Configured for CPUs, GPUs, or specialized hardware. Networking \u2022&nbsp;High-speed interconnect such as InfiniBand or Ethernet. \u2022&nbsp;Ensures fast communication between nodes. Storage \u2022&nbsp;Centralized storage like NFS, Lustre, or BeeGFS. \u2022&nbsp;Provides shared file access for all nodes. Authentication \u2022&nbsp;Use<a href=\"https:\/\/nicktailor.com\/tech-blog\/key-components-for-setting-up-an-hpc-cluster\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-1937","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1937","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=1937"}],"version-history":[{"count":1,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1937\/revisions"}],"predecessor-version":[{"id":1938,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/1937\/revisions\/1938"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/media?parent=1937"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/categori
es?post=1937"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=1937"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}