{"id":2024,"date":"2025-07-15T14:46:32","date_gmt":"2025-07-15T14:46:32","guid":{"rendered":"https:\/\/www.nicktailor.com\/?p=2024"},"modified":"2025-07-15T14:49:23","modified_gmt":"2025-07-15T14:49:23","slug":"deploying-slurm-with-slinky-bridging-hpc-and-kubernetes-for-container-workloads","status":"publish","type":"post","link":"https:\/\/nicktailor.com\/tech-blog\/deploying-slurm-with-slinky-bridging-hpc-and-kubernetes-for-container-workloads\/","title":{"rendered":"Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads"},"content":{"rendered":"\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset=\"UTF-8\"\/>\n    <title>Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads<\/title>\n    <style>\n        .code-block {\n            background-color: #f5f5f5;\n            border: 1px solid #ddd;\n            border-radius: 4px;\n            padding: 15px;\n            margin: 15px 0;\n            font-family: 'Courier New', Courier, monospace;\n            font-size: 14px;\n            overflow-x: auto;\n            white-space: pre-wrap;\n        }\n        .inline-code {\n            background-color: #f5f5f5;\n            padding: 2px 4px;\n            border-radius: 3px;\n            font-family: 'Courier New', Courier, monospace;\n            font-size: 14px;\n        }\n        body {\n            font-family: Arial, sans-serif;\n            line-height: 1.6;\n            max-width: 800px;\n            margin: 0 auto;\n            padding: 20px;\n        }\n        h1, h2, h3 {\n            color: #333;\n        }\n        .highlight {\n            background-color: #fff3cd;\n            padding: 10px;\n            border-left: 4px solid #ffc107;\n            margin: 15px 0;\n        }\n    <\/style>\n<\/head>\n<body>\n\n<p>High-Performance Computing (HPC) environments are evolving rapidly, and the need to integrate traditional HPC job schedulers with modern containerized infrastructure has never been 
greater. Enter <strong>Slinky<\/strong> \u2013 SchedMD&#8217;s official project that seamlessly integrates SLURM with Kubernetes, enabling you to run containerized workloads through SLURM&#8217;s powerful scheduling capabilities.<\/p>\n\n<p>In this comprehensive guide, we&#8217;ll walk through deploying SLURM using Slinky with Docker container support, bringing together the best of both HPC and cloud-native worlds.<\/p>\n\n<h2>What is Slinky?<\/h2>\n\n<p>Slinky is a toolbox of components developed by SchedMD (the creators of SLURM) to integrate SLURM with Kubernetes. Unlike traditional approaches that force users to change how they interact with SLURM, Slinky preserves the familiar SLURM user experience while adding powerful container orchestration capabilities.<\/p>\n\n<p><strong>Key Components:<\/strong><\/p>\n<ul>\n    <li><strong>Slurm Operator<\/strong> &#8211; Manages SLURM clusters as Kubernetes resources<\/li>\n    <li><strong>Container Support<\/strong> &#8211; Native OCI container execution through SLURM<\/li>\n    <li><strong>Auto-scaling<\/strong> &#8211; Dynamic resource allocation based on workload demand<\/li>\n    <li><strong>Slurm Bridge<\/strong> &#8211; Converged workload scheduling and prioritization<\/li>\n<\/ul>\n\n<div class=\"highlight\">\n<strong>Why Slinky Matters:<\/strong> Slinky enables simultaneous management of HPC workloads using SLURM and containerized applications via Kubernetes on the same infrastructure, making it ideal for organizations running AI\/ML training, scientific simulations, and cloud-native applications.\n<\/div>\n\n<h2>Prerequisites and Environment Setup<\/h2>\n\n<p>Before we begin, ensure you have a working Kubernetes cluster with the following requirements:<\/p>\n\n<ul>\n    <li>Kubernetes 1.24+ cluster with admin access<\/li>\n    <li>Helm 3.x installed<\/li>\n    <li>kubectl configured and connected to your cluster<\/li>\n    <li>Sufficient cluster resources (minimum 4 CPU cores, 8GB 
RAM)<\/li>\n<\/ul>\n\n<h3>Step 1: Install Required Dependencies<\/h3>\n\n<p>Slinky requires several prerequisite components. Let&#8217;s install them using Helm:<\/p>\n\n<pre class=\"code-block\"># Add required Helm repositories\nhelm repo add prometheus-community https:\/\/prometheus-community.github.io\/helm-charts\nhelm repo add metrics-server https:\/\/kubernetes-sigs.github.io\/metrics-server\/\nhelm repo add bitnami https:\/\/charts.bitnami.com\/bitnami\nhelm repo add jetstack https:\/\/charts.jetstack.io\nhelm repo update\n\n# Install cert-manager for TLS certificate management\nhelm install cert-manager jetstack\/cert-manager \\\n  --namespace cert-manager --create-namespace --set crds.enabled=true\n\n# Install Prometheus stack for monitoring\nhelm install prometheus prometheus-community\/kube-prometheus-stack \\\n  --namespace prometheus --create-namespace --set installCRDs=true\n<\/pre>\n\n<p>Wait for all pods to be running before proceeding:<\/p>\n\n<pre class=\"code-block\"># Verify installations\nkubectl get pods -n cert-manager\nkubectl get pods -n prometheus\n<\/pre>\n\n<h2>Step 2: Deploy the Slinky SLURM Operator<\/h2>\n\n<p>Now we&#8217;ll install the core Slinky operator that manages SLURM clusters within Kubernetes:<\/p>\n\n<pre class=\"code-block\"># Download the default configuration\ncurl -L https:\/\/raw.githubusercontent.com\/SlinkyProject\/slurm-operator\/refs\/tags\/v0.2.1\/helm\/slurm-operator\/values.yaml \\\n  -o values-operator.yaml\n\n# Install the Slurm Operator\nhelm install slurm-operator oci:\/\/ghcr.io\/slinkyproject\/charts\/slurm-operator \\\n  --values=values-operator.yaml --version=0.2.1 \\\n  --namespace=slinky --create-namespace\n<\/pre>\n\n<p>Verify the operator is running:<\/p>\n\n<pre class=\"code-block\">kubectl get pods -n slinky\n# Expected output: slurm-operator pod in Running status\n<\/pre>\n\n<h2>Step 3: Configure Container Support<\/h2>\n\n<p>Before deploying the SLURM cluster, let&#8217;s configure it for 
container support. Download and modify the SLURM configuration:<\/p>\n\n<pre class=\"code-block\"># Download SLURM cluster configuration\ncurl -L https:\/\/raw.githubusercontent.com\/SlinkyProject\/slurm-operator\/refs\/tags\/v0.2.1\/helm\/slurm\/values.yaml \\\n  -o values-slurm.yaml\n<\/pre>\n\n<p>Edit <span class=\"inline-code\">values-slurm.yaml<\/span> to enable container support:<\/p>\n\n<pre class=\"code-block\"># Add container configuration to values-slurm.yaml\ncontroller:\n  config:\n    slurm.conf: |\n      # Basic cluster configuration\n      ClusterName=slinky-cluster\n      SlurmctldHost=slurm-controller-0\n      \n      # Enable container support\n      ProctrackType=proctrack\/cgroup\n      TaskPlugin=task\/cgroup,task\/affinity\n      PluginDir=\/usr\/lib64\/slurm\n      \n      # Authentication\n      AuthType=auth\/munge\n      \n      # Node configuration\n      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN\n      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP\n      \n      # Accounting\n      AccountingStorageType=accounting_storage\/slurmdbd\n      AccountingStorageHost=slurm-accounting-0\n\ncompute:\n  config:\n    oci.conf: |\n      # OCI container runtime configuration (runc create\/start pattern)\n      RunTimeQuery=\"runc state %n.%u\"\n      RunTimeCreate=\"runc create --bundle %b %n.%u\"\n      RunTimeStart=\"runc start %n.%u\"\n      RunTimeKill=\"runc kill --all %n.%u SIGTERM\"\n      RunTimeDelete=\"runc delete --force %n.%u\"\n      \n      # Security and patterns\n      OCIPattern=\"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$\"\n      CreateEnvFile=newline\n      RunTimeEnvExclude=\"^(HOME|PATH|LD_LIBRARY_PATH)=\"\n<\/pre>\n\n<h2>Step 4: Deploy the SLURM Cluster<\/h2>\n\n<p>Now deploy the SLURM cluster with container support enabled:<\/p>\n\n<pre class=\"code-block\"># Deploy SLURM cluster\nhelm install slurm 
oci:\/\/ghcr.io\/slinkyproject\/charts\/slurm \\\n  --values=values-slurm.yaml --version=0.2.1 \\\n  --namespace=slurm --create-namespace\n<\/pre>\n\n<p>Monitor the deployment progress:<\/p>\n\n<pre class=\"code-block\"># Watch pods come online\nkubectl get pods -n slurm -w\n\n# Expected pods:\n# slurm-accounting-0      1\/1     Running\n# slurm-compute-debug-0   1\/1     Running  \n# slurm-controller-0      2\/2     Running\n# slurm-exporter-xxx      1\/1     Running\n# slurm-login-xxx         1\/1     Running\n# slurm-mariadb-0         1\/1     Running\n# slurm-restapi-xxx       1\/1     Running\n<\/pre>\n\n<h2>Step 5: Access and Test the SLURM Cluster<\/h2>\n\n<p>Once all pods are running, connect to the SLURM login node:<\/p>\n\n<pre class=\"code-block\"># Get login node IP address\nSLURM_LOGIN_IP=\"$(kubectl get services -n slurm -l app.kubernetes.io\/instance=slurm,app.kubernetes.io\/name=login -o jsonpath=\"{.items[0].status.loadBalancer.ingress[0].ip}\")\"\n\n# SSH to login node (default port 2222)\nssh -p 2222 root@${SLURM_LOGIN_IP}\n<\/pre>\n\n<p>If you don&#8217;t have LoadBalancer support, use port-forwarding:<\/p>\n\n<pre class=\"code-block\"># Port forward to login pod\nkubectl port-forward -n slurm service\/slurm-login 2222:2222\n\n# Connect via localhost\nssh -p 2222 root@localhost\n<\/pre>\n\n<h2>Step 6: Running Container Jobs<\/h2>\n\n<p>Now for the exciting part \u2013 running containerized workloads through SLURM!<\/p>\n\n<h3>Basic Container Job<\/h3>\n\n<p>Create a simple container job script:<\/p>\n\n<pre class=\"code-block\"># Create a container job script\ncat > container_test.sh << EOF\n#!\/bin\/bash\n#SBATCH --job-name=container-hello\n#SBATCH --ntasks=1\n#SBATCH --time=00:05:00\n#SBATCH --container=docker:\/\/alpine:latest\n\necho \"Hello from containerized SLURM job!\"\necho \"Running on node: \\$(hostname)\"\necho \"Job ID: \\$SLURM_JOB_ID\"\necho \"Container OS: \\$(cat \/etc\/os-release | grep PRETTY_NAME)\"\nEOF\n\n# Submit the 
job\nsbatch container_test.sh\n\n# Check job status\nsqueue\n<\/pre>\n\n<h3>Interactive Container Sessions<\/h3>\n\n<p>Run containers interactively using <span class=\"inline-code\">srun<\/span>:<\/p>\n\n<pre class=\"code-block\"># Interactive Ubuntu container\nsrun --container=docker:\/\/ubuntu:20.04 \/bin\/bash\n\n# Quick command in Alpine container\nsrun --container=docker:\/\/alpine:latest \/bin\/sh -c \"echo 'Container execution successful'; uname -a\"\n\n# Python container\nsrun --container=docker:\/\/python:3.9 python -c \"import sys; print(f'Python {sys.version} running in container')\"\n<\/pre>\n\n<h3>GPU Container Jobs<\/h3>\n\n<p>If your cluster has GPU nodes, you can run GPU-accelerated containers:<\/p>\n\n<pre class=\"code-block\"># GPU container job\ncat > gpu_container.sh << EOF\n#!\/bin\/bash\n#SBATCH --job-name=gpu-test\n#SBATCH --gres=gpu:1\n#SBATCH --container=docker:\/\/nvidia\/cuda:11.0.3-devel-ubuntu20.04\n\nnvidia-smi\n# nvcc ships only in -devel images; -runtime images omit the CUDA toolchain\nnvcc --version\nEOF\n\nsbatch gpu_container.sh\n<\/pre>\n\n<h3>MPI Container Jobs<\/h3>\n\n<p>Run parallel MPI applications in containers \u2013 substitute an OpenMPI image that exists in your registry for the placeholder below:<\/p>\n\n<pre class=\"code-block\"># MPI container job\ncat > mpi_container.sh << EOF\n#!\/bin\/bash\n#SBATCH --job-name=mpi-test\n#SBATCH --ntasks=4\n#SBATCH --container=docker:\/\/mpirun\/openmpi:latest\n\nmpirun -np \\$SLURM_NTASKS hostname\nEOF\n\nsbatch mpi_container.sh\n<\/pre>\n\n<h2>Step 7: Monitoring and Auto-scaling<\/h2>\n\n<h3>Monitor Cluster Health<\/h3>\n\n<p>Check SLURM cluster status from the login node:<\/p>\n\n<pre class=\"code-block\"># Check node status\nsinfo\n\n# Check running jobs\nsqueue\n\n# Check cluster configuration\nscontrol show config | grep -i container\n<\/pre>\n\n<h3>Kubernetes Monitoring<\/h3>\n\n<p>Monitor from the Kubernetes side:<\/p>\n\n<pre class=\"code-block\"># Check pod resource usage\nkubectl top pods -n slurm\n\n# View SLURM operator logs\nkubectl logs -n slinky deployment\/slurm-operator\n\n# Check custom resources\nkubectl get 
clusters.slinky.slurm.net -n slurm\nkubectl get nodesets.slinky.slurm.net -n slurm\n<\/pre>\n\n<h3>Configure Auto-scaling<\/h3>\n\n<p>Enable auto-scaling by updating your values file:<\/p>\n\n<pre class=\"code-block\"># Add to values-slurm.yaml\ncompute:\n  autoscaling:\n    enabled: true\n    minReplicas: 1\n    maxReplicas: 10\n    targetCPUUtilizationPercentage: 70\n\n# Update the deployment\nhelm upgrade slurm oci:\/\/ghcr.io\/slinkyproject\/charts\/slurm \\\n  --values=values-slurm.yaml --version=0.2.1 \\\n  --namespace=slurm\n<\/pre>\n\n<h2>Advanced Configuration Tips<\/h2>\n\n<h3>Custom Container Runtimes<\/h3>\n\n<p>Configure alternative container runtimes like Podman:<\/p>\n\n<pre class=\"code-block\"># Alternative oci.conf for Podman\ncompute:\n  config:\n    oci.conf: |\n      # Podman runtime configuration\n      RunTimeQuery=\"podman --version\"\n      RunTimeRun=\"podman run --rm --cgroups=disabled --name=%n.%u %m %c\"\n      \n      # Security settings\n      OCIPattern=\"^[a-zA-Z0-9][a-zA-Z0-9_.-]*$\"\n      CreateEnvFile=\"\/tmp\/slurm-oci-create-env-%j.%u.%t.tmp\"\n<\/pre>\n\n<h3>Persistent Storage for Containers<\/h3>\n\n<p>Configure persistent volumes for containerized jobs:<\/p>\n\n<pre class=\"code-block\"># Add persistent volume support\ncompute:\n  persistence:\n    enabled: true\n    storageClass: \"fast-ssd\"\n    size: \"100Gi\"\n    mountPath: \"\/shared\"\n<\/pre>\n\n<h2>Troubleshooting Common Issues<\/h2>\n\n<h3>Container Runtime Not Found<\/h3>\n\n<p>If you encounter container runtime errors:<\/p>\n\n<pre class=\"code-block\"># Check runtime availability on compute nodes\nkubectl exec -n slurm slurm-compute-debug-0 -- which runc\nkubectl exec -n slurm slurm-compute-debug-0 -- runc --version\n\n# Verify oci.conf is properly mounted\nkubectl exec -n slurm slurm-compute-debug-0 -- cat \/etc\/slurm\/oci.conf\n<\/pre>\n\n<h3>Job Submission Failures<\/h3>\n\n<p>Debug job submission issues:<\/p>\n\n<pre class=\"code-block\"># Check SLURM 
logs\nkubectl logs -n slurm slurm-controller-0 -c slurmctld\n\n# Verify container image availability\nsrun --container=docker:\/\/alpine:latest \/bin\/echo \"Container test\"\n\n# Check job details\nscontrol show job &lt;job_id&gt;\n<\/pre>\n\n<h2>Conclusion<\/h2>\n\n<p>Slinky represents a significant step forward in bridging the gap between traditional HPC and modern cloud-native infrastructure. By deploying SLURM with Slinky, you get:<\/p>\n\n<ul>\n    <li><strong>Unified Infrastructure<\/strong> - Run both SLURM and Kubernetes workloads on the same cluster<\/li>\n    <li><strong>Container Support<\/strong> - Native OCI container execution through familiar SLURM commands<\/li>\n    <li><strong>Auto-scaling<\/strong> - Dynamic resource allocation based on workload demand<\/li>\n    <li><strong>Cloud Native<\/strong> - Standard Kubernetes deployment and management patterns<\/li>\n    <li><strong>Preserved Workflow<\/strong> - Keep existing SLURM scripts and user experience<\/li>\n<\/ul>\n\n<p>This powerful combination enables organizations to modernize their HPC infrastructure while maintaining the robust scheduling and resource management capabilities that SLURM is known for. Whether you&#8217;re running AI\/ML training workloads, scientific simulations, or data processing pipelines, Slinky provides the flexibility to containerize your applications without sacrificing the control and efficiency of SLURM.<\/p>\n\n<div class=\"highlight\">\n<strong>Next Steps:<\/strong> Consider exploring Slinky&#8217;s advanced features like custom schedulers, resource quotas, and integration with cloud provider auto-scaling groups to further optimize your HPC container workloads.\n<\/div>\n\n<p><em>Ready to get started? The Slinky project is open-source and available on GitHub. 
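<\/em><\/p>\n\n<p>One more avenue worth knowing about: the <span class=\"inline-code\">slurm-restapi<\/span> pod deployed in Step 4 exposes slurmrestd, so jobs can be submitted over HTTP as well as with <span class=\"inline-code\">sbatch<\/span>. The following is a minimal sketch, not a drop-in recipe \u2013 the service DNS name and port, the <span class=\"inline-code\">v0.0.39<\/span> endpoint version, and JWT authentication via <span class=\"inline-code\">SLURM_JWT<\/span> are assumptions to check against your deployment.<\/p>

```shell
#!/bin/bash
# Sketch only: submitting the Step 6 container job through the slurm-restapi
# (slurmrestd) service instead of sbatch. The service URL, the v0.0.39 API
# version, and SLURM_JWT auth are assumptions -- confirm with
# `kubectl get svc -n slurm` and your SLURM release documentation.

RESTD_URL="${RESTD_URL:-http://slurm-restapi.slurm.svc.cluster.local:6820}"

# slurmrestd takes the batch script as a JSON string; the \n sequences are
# JSON escapes for the newlines inside the script.
PAYLOAD='{
  "job": {
    "name": "container-hello-rest",
    "current_working_directory": "/tmp",
    "environment": ["PATH=/usr/bin:/bin"]
  },
  "script": "#!/bin/bash\n#SBATCH --container=docker://alpine:latest\necho Hello from a REST-submitted container job"
}'

echo "$PAYLOAD"

# With a token from `scontrol token` exported as SLURM_JWT, the actual
# submission would look like this (left commented so the sketch is inert):
# curl -s -X POST "$RESTD_URL/slurm/v0.0.39/job/submit" \
#   -H "X-SLURM-USER-NAME: $USER" \
#   -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

<p>Generate a token with <span class=\"inline-code\">scontrol token<\/span> on the login node, export it as <span class=\"inline-code\">SLURM_JWT<\/span>, and uncomment the <span class=\"inline-code\">curl<\/span> call to submit for real.<\/p>\n\n<p><em>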
Visit the <a href=\"https:\/\/github.com\/SlinkyProject\">SlinkyProject GitHub organization<\/a> for the latest documentation and releases.<\/em><\/p>\n\n<\/body>\n<\/html>\n","protected":false},"excerpt":{"rendered":"<p>Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads High-Performance Computing (HPC) environments are evolving rapidly, and the need to integrate traditional HPC job schedulers with modern containerized infrastructure has never been greater. Enter Slinky \u2013 SchedMD&#8217;s official project that seamlessly integrates SLURM with Kubernetes, enabling you to run containerized workloads through SLURM&#8217;s powerful scheduling capabilities. In this<a href=\"https:\/\/nicktailor.com\/tech-blog\/deploying-slurm-with-slinky-bridging-hpc-and-kubernetes-for-container-workloads\/\" class=\"read-more\">Read More &#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[143],"tags":[],"class_list":["post-2024","post","type-post","status-publish","format-standard","hentry","category-hpc"],"_links":{"self":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2024","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/comments?post=2024"}],"version-history":[{"count":1,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2024\/revisions"}],"predecessor-version":[{"id":2025,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/posts\/2024\/revisions\/2025"}],"wp:attachment":[{"href":"https:\/\/nicktailor.com\
/tech-blog\/wp-json\/wp\/v2\/media?parent=2024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/categories?post=2024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nicktailor.com\/tech-blog\/wp-json\/wp\/v2\/tags?post=2024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}