Author: admin

How to configure Slurm Controller Node on Ubuntu 22.04

How to set up an HPC Slurm Controller Node

Refer to Key Components for HPC Cluster Setup to see which pieces you need to set up.

This guide provides step-by-step instructions for setting up the Slurm controller daemon (`slurmctld`) on Ubuntu 22.04. It also includes common errors encountered during the setup process and how to resolve them.

Step 1: Install Prerequisites

To begin, install the required dependencies for Slurm and its components:

sudo apt update && sudo apt upgrade -y
sudo apt install -y munge libmunge-dev libmunge2 build-essential man-db mariadb-server mariadb-client libmariadb-dev python3 python3-pip chrony

Step 2: Configure Munge (Authentication for Slurm)

Munge is required for authentication within the Slurm cluster.

1. Generate a Munge key on the controller node:
sudo create-munge-key

2. Copy the key to all compute nodes:
scp /etc/munge/munge.key user@node:/etc/munge/

3. Start the Munge service:
sudo systemctl enable --now munge
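Before starting munged on the compute nodes, make sure the copied key has the ownership and permissions the daemon expects (a typical setup; adjust if your distribution uses a different munge user):

sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key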

Step 3: Install Slurm

1. Download and compile Slurm:
wget https://download.schedmd.com/slurm/slurm-23.02.4.tar.bz2
tar -xvjf slurm-23.02.4.tar.bz2
cd slurm-23.02.4
./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install

2. Add the Slurm user:
sudo useradd -m slurm

3. Create the necessary directories and set permissions:
sudo mkdir -p /etc/slurm /var/spool/slurm /var/log/slurm
sudo chown slurm: /var/spool/slurm /var/log/slurm
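Note: a source build may not install systemd unit files. If systemctl later reports that it cannot find slurmctld.service or slurmd.service, the Slurm source tree typically generates unit files under its etc/ directory after ./configure; a hedged sketch of installing the controller unit from the build directory:

sudo cp etc/slurmctld.service /etc/systemd/system/
sudo systemctl daemon-reload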

Step 4: Configure Slurm (for more complex configurations, contact Nick Tailor)

1. Generate a basic `slurm.conf` using the configurator tool at
https://slurm.schedmd.com/configurator.html. Save the configuration to `/etc/slurm/slurm.conf`.

# Basic Slurm Configuration
ClusterName=my_cluster
ControlMachine=slurmctld              # Replace with your control node's hostname
# BackupController=backup-slurmctld   # Uncomment and replace if you have a backup controller

# Authentication
AuthType=auth/munge
CryptoType=crypto/munge

# Logging
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=info
SlurmdDebug=info

# Slurm User
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd

# Scheduler
SchedulerType=sched/backfill
SchedulerParameters=bf_continue

# Accounting
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/linux

# Compute Nodes
NodeName=node[1-2] CPUs=4 RealMemory=8192 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

2. Distribute `slurm.conf` to all compute nodes:
scp /etc/slurm/slurm.conf user@node:/etc/slurm/

3. Restart Slurm services:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Troubleshooting Common Errors


root@slrmcltd:~# tail /var/log/slurm/slurmctld.log

[2024-12-06T11:57:25.428] error: High latency for 1000 calls to gettimeofday(): 20012 microseconds

[2024-12-06T11:57:25.431] fatal: mkdir(/var/spool/slurm): Permission denied

[2024-12-06T11:58:34.862] error: High latency for 1000 calls to gettimeofday(): 20029 microseconds

[2024-12-06T11:58:34.864] fatal: mkdir(/var/spool/slurm): Permission denied

[2024-12-06T11:59:38.843] error: High latency for 1000 calls to gettimeofday(): 18842 microseconds

[2024-12-06T11:59:38.847] fatal: mkdir(/var/spool/slurm): Permission denied

Error: Permission Denied for /var/spool/slurm

This error occurs when the `slurm` user does not have the correct permissions to access the directory.

Fix:
sudo mkdir -p /var/spool/slurm
sudo chown -R slurm: /var/spool/slurm
sudo chmod -R 755 /var/spool/slurm

Error: Temporary Failure in Name Resolution

Slurm could not resolve the hostname `slurmctld`. This can be fixed by updating `/etc/hosts`:

1. Edit `/etc/hosts` and add the following:
127.0.0.1      slurmctld
192.168.20.8   slurmctld

2. Verify the hostname matches `ControlMachine` in `/etc/slurm/slurm.conf`.

3. Restart networking and test hostname resolution:
sudo systemctl restart systemd-networkd
ping slurmctld

Error: High Latency for gettimeofday()

Dec 06 11:57:25 slrmcltd.home systemd[1]: Started Slurm controller daemon.

Dec 06 11:57:25 slrmcltd.home slurmctld[2619]: slurmctld: error: High latency for 1000 calls to gettimeofday(): 20012 microseconds

Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE

Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Failed with result ‘exit-code’.

This warning typically indicates timing issues in the system.

Fixes:
1. Install and configure `chrony` for time synchronization:
sudo apt install chrony
sudo systemctl enable --now chrony
chronyc tracking
timedatectl

2. For virtualized environments, optimize the clocksource (note that sudo does not apply to a shell redirect, so use tee):
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource

3. Disable high-precision timing in `slurm.conf` (optional):
HighPrecisionTimer=NO
sudo systemctl restart slurmctld

Step 5: Verify and Test the Setup

1. Validate the configuration:
scontrol reconfigure
No errors means it is working. If this does not work, check the connection between the nodes and update /etc/hosts so that all hosts are listed on every machine and node.

2. Check node and partition status:
sinfo

root@slrmcltd:/etc/slurm# sinfo

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

debug* up infinite 1 idle* node1

3. Monitor logs for errors:
sudo tail -f /var/log/slurm/slurmctld.log


Written By: Nick Tailor

Kubernetes Cheat Sheet

kubectl Context and Configuration

Manage which Kubernetes cluster kubectl communicates with, and configure authentication and namespace defaults.

kubectl config view                               # View merged kubeconfig

# Use multiple kubeconfig files simultaneously
export KUBECONFIG=~/.kube/config:~/.kube/kubconfig2
kubectl config view

# Extract a specific user's password
kubectl config view -o jsonpath='{.users[?(@.name == "e2e")].user.password}'

# List users
kubectl config view -o jsonpath='{.users[*].name}'

# Context management
kubectl config get-contexts                        # List contexts
kubectl config current-context                     # Show active context
kubectl config use-context my-cluster              # Switch context

# Add a cluster entry
kubectl config set-cluster my-cluster

# Set proxy URL for cluster entry
kubectl config set-cluster my-cluster --proxy-url=my-proxy-url

# Add a user with basic authentication
kubectl config set-credentials kubeuser/foo.kubernetes.com \
  --username=kubeuser --password=kubepassword

# Set default namespace for current context
kubectl config set-context --current --namespace=production

# Set a new context with specific namespace and user
kubectl config set-context gce --user=cluster-admin --namespace=foo \
  && kubectl config use-context gce

# Delete a user
kubectl config unset users.foo

Helpful aliases:

# Quickly switch or show context
alias kx='f() { [ "$1" ] && kubectl config use-context $1 || kubectl config current-context ; } ; f'

# Quickly switch or show namespace
alias kn='f() { [ "$1" ] && kubectl config set-context --current --namespace $1 \
  || kubectl config view --minify | grep namespace | cut -d" " -f6 ; } ; f'

kubectl apply (Declarative Management)

kubectl apply is the recommended method for managing resources in production. It creates or updates resources by applying a desired state.

kubectl apply -f ./app.yaml                         # Apply single file
kubectl apply -f ./manifests/                       # Apply directory
kubectl apply -f https://example.com/app.yaml       # Apply from URL

kubectl create deployment nginx --image=nginx       # Quick one-shot deployment

Create multiple manifests via stdin:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-one
spec:
  containers:
  - name: c
    image: busybox
    args: ["sleep", "1000"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-two
spec:
  containers:
  - name: c
    image: busybox
    args: ["sleep", "2000"]
EOF

Create a secret:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  username: $(echo -n "jane" | base64 -w0)
  password: $(echo -n "s33msi4" | base64 -w0)
EOF

Viewing and Finding Resources

kubectl get pods                                   # Pods in namespace
kubectl get pods -A                                # All namespaces
kubectl get pods -o wide                           # Pod node placement
kubectl get deployments                            # Deployments
kubectl get svc                                     # Services
kubectl describe pod my-pod                         # Detailed pod info
kubectl describe node my-node                       # Node details

Sorting:

kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'
kubectl get pv --sort-by=.spec.capacity.storage

Field and label selectors:

kubectl get pods --field-selector=status.phase=Running
kubectl get pods -l app=web
kubectl get nodes --selector='!node-role.kubernetes.io/control-plane'

Retrieve specific fields:

kubectl get configmap myconfig -o jsonpath='{.data.ca\.crt}'
kubectl get secret my-secret -o jsonpath='{.data.username}' | base64 --decode

Updating Resources and Rolling Updates

kubectl set image deployment/web web=nginx:1.25          # Update image
kubectl rollout history deployment/web                    # View history
kubectl rollout undo deployment/web                       # Roll back
kubectl rollout restart deployment/web                    # Rolling restart
kubectl rollout status deployment/web                     # Watch rollout

Patching Resources

kubectl patch node node1 -p '{"spec": {"unschedulable": true}}'

# Strategic merge patch
kubectl patch pod app-pod -p '{
  "spec": {"containers":[{"name":"app","image":"new-image"}]}
}'

# JSON patch
kubectl patch pod app-pod --type=json -p='[
  {"op":"replace","path":"/spec/containers/0/image","value":"new-image"}
]'

Editing Resources

kubectl edit svc/web-service
KUBE_EDITOR="nano" kubectl edit deployment/web

Common fields to change when editing a Service include the type (ClusterIP, NodePort, LoadBalancer, ExternalName), port, targetPort, nodePort, and protocol.

Scaling Resources

kubectl scale deployment/web --replicas=5
kubectl scale -f deployment.yaml --replicas=4

Deleting Resources

kubectl delete -f ./app.yaml
kubectl delete pod my-pod --now
kubectl delete pods,svc -l app=web
kubectl delete pod,svc --all -n test

Interacting With Running Pods

kubectl logs my-pod
kubectl logs -f my-pod
kubectl exec my-pod -- ls /
kubectl exec -it my-pod -- sh
kubectl port-forward svc/web 8080:80

Copying Files to and from Containers

kubectl cp /tmp/localfile my-pod:/tmp/remote
kubectl cp my-pod:/tmp/remote /tmp/localfile

Advanced (using tar):

tar cf - . | kubectl exec -i my-pod -- tar xf - -C /tmp

Interacting With Nodes and Cluster

kubectl cordon node1
kubectl drain node1
kubectl uncordon node1

kubectl top node
kubectl top pod

kubectl cluster-info
kubectl cluster-info dump

Discovering API Resources

kubectl api-resources
kubectl api-resources --namespaced=true
kubectl api-resources -o wide
kubectl api-resources --verbs=list,get

Kubectl Output Formatting

kubectl get pods -o json
kubectl get pods -o yaml
kubectl get pods -o wide
kubectl get pods -o name
kubectl get pods -o jsonpath='{.items[*].metadata.name}'

Custom columns:

kubectl get pods -A -o=custom-columns='IMAGE:spec.containers[*].image'

Kubectl Verbosity and Debugging

  • -v=0 Minimal logs
  • -v=2 Recommended default
  • -v=4 Debug level
  • -v=6+ Full HTTP request inspection

Production-Ready Deployment YAML (Corrected)

Below is a cleaned-up, production-ready Deployment YAML example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: my-namespace
  labels:
    app: nginx
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "300m"
              memory: "256Mi"
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 20
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
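To roll this manifest out and confirm it becomes ready (assuming it is saved as nginx-deployment.yaml and the my-namespace namespace already exists):

kubectl apply -f nginx-deployment.yaml
kubectl -n my-namespace rollout status deployment/nginx-deployment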

Conclusion

This Kubernetes cheat sheet provides a complete quick-reference for daily cluster operations, including context switching, applying manifests, rolling updates, patching, scaling, and debugging. With the included production-ready Deployment YAML and working examples, you can confidently operate Kubernetes clusters and deploy applications using the recommended declarative approach.

Deploying Lustre File System with RDMA, Node Maps, and ACLs

Lustre is the de facto parallel file system for high-performance computing (HPC) clusters, providing extreme scalability, high throughput, and low-latency access across thousands of nodes. This guide walks through a complete deployment of Lustre using RDMA over InfiniBand for performance, along with Node Maps for client access control and ACLs for fine-grained permissions.


1. Understanding the Lustre Architecture

Lustre separates metadata and data services into distinct roles:

  • MGS (Management Server) – Manages Lustre configuration and coordinates cluster services.
  • MDT (Metadata Target) – Stores file system metadata (names, permissions, directories).
  • OST (Object Storage Target) – Stores file data blocks.
  • Clients – Mount and access the Lustre file system for I/O.

The typical architecture looks like this:

+-------------+        +-------------+
|   Client 1  |        |   Client 2  |
| /mnt/lustre |        | /mnt/lustre |
+------+------+        +------+------+
       |                        |
       +--------o2ib RDMA-------+
                |
        +-------+-------+
        |     OSS/OST    |
        |   (Data I/O)   |
        +-------+-------+
                |
        +-------+-------+
        |     MGS/MDT    |
        |  (Metadata)    |
        +---------------+

2. Prerequisites and Environment

Component        Requirements
OS               RHEL / Rocky / AlmaLinux 8.x or higher
Kernel           Built with Lustre and OFED RDMA modules
Network          InfiniBand fabric (Mellanox or compatible)
Lustre Version   2.14 or later
Devices          Separate block devices for MDT, OST(s), and client mount

3. Install Lustre Packages

On MGS, MDT, and OSS Nodes:

dnf install -y lustre kmod-lustre lustre-osd-ldiskfs

On Client Nodes:

dnf install -y lustre-client kmod-lustre-client

4. Configure InfiniBand and RDMA (o2ib)

InfiniBand provides the lowest latency for Lustre communication via RDMA. Configure the o2ib network type for Lustre.

1. Install and verify InfiniBand stack

dnf install -y rdma-core infiniband-diags perftest libibverbs-utils
systemctl enable --now rdma
ibstat

2. Configure IB network

nmcli con add type infiniband ifname ib0 con-name ib0 ip4 10.0.0.1/24
nmcli con up ib0

3. Verify RDMA link

ibv_devinfo
ibv_rc_pingpong -d mlx5_0

4. Configure LNET for o2ib

Create /etc/modprobe.d/lustre.conf with:

options lnet networks="o2ib(ib0)"
modprobe lnet
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
lnetctl net show

Expected output:

net:
  - net type: o2ib
    interfaces:
      0: ib0

5. Format and Mount Lustre Targets

Metadata Server (MGS + MDT)

mkfs.lustre --fsname=lustrefs --mgs --mdt --index=0 /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt

Object Storage Server (OSS)

mkfs.lustre --fsname=lustrefs --ost --index=0 --mgsnode=<MGS>@o2ib /dev/sdc
mount -t lustre /dev/sdc /mnt/ost

Client Node

sudo mkdir -p /mnt/lustre
sudo mount -t lustre <MGS>@o2ib:/lustrefs /mnt/lustre

For example, with a concrete MGS NID:

sudo mount -t lustre \
  172.16.0.10@o2ib:/lustrefs \
  /mnt/lustre

Example without an InfiniBand network (TCP LNET):
[root@vbox ~]# mount -t lustre 172.16.0.10@tcp:/lustre /mnt/lustre-client
[root@vbox ~]# 
[root@vbox ~]# # Verify the mount worked
[root@vbox ~]# df -h /mnt/lustre-client
Filesystem                Size  Used Avail Use% Mounted on
172.16.0.10@tcp:/lustre   12G  2.5M   11G   1% /mnt/lustre-client
[root@vbox ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         4.5G        1.9M        4.1G   1% /mnt/lustre-client[MDT:0]
lustre-OST0000_UUID         7.5G        1.2M        7.0G   1% /mnt/lustre-client[OST:0]
lustre-OST0001_UUID         3.9G        1.2M        3.7G   1% /mnt/lustre-client[OST:1]
filesystem_summary:        11.4G        2.4M       10.7G   1% /mnt/lustre-client
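To make the client mount persistent across reboots, an /etc/fstab entry along these lines is typical (substitute your MGS NID and mount point; _netdev delays the mount until the network is up):

172.16.0.10@o2ib:/lustrefs  /mnt/lustre  lustre  defaults,_netdev  0  0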

6. Configuring Node Maps (Access Control)

Node maps allow administrators to restrict Lustre client access based on network or host identity.

1. View current node maps

lctl nodemap_list

2. Create a new node map for trusted clients

lctl nodemap_add trusted_clients

3. Add allowed network range or host

lctl nodemap_add_range trusted_clients 10.0.0.0/24

4. Enable enforcement

lctl set_param nodemap.trusted_clients.admin=1
lctl set_param nodemap.trusted_clients.trust_client_ids=1

5. Restrict default map

lctl set_param nodemap.default.reject_unauthenticated=1

This ensures only IPs in 10.0.0.0/24 can mount and access the Lustre filesystem.


7. Configuring Access Control Lists (ACLs)

Lustre supports standard POSIX ACLs for fine-grained directory and file permissions.

1. Enable ACL support on mount

mount -t lustre -o acl <MGS>@o2ib:/lustrefs /mnt/lustre

2. Verify ACL support

mount | grep lustre

Should show:

/dev/sda on /mnt/lustre type lustre (rw,acl)

3. Set ACLs on directories

setfacl -m u:researcher:rwx /mnt/lustre/projects
setfacl -m g:analysts:rx /mnt/lustre/reports

4. View ACLs

getfacl /mnt/lustre/projects

Sample output:

# file: projects
# owner: root
# group: root
user::rwx
user:researcher:rwx
group::r-x
group:analysts:r-x
mask::rwx
other::---
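To have new files and subdirectories inherit these permissions automatically, you can also set a default ACL on the directory (same hypothetical user as above):

setfacl -d -m u:researcher:rwx /mnt/lustre/projects
getfacl /mnt/lustre/projects    # default: entries now appear alongside the access ACL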

8. Verifying Cluster Health

On all nodes:

lctl ping <MGS>@o2ib
lctl dl
lctl get_param -n net.*.state

Check RDMA performance:

lctl get_param -n o2iblnd.*.stats

Check file system mount from client:

df -h /mnt/lustre

Optional: Check node map enforcement

Try mounting from an unauthorized IP — it should fail:

mount -t lustre <MGS>@o2ib:/lustrefs /mnt/test
mount.lustre: mount <MGS>@o2ib:/lustrefs at /mnt/test failed: Permission denied

9. Common Issues and Troubleshooting

Issue                            Possible Cause                               Resolution
Mount failed: no route to host   IB subnet mismatch or LNET not configured    Verify lnetctl net show and ping -I ib0 between nodes.
Permission denied                Node map restriction active                  Check lctl nodemap_list and ensure the client IP range is allowed.
Slow performance                 RDMA disabled or fallback to TCP             Verify lctl list_nids shows the @o2ib transport.

10. Final Validation Checklist

  • InfiniBand RDMA verified with ibv_rc_pingpong
  • LNET configured for o2ib(ib0)
  • MGS, MDT, and OST mounted successfully
  • Clients connected via @o2ib
  • Node maps restricting unauthorized hosts
  • ACLs correctly enforcing directory-level access

Summary

With RDMA transport, Lustre achieves near line-rate performance while node maps and ACLs enforce robust security and access control. This combination provides a scalable, high-performance, and policy-driven storage environment ideal for AI, HPC, and research workloads.

Mastering Podman: A Comprehensive Guide with Detailed Command Examples

Mastering Podman on Ubuntu: A Comprehensive Guide with Detailed Command Examples

Podman has become a popular alternative to Docker due to its flexibility, security, and rootless operation capabilities. This guide will walk you through the installation process and various advanced usage scenarios of Podman on Ubuntu, providing detailed examples for each command.


1. How to Install Podman

To get started with Podman on Ubuntu, follow these steps:

Update Package Index

Before installing any new software, it’s a good idea to update your package index to ensure you’re getting the latest version of Podman:


sudo apt update

Install Podman

With your package index updated, you can now install Podman. This command will download and install Podman and any necessary dependencies:


sudo apt install podman -y

Example Output:


Reading package lists… Done

Building dependency tree

Reading state information… Done

The following additional packages will be installed:

After this operation, X MB of additional disk space will be used.

Do you want to continue? [Y/n] y

Setting up podman (4.0.2) …

Verifying Installation

After installation, verify that Podman is installed correctly:


podman --version

Example Output:


podman version 4.0.2

2. How to Search for Images

Before running a container, you may need to find an appropriate image. Podman allows you to search for images in various registries.

Search Docker Hub

To search for images on Docker Hub:


podman search ubuntu

Example Output:


INDEX NAME DESCRIPTION STARS OFFICIAL AUTOMATED

docker.io docker.io/library/ubuntu Ubuntu is a Debian-based Linux operating sys… 12329 [OK]

docker.io docker.io/ubuntu-upstart Upstart is an event-based replacement for the … 108 [OK]

docker.io docker.io/tutum/ubuntu Ubuntu image with SSH access. For the root p… 39

docker.io docker.io/ansible/ubuntu14.04-ansible Ubuntu 14.04 LTS with ansible 9 [OK]

This command will return a list of Ubuntu images available in Docker Hub.

3. How to Run Rootless Containers

One of the key features of Podman is the ability to run containers without needing root privileges, enhancing security.

Running a Rootless Container

As a non-root user, you can run a container like this:

podman run --rm -it ubuntu

Example Output:


root@d2f56a8d1234:/#

This command runs an Ubuntu container in an interactive shell, without requiring root access on the host system.

Configuring Rootless Environment

Ensure your user is added to the subuid and subgid files for proper UID/GID mapping:

echo "$USER:100000:65536" | sudo tee -a /etc/subuid /etc/subgid

Example Output:


user:100000:65536

user:100000:65536

4. How to Search for Containers

Once you start using containers, you may need to find specific ones.

Listing All Containers

To list all containers (both running and stopped):


podman ps -a

Example Output:


CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

d13c5bcf30fd docker.io/library/ubuntu:latest 3 minutes ago Exited (0) 2 minutes ago confident_mayer

Filtering Containers

You can filter containers by their status, names, or other attributes. For instance, to find running containers:

podman ps --filter status=running

Example Output:


CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

No output indicates there are no running containers at the moment.

5. How to Add Ping to Containers

Some minimal Ubuntu images don’t come with ping installed. Here’s how to add it.

Installing Ping in an Ubuntu Container

First, start an Ubuntu container:

podman run -it --cap-add=CAP_NET_RAW ubuntu

Inside the container, install ping (part of the iputils-ping package):


apt update

apt install iputils-ping

Example Output:


Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]

Setting up iputils-ping (3:20190709-3) …

Now you can use ping within the container.

6. How to Expose Ports

Exposing ports is crucial for running services that need to be accessible from outside the container.

Exposing a Port

To expose a port, use the -p flag with the podman run command:

podman run -d -p 8080:80 ubuntu bash -c "apt update && apt install -y nginx && nginx -g 'daemon off;'"

Example Output:


54c11dff6a8d9b6f896028f2857c6d74bda60f61ff178165e041e5e2cb0c51c8

This command runs an Ubuntu container, installs Nginx, and exposes port 80 in the container as port 8080 on the host.
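To check that the service is reachable from the host once Nginx has finished installing inside the container, a quick test:

curl -I http://localhost:8080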

Exposing Multiple Ports

You can expose multiple ports by specifying additional -p flags:

podman run -d -p 8080:80 -p 443:443 ubuntu bash -c "apt update && apt install -y nginx && nginx -g 'daemon off;'"

Example Output:


b67f7d89253a4e8f0b5f64dcb9f2f1d542973fbbce73e7cdd6729b35e0d1125c

7. How to Create a Network

Creating a custom network allows you to isolate containers and manage their communication.

Creating a Network

To create a new network:


podman network create mynetwork

Example Output:


mynetwork

This command creates a new network named mynetwork.

Running a Container on a Custom Network

podman run -d --network mynetwork ubuntu bash -c "apt update && apt install -y nginx && nginx -g 'daemon off;'"

Example Output:


1e0d2fdb110c8e3b6f2f4f5462d1c9b99e9c47db2b16da6b2de1e4d9275c2a50

This container will now communicate with others on the mynetwork network.

8. How to Connect a Network Between Pods

Podman allows you to manage pods, which are groups of containers sharing the same network namespace.

Creating a Pod and Adding Containers

podman pod create mypod

podman run -dt --pod mypod ubuntu bash -c "apt update && apt install -y nginx && nginx -g 'daemon off;'"

podman run -dt --pod mypod ubuntu bash -c "apt update && apt install -y redis-server && redis-server"

Example Output:


f04d1c28b030f24f3f7b91f9f68d07fe1e6a2d81caeb60c356c64b3f7f7412c7

8cf540eb8e1b0566c65886c684017d5367f2a167d82d7b3b8c3496cbd763d447

4f3402b31e20a07f545dbf69cb4e1f61290591df124bdaf736de64bc3d40d4b1

Both containers now share the same network namespace and can communicate over the mypod network.
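To confirm the pod and its containers are up, you can list pods and show which pod each container belongs to:

podman pod ps
podman ps --pod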

Connecting Pods to a Network

To connect a pod to an existing network:

podman pod create --network mynetwork mypod

Example Output:


f04d1c28b030f24f3f7b91f9f68d07fe1e6a2d81caeb60c356c64b3f7f7412c7

This pod will use the mynetwork network, allowing communication with other containers on that network.

9. How to Inspect a Network

Inspecting a network provides detailed information about the network configuration and connected containers.

Inspecting a Network

Use the podman network inspect command:


podman network inspect mynetwork

Example Output:


[
  {
    "name": "mynetwork",
    "id": "3c0d6e2eaf3c4f3b98a71c86f7b35d10b9d4f7b749b929a6d758b3f76cd1f8c6",
    "driver": "bridge",
    "network_interface": "cni-podman0",
    "created": "2024-08-12T08:45:24.903716327Z",
    "subnets": [
      {
        "subnet": "10.88.1.0/24",
        "gateway": "10.88.1.1"
      }
    ],
    "ipv6_enabled": false,
    "internal": false,
    "dns_enabled": true,
    "network_dns_servers": [
      "8.8.8.8"
    ]
  }
]

This command will display detailed JSON output, including network interfaces, IP ranges, and connected containers.

10. How to Add a Static Address

Assigning a static IP address can be necessary for consistent network configurations.

Assigning a Static IP

When running a container, you can assign it a static IP address within a custom network:

podman run -d --network mynetwork --ip 10.88.1.100 ubuntu bash -c "apt update && apt install -y nginx && nginx -g 'daemon off;'"

Example Output:


f05c2f18e41b4ef3a76a7b2349db20c10d9f2ff09f8c676eb08e9dc92f87c216

Ensure that the IP address is within the subnet range of your custom network.

11. How to Log On to a Container with a Shell

Accessing a container’s shell is often necessary for debugging or managing running applications.

Starting a Container with a Shell

If the container image includes a shell (the Ubuntu image defaults to bash), you can start it directly:

podman run -it ubuntu

Example Output:


root@e87b469f2e45:/#

Accessing a Running Container

To access an already running container:

podman exec -it <container_id> bash

Replace <container_id> with the actual ID or name of the container.

Example Output:


root@d2f56a8d1234:/#


What the Hell is Helm? (And Why You Should Care)

What the Hell is Helm?

If you’re tired of managing 10+ YAML files for every app, Helm is your new best friend. It’s basically the package manager for Kubernetes — like apt or brew — but for deploying apps in your cluster.

Instead of editing raw YAML over and over for each environment (dev, staging, prod), Helm lets you template it, inject dynamic values, and install with a single command.


Why Use Helm?

Here’s the reality:

  • You don’t want to maintain 3 sets of YAMLs for each environment.
  • You want to roll back fast if something breaks.
  • You want to reuse deployments across projects without rewriting.

Helm fixes all that. It gives you:

  • Templated YAML (no more copy-paste hell)
  • One chart, many environments
  • Version control + rollback support
  • Easy upgrades with helm upgrade
  • Access to thousands of ready-made charts from the community

Real Talk: What’s a Chart?

Think of a chart like a folder of YAML files with variables in it. You install it, pass in your config (values.yaml), and Helm renders the final manifests and applies them to your cluster.

When you install a chart, Helm creates a release — basically, a named instance of the chart running in your cluster.


How to Get Started (No BS)

1. Install Helm

brew install helm       # mac  
choco install kubernetes-helm  # windows  
sudo snap install helm  # linux  

2. Create Your First Chart

helm create myapp

Boom — you now have a scaffolded chart in a folder with templates, a values file, and everything else you need.


Folder Breakdown

myapp/
├── Chart.yaml         # Metadata
├── values.yaml        # Config you can override
└── templates/         # All your actual Kubernetes YAMLs (as templates)

Example Template (deployment.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-app
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        ports:
        - containerPort: 80

Example values.yaml

replicaCount: 3
image:
  repository: nginx
  tag: latest

Change the values, re-deploy, and you’re done.
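You can also override individual values at install or upgrade time without touching values.yaml (the flag values here are just illustrative):

helm upgrade --install my-release ./myapp --set replicaCount=5 --set image.tag=1.27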


Deploying to Your Cluster

helm install my-release ./myapp

Upgrading later?

helm upgrade my-release ./myapp -f prod-values.yaml

Roll it back?

helm rollback my-release 1

Uninstall it?

helm uninstall my-release

Simple. Clean. Versioned.
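To see what is deployed and what revisions exist for a release:

helm list
helm history my-release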


Want a Database?

Don’t write your own MySQL config. Just pull it from Bitnami’s chart repo:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-db bitnami/mysql

Done.


Helm lets you:

  • Turn your Kubernetes YAML into reusable templates
  • Manage config per environment without duplicating files
  • Version your deployments and roll back instantly
  • Install apps like MySQL, Redis, etc., with one command

It’s the smart way to scale your Kubernetes setup without losing your mind.

How to re-enable the TempURL in the latest cPanel


As some of you have noticed, the new cPanel ships by default with a bunch of settings that nobody likes.

The FTP server is not configured out of the box.

TempURL is disabled for security reasons. Under certain conditions, a user can attack another user’s account if they access a malicious script through a mod_userdir URL.

So they removed it by default.


They did not provide instructions for people who still need it. You can easily enable it, BUT PHP won't work on the temp URL unless you do the following.

Remove the modules below (and by "remove" I mean you need to recompile EasyApache 4 with the following changes):

mod_ruid2
mod_passenger
mod_mpm_itk
mod_proxy_fcgi
mod_fcgid

Install

mod_suexec
mod_suphp

Then go into the Apache mod_userdir Tweak, enable it, and exclude the default host only.
It won't save the setting in the portal, but the configuration is updated. If you go back and look, it will appear as though the settings didn't take. This looks like a bug in cPanel's front end that they need to fix.

Then PHP will work again on the TempURL.


How to integrate VROPS with Ansible

Automating VMware vRealize Operations (vROps) with Ansible

In the world of IT operations, automation is the key to efficiency and consistency. VMware’s vRealize Operations (vROps) provides powerful monitoring and management capabilities for virtualized environments. Integrating vROps with Ansible, an open-source automation tool, can take your infrastructure management to the next level. In this blog post, we’ll explore how to achieve this integration and demonstrate its benefits with a practical example.

What is vRealize Operations (vROps)?

vRealize Operations (vROps) is a comprehensive monitoring and analytics solution from VMware. It helps IT administrators manage the performance, capacity, and overall health of their virtual environments. Key features of vROps include:

  • Performance Monitoring: Continuous tracking of VMs, hosts, and other resources.
  • Capacity Management: Planning and optimizing resource usage.
  • Troubleshooting: Identifying and resolving issues promptly.
  • Automated Actions: Responding to specific events with predefined actions.

Why Integrate vROps with Ansible?

Integrating vROps with Ansible allows you to automate routine tasks, enforce consistent configurations, and rapidly respond to changes or issues in your virtual environment. This integration enables you to:

  • Automate Monitoring Setup: Configure monitoring for new virtual machines or environments automatically.
  • Trigger Remediation Actions: Automate responses to alerts generated by vROps.
  • Generate Reports: Automate the creation and distribution of performance and capacity reports.
  • Maintain Configuration Compliance: Ensure consistent vROps configurations across environments.

Setting Up the Integration

Prerequisites

Before you start, ensure you have:

1. vROps Environment: A running instance of VMware vRealize Operations.
2. Ansible Installed: Ansible should be installed on your control node.

Step-by-Step Guide

Step 1: Configure API Access in vROps

First, ensure you have the necessary API access in vROps. You’ll need:

  • vROps Host: The URL of your vROps instance.
  • vROps Username: A user with API access permissions.
  • vROps Password: The password for the above user.

Step 2: Install Ansible

If you haven’t installed Ansible yet, you can do so by following these commands:


sudo apt update

sudo apt install ansible

Step 3: Create an Ansible Playbook

Create an Ansible playbook to interact with vROps. Below is an example playbook that retrieves the status of vROps resources.

Note: to use the other API endpoints, you will need to acquire the auth token and set it as a fact to pass to later tasks.

Example

If you want to acquire the auth token:

- name: Authenticate with vROps and Check vROps Status
  hosts: localhost
  vars:
    vrops_host: "your-vrops-host"
    vrops_username: "your-username"
    vrops_password: "your-password"
  tasks:
    - name: Authenticate with vROps
      uri:
        url: "https://{{ vrops_host }}/suite-api/api/auth/token/acquire"
        method: POST
        body_format: json
        body:
          username: "{{ vrops_username }}"
          password: "{{ vrops_password }}"
        headers:
          Content-Type: "application/json"
        validate_certs: no
      register: auth_response

    - name: Fail if authentication failed
      fail:
        msg: "Authentication with vROps failed: {{ auth_response.json }}"
      when: auth_response.status != 200

    - name: Set auth token as fact
      set_fact:
        auth_token: "{{ auth_response.json.token }}"

    - name: Get vROps status
      uri:
        url: "https://{{ vrops_host }}/suite-api/api/resources"
        method: GET
        headers:
          Authorization: "vRealizeOpsToken {{ auth_token }}"
          Content-Type: "application/json"
        validate_certs: no
      register: vrops_response

    - name: Display vROps status
      debug:
        msg: "vROps response: {{ vrops_response.json }}"

Save this playbook to a file, for example, check_vrops_status.yml.

Step 4: Define Variables

Create a variables file to store your vROps credentials and host information.
Save it as vars.yml:

vrops_host: your-vrops-host

vrops_username: your-username

vrops_password: your-password
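Since this file contains credentials, it is worth encrypting it; a minimal sketch using Ansible Vault:

ansible-vault encrypt vars.yml
ansible-playbook -e @vars.yml --ask-vault-pass check_vrops_status.yml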

Step 5: Run the Playbook

Execute the playbook using the following command:


ansible-playbook -e @vars.yml check_vrops_status.yml

The above command runs the playbook and retrieves the status of vROps resources, displaying the results from the first example.

Here are some of the key API functions you can use:

Authentication: to use the endpoints listed below, you will need to acquire the auth token and set it as a fact to pass to other tasks inside Ansible.

 Login: Authenticate and get a session token.
 Endpoint: POST /suite-api/api/auth/token/acquire
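For reference, acquiring the token outside Ansible looks like this (a curl sketch; substitute your host and credentials, and drop -k once certificates are trusted):

curl -k -X POST "https://your-vrops-host/suite-api/api/auth/token/acquire" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"username": "your-username", "password": "your-password"}'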

Resource Management

 Get Resources: Retrieve a list of resources managed by vROps.
 Endpoint: GET /suite-api/api/resources
 Get Resource by ID: Retrieve details of a specific resource.
 Endpoint: GET /suite-api/api/resources/{resourceId}
 Create Resource: Add a new resource to vROps.
 Endpoint: POST /suite-api/api/resources
 Update Resource: Update information for an existing resource.
 Endpoint: PUT /suite-api/api/resources/{resourceId}
 Delete Resource: Remove a resource from vROps.
 Endpoint: DELETE /suite-api/api/resources/{resourceId}

Metrics and Data

 Get Metrics for a Resource: Retrieve metrics for a specific resource.
 Endpoint: GET /suite-api/api/resources/{resourceId}/stats
 Get Metric Definitions: List available metrics for a resource kind.
 Endpoint: GET /suite-api/api/resources/kind/{resourceKindKey}/statkeys
 Get Historical Metrics: Retrieve historical metric data for a resource.
 Endpoint: GET /suite-api/api/resources/{resourceId}/stats/historical

Alerts and Notifications

 Get Alerts: Retrieve a list of alerts.
 Endpoint: GET /suite-api/api/alerts
 Get Alert by ID: Retrieve details of a specific alert.
 Endpoint: GET /suite-api/api/alerts/{alertId}
 Acknowledge Alert: Acknowledge a specific alert.
 Endpoint: POST /suite-api/api/alerts/{alertId}/acknowledge
 Cancel Alert: Cancel a specific alert.
 Endpoint: POST /suite-api/api/alerts/{alertId}/cancel
 Generate Notifications: Send notifications based on specific conditions.
 Endpoint: POST /suite-api/api/notifications

Policies and Configurations

 Get Policies: Retrieve a list of policies.
 Endpoint: GET /suite-api/api/policies
 Get Policy by ID: Retrieve details of a specific policy.
 Endpoint: GET /suite-api/api/policies/{policyId}
 Create Policy: Add a new policy.
 Endpoint: POST /suite-api/api/policies
 Update Policy: Update an existing policy.
 Endpoint: PUT /suite-api/api/policies/{policyId}
 Delete Policy: Remove a policy.
 Endpoint: DELETE /suite-api/api/policies/{policyId}

Dashboards and Reports

 Get Dashboards: Retrieve a list of dashboards.
 Endpoint: GET /suite-api/api/dashboards
 Get Dashboard by ID: Retrieve details of a specific dashboard.
 Endpoint: GET /suite-api/api/dashboards/{dashboardId}
 Create Dashboard: Add a new dashboard.
 Endpoint: POST /suite-api/api/dashboards
 Update Dashboard: Update an existing dashboard.
 Endpoint: PUT /suite-api/api/dashboards/{dashboardId}
 Delete Dashboard: Remove a dashboard.
 Endpoint: DELETE /suite-api/api/dashboards/{dashboardId}
 Get Reports: Retrieve a list of reports.
 Endpoint: GET /suite-api/api/reports
 Generate Report: Generate a new report based on a template.
 Endpoint: POST /suite-api/api/reports/{reportTemplateId}/generate
 Get Report by ID: Retrieve details of a specific report.
 Endpoint: GET /suite-api/api/reports/{reportId}

Capacity and Utilization

 Get Capacity Remaining: Retrieve remaining capacity for a specific resource.
 Endpoint: GET /suite-api/api/resources/{resourceId}/capacity/remaining
 Get Capacity Usage: Retrieve capacity usage for a specific resource.
 Endpoint: GET /suite-api/api/resources/{resourceId}/capacity/usage

Additional Functionalities

 Get Custom Groups: Retrieve a list of custom groups.
 Endpoint: GET /suite-api/api/groups
 Create Custom Group: Add a new custom group.
 Endpoint: POST /suite-api/api/groups
 Update Custom Group: Update an existing custom group.
 Endpoint: PUT /suite-api/api/groups/{groupId}
 Delete Custom Group: Remove a custom group.
 Endpoint: DELETE /suite-api/api/groups/{groupId}
 Get Recommendations: Retrieve a list of recommendations.
 Endpoint: GET /suite-api/api/recommendations
 Get Recommendation by ID: Retrieve details of a specific recommendation.
 Endpoint: GET /suite-api/api/recommendations/{recommendationId}

These are just a few examples of the many functions available through the vROps REST API.


More Cheat Sheet for DevOps Engineers

More Cheat Sheet for DevOps Engineers

This guide is focused entirely on the most commonly used Kubernetes YAML examples and why you’d use them in a production or staging environment. These YAML definitions act as the foundation for automating, scaling, and managing containerized workloads.


1. Pod YAML (Basic Unit of Execution)

Use this when you want to run a single container on the cluster.

apiVersion: v1
kind: Pod
metadata:
  name: simple-pod
spec:
  containers:
  - name: nginx
    image: nginx

This is the most basic unit in Kubernetes. Ideal for testing and debugging.
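To try it (assuming the manifest is saved as simple-pod.yaml):

kubectl apply -f simple-pod.yaml
kubectl get pod simple-pod -o wide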


2. Deployment YAML (For Scaling and Updates)

Use deployments to manage stateless apps with rolling updates and replicas.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.21

3. Production-Ready Deployment Example

Use this to deploy a resilient application with health checks and resource limits.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
  labels:
    app: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp-container
        image: myorg/myapp:2.1.0
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"

4. Service YAML (Stable Networking Access)

apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP

5. ConfigMap YAML (External Configuration)

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "debug"
  FEATURE_FLAG: "true"

6. Secret YAML (Sensitive Information)

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
stringData:
  password: supersecret123
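The equivalent imperative command, if you prefer not to keep the value in a YAML file:

kubectl create secret generic app-secret --from-literal=password=supersecret123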

7. PersistentVolumeClaim YAML (For Storage)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

8. Job YAML (Run Once Tasks)

apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
spec:
  template:
    spec:
      containers:
      - name: hello
        image: busybox
        command: ["echo", "Hello World"]
      restartPolicy: Never

9. CronJob YAML (Recurring Tasks)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scheduled-task
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: busybox
            args: ["/bin/sh", "-c", "echo Scheduled Job"]
          restartPolicy: OnFailure

10. Ingress YAML (Routing External Traffic)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80

11. NetworkPolicy YAML (Security Control)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-nginx
spec:
  podSelector:
    matchLabels:
      app: nginx
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend

Cheat Sheet for DevOps Engineers

kubectl Cheat Sheet – In-Depth Guide

Managing Kubernetes clusters efficiently requires fluency with kubectl, the command-line interface for interacting with the Kubernetes API. Whether you’re deploying applications, viewing logs, or debugging infrastructure, this tool is your gateway to smooth cluster operations.

This in-depth cheat sheet will give you a comprehensive reference of how to use kubectl effectively in real-world operations, including advanced flags, filtering tricks, rolling updates, patching, output formatting, and resource exploration.


Shell Autocompletion

Boost productivity with shell autocompletion for kubectl:

Bash

source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc

ZSH

source <(kubectl completion zsh)
echo '[[ $commands[kubectl] ]] && source <(kubectl completion zsh)' >> ~/.zshrc

Aliases

alias k=kubectl
complete -o default -F __start_kubectl k

Working with Contexts & Clusters

kubectl config view                        # Show merged config
kubectl config get-contexts               # List contexts
kubectl config current-context            # Show current context
kubectl config use-context my-context     # Switch to context

Set default namespace for future commands:

kubectl config set-context --current --namespace=my-namespace

Deployments & YAML Management

kubectl apply -f ./manifest.yaml          # Apply resource(s)
kubectl apply -f ./dir/                   # Apply all YAMLs in directory
kubectl create deployment nginx --image=nginx
kubectl explain pod                       # Show pod schema

Apply resources from multiple sources:

kubectl apply -f ./one.yaml -f ./two.yaml
kubectl apply -f https://example.com/config.yaml

Viewing & Finding Resources

kubectl get pods                          # List all pods
kubectl get pods -o wide                  # Detailed pod listing
kubectl get services                      # List services
kubectl describe pod my-pod               # Detailed pod info

Filter & sort:

kubectl get pods --field-selector=status.phase=Running
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'

Find pod labels:

kubectl get pods --show-labels

Updating & Rolling Deployments

kubectl set image deployment/web nginx=nginx:1.19       # Set new image
kubectl rollout status deployment/web                   # Watch rollout
kubectl rollout history deployment/web                  # Show revisions
kubectl rollout undo deployment/web                     # Undo last change
kubectl rollout undo deployment/web --to-revision=2     # Revert to specific

Editing, Scaling, Deleting

kubectl edit deployment/web                    # Live edit YAML
kubectl scale deployment/web --replicas=5      # Scale up/down
kubectl delete pod my-pod                      # Delete by name
kubectl delete -f pod.yaml                     # Delete from file

Logs, Execs & Debugging

kubectl logs my-pod                            # View logs
kubectl logs my-pod -c container-name          # Logs from container
kubectl logs my-pod --previous                 # Logs from crashed
kubectl exec my-pod -- ls /                    # Run command
kubectl exec -it my-pod -- bash                # Shell access

Working with Services

kubectl expose pod nginx --port=80 --target-port=8080   # Create service
kubectl port-forward svc/my-service 8080:80             # Forward to local

Resource Metrics

kubectl top pod                              # Pod metrics
kubectl top pod --sort-by=cpu                # Sort by CPU
kubectl top node                             # Node metrics

Patching Resources

Strategic merge patch:

kubectl patch deployment my-deploy -p '{"spec":{"replicas":4}}'

JSON patch with array targeting:

kubectl patch pod my-pod --type='json' -p='[
  {"op": "replace", "path": "/spec/containers/0/image", "value":"nginx:1.21"}
]'

Cluster & Node Management

kubectl cordon my-node                        # Prevent new pods
kubectl drain my-node                         # Evict pods for maintenance
kubectl uncordon my-node                      # Resume scheduling
kubectl get nodes                             # List all nodes
kubectl describe node my-node                 # Node details

Output Formatting

kubectl get pods -o json
kubectl get pods -o yaml
kubectl get pods -o custom-columns="NAME:.metadata.name,IMAGE:.spec.containers[*].image"

Exploring API Resources

kubectl api-resources                         # List all resources
kubectl api-resources --namespaced=false      # Non-namespaced
kubectl api-resources -o wide                 # Extended info

Logging & Verbosity

kubectl get pods -v=6                         # Debug output
kubectl get deployment -v=9                   # Full API trace

Deployment Template Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: my-namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
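
To use the template, save it (for example as nginx-deployment.yaml, an illustrative filename), apply it, and watch the rollout:

kubectl apply -f nginx-deployment.yaml
kubectl -n my-namespace rollout status deployment/nginx-deployment
kubectl -n my-namespace get pods -l app=nginx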

Deploying Production-Grade Systems on Oracle Cloud Infrastructure (OCI) with Terraform

Launching a virtual machine is easy. Running secure, reliable, production-grade systems is not. This guide shows how to deploy enterprise-ready compute infrastructure on Oracle Cloud Infrastructure (OCI) using Terraform, with a focus on security, fault tolerance, and long-term operability.


What “Production-Grade” Actually Means

A production environment is defined by predictability, not convenience. Production systems must survive failures, scale safely, and be observable at all times. At a minimum, that means:

  • Private networking by default
  • No public SSH access
  • Replaceable compute instances
  • Persistent storage separated from OS
  • Infrastructure defined as code

Target Architecture Overview

  • Private VCN and subnet with NAT gateway for outbound access
  • Network Security Groups (NSGs) with explicit rules
  • Flex compute shape
  • Detached block storage with a paravirtualized attachment
  • SSH key authentication only

This architecture is suitable for:

  • SaaS backends
  • Internal APIs
  • Databases
  • AI / ML inference nodes
  • HPC control or login nodes

Terraform: Provider Configuration


terraform {
  required_version = ">= 1.6"

  required_providers {
    oci = {
      source  = "oracle/oci"
      version = ">= 5.0.0"
    }
  }
}

provider "oci" {
  tenancy_ocid     = var.tenancy_ocid
  user_ocid        = var.user_ocid
  fingerprint      = var.fingerprint
  private_key_path = var.private_key_path
  region           = var.region
}

Pinning the Terraform and provider versions keeps deployments reproducible, and the provider block authenticates with an API signing key rather than interactive console credentials.
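
If you have not yet created an API signing key, one common way to generate it and read its fingerprint uses standard openssl commands (paths are illustrative; the OCI console also displays the fingerprint after you upload the public key):

mkdir -p ~/.oci
openssl genrsa -out ~/.oci/oci_api_key.pem 2048
chmod 600 ~/.oci/oci_api_key.pem
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem

# Fingerprint of the key (matches what the console shows after upload)
openssl rsa -pubout -outform DER -in ~/.oci/oci_api_key.pem | openssl md5 -c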


Variables


variable "tenancy_ocid" {
  description = "OCID of the tenancy"
  type        = string
}

variable "user_ocid" {
  description = "OCID of the user calling the API"
  type        = string
}

variable "fingerprint" {
  description = "Fingerprint of the API signing key"
  type        = string
}

variable "private_key_path" {
  description = "Path to the private key for API authentication"
  type        = string
}

variable "region" {
  description = "OCI region identifier"
  type        = string
}

variable "compartment_ocid" {
  description = "OCID of the compartment for resources"
  type        = string
}

variable "image_ocid" {
  description = "OCID of the compute image (e.g., Oracle Linux 8)"
  type        = string
}

variable "ssh_public_key" {
  description = "Path to SSH public key file"
  type        = string
}

variable "allowed_cidr" {
  description = "CIDR block allowed to access instances (e.g., VPN range)"
  type        = string
  default     = "10.0.0.0/16"
}

Data Sources


data "oci_identity_availability_domains" "ads" {
  compartment_id = var.tenancy_ocid
}

locals {
  availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
}

This retrieves the list of availability domains in your region. We select the first AD for simplicity, but production deployments should consider multi-AD placement.
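
As a minimal sketch of round-robin multi-AD placement (the count-based instance resource it would live in is hypothetical and not part of this guide's single-instance example):

locals {
  # All availability domain names in the region
  ad_names = [for ad in data.oci_identity_availability_domains.ads.availability_domains : ad.name]
}

# Inside a count-based instance resource:
#   availability_domain = local.ad_names[count.index % length(local.ad_names)]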


Virtual Cloud Network (VCN)


resource "oci_core_vcn" "prod_vcn" {
  cidr_blocks    = ["10.0.0.0/16"]
  display_name   = "prod-vcn"
  dns_label      = "prodvcn"
  compartment_id = var.compartment_ocid
}

A /16 CIDR allows future expansion without redesign. VCNs act as the first isolation boundary for production systems.
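
Terraform's cidrsubnet() function makes it easy to carve further /24 subnets out of the /16 later; the subnet names and indexes below are illustrative:

locals {
  db_subnet_cidr    = cidrsubnet("10.0.0.0/16", 8, 2) # 10.0.2.0/24
  cache_subnet_cidr = cidrsubnet("10.0.0.0/16", 8, 3) # 10.0.3.0/24
}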


Internet Gateway


resource "oci_core_internet_gateway" "igw" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.prod_vcn.id
  display_name   = "prod-igw"
  enabled        = true
}

In OCI the NAT gateway provides outbound access on its own, so the internet gateway is not required for the private subnet's egress. It is included here for future public-facing components, such as a bastion or load-balancer subnet, and it does not expose the private instances directly.
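
If you later add a public subnet for such components, it gets its own route table pointing at the internet gateway; a minimal sketch (names are illustrative):

resource "oci_core_route_table" "public_rt" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.prod_vcn.id
  display_name   = "public-route-table"

  route_rules {
    destination       = "0.0.0.0/0"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_internet_gateway.igw.id
  }
}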


NAT Gateway


resource "oci_core_nat_gateway" "nat_gw" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.prod_vcn.id
  display_name   = "prod-nat-gw"
  block_traffic  = false
}

The NAT gateway allows private subnet instances to reach the internet for package updates and external API calls without exposing inbound access.


Route Tables


resource "oci_core_route_table" "private_rt" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.prod_vcn.id
  display_name   = "private-route-table"

  route_rules {
    destination       = "0.0.0.0/0"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_nat_gateway.nat_gw.id
  }
}

All outbound traffic from the private subnet routes through the NAT gateway. This ensures instances can reach external resources without being directly accessible.


Private Subnet (No Public IPs)


resource "oci_core_subnet" "private_subnet" {
  cidr_block                 = "10.0.1.0/24"
  vcn_id                     = oci_core_vcn.prod_vcn.id
  compartment_id             = var.compartment_ocid
  display_name               = "private-subnet"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.private_rt.id
  dns_label                  = "private"
}

Instances in this subnet are never reachable from the internet. Access must go through a bastion, VPN, or private load balancer.
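
For example, with a bastion host your workstation can reach (hostname and addresses below are illustrative), SSH can hop through it with ProxyJump:

# One-off jump through the bastion to the private instance
ssh -J opc@bastion.example.com opc@10.0.1.10

# Or make it permanent in ~/.ssh/config:
#   Host prod-app-01
#     HostName 10.0.1.10
#     User opc
#     ProxyJump opc@bastion.example.com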


Network Security Groups (NSG)


resource "oci_core_network_security_group" "app_nsg" {
  compartment_id = var.compartment_ocid
  vcn_id         = oci_core_vcn.prod_vcn.id
  display_name   = "app-nsg"
}

# Allow SSH from internal network only
resource "oci_core_network_security_group_security_rule" "allow_ssh" {
  network_security_group_id = oci_core_network_security_group.app_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6" # TCP

  source      = var.allowed_cidr
  source_type = "CIDR_BLOCK"

  tcp_options {
    destination_port_range {
      min = 22
      max = 22
    }
  }
}

# Allow HTTPS from internal network
resource "oci_core_network_security_group_security_rule" "allow_https" {
  network_security_group_id = oci_core_network_security_group.app_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6" # TCP

  source      = var.allowed_cidr
  source_type = "CIDR_BLOCK"

  tcp_options {
    destination_port_range {
      min = 443
      max = 443
    }
  }
}

# Allow all outbound traffic
resource "oci_core_network_security_group_security_rule" "allow_egress" {
  network_security_group_id = oci_core_network_security_group.app_nsg.id
  direction                 = "EGRESS"
  protocol                  = "all"

  destination      = "0.0.0.0/0"
  destination_type = "CIDR_BLOCK"
}

# Allow ICMP for path MTU discovery
resource "oci_core_network_security_group_security_rule" "allow_icmp" {
  network_security_group_id = oci_core_network_security_group.app_nsg.id
  direction                 = "INGRESS"
  protocol                  = "1" # ICMP

  source      = "10.0.0.0/16"
  source_type = "CIDR_BLOCK"

  icmp_options {
    type = 3
    code = 4
  }
}

NSGs provide service-level firewalling and are preferred over subnet-wide security lists. These rules allow SSH and HTTPS only from your internal network, while permitting all outbound traffic.


Compute Instance (Flex Shape)


resource "oci_core_instance" "prod_instance" {
  availability_domain = local.availability_domain
  compartment_id      = var.compartment_ocid
  display_name        = "prod-app-01"
  shape               = "VM.Standard.E4.Flex"

  shape_config {
    ocpus         = 2
    memory_in_gbs = 16
  }

  create_vnic_details {
    subnet_id        = oci_core_subnet.private_subnet.id
    assign_public_ip = false
    nsg_ids          = [oci_core_network_security_group.app_nsg.id]
    hostname_label   = "prod-app-01"
  }

  source_details {
    source_type             = "image"
    source_id               = var.image_ocid
    boot_volume_size_in_gbs = 50
  }

  metadata = {
    ssh_authorized_keys = file(var.ssh_public_key)
  }

  preserve_boot_volume = true
}

Flex shapes let you size CPU and memory independently, giving predictable performance without paying for unused capacity. Setting preserve_boot_volume = true keeps the boot volume when Terraform destroys or replaces the instance, so the OS disk is not lost with it.


Persistent Block Storage


resource "oci_core_volume" "data_volume" {
  availability_domain = local.availability_domain
  compartment_id      = var.compartment_ocid
  display_name        = "prod-data-vol"
  size_in_gbs         = 200
  vpus_per_gb         = 10 # Balanced performance tier
}

resource "oci_core_volume_attachment" "data_attach" {
  attachment_type = "paravirtualized"
  instance_id     = oci_core_instance.prod_instance.id
  volume_id       = oci_core_volume.data_volume.id
  display_name    = "prod-data-attachment"
}

Separating OS and data ensures instances are disposable while data remains protected. Paravirtualized attachments are simpler than iSCSI and work automatically on Oracle Linux.

Post-Deployment: Mounting the Block Volume

After Terraform applies, SSH into the instance and mount the volume:


# Find the attached volume (usually /dev/sdb)
lsblk

# Create filesystem (first time only -- this erases any data already on the volume)
sudo mkfs.xfs /dev/sdb

# Create mount point and mount
sudo mkdir -p /data
sudo mount /dev/sdb /data

# Add to fstab for persistence across reboots
# (the UUID from `sudo blkid /dev/sdb` is more robust than /dev/sdb,
#  since device names can change between reboots)
echo '/dev/sdb /data xfs defaults,_netdev,nofail 0 2' | sudo tee -a /etc/fstab

The _netdev and nofail options ensure the system boots even if the volume is temporarily unavailable.
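
Before relying on a reboot, you can confirm the fstab entry is valid by remounting from it:

sudo umount /data
sudo mount -a          # Re-mounts everything listed in /etc/fstab
df -h /data            # Confirm the volume is mounted again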


Outputs


output "instance_private_ip" {
  description = "Private IP address of the compute instance"
  value       = oci_core_instance.prod_instance.private_ip
}

output "instance_id" {
  description = "OCID of the compute instance"
  value       = oci_core_instance.prod_instance.id
}

output "vcn_id" {
  description = "OCID of the VCN"
  value       = oci_core_vcn.prod_vcn.id
}

output "volume_id" {
  description = "OCID of the data volume"
  value       = oci_core_volume.data_volume.id
}
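
After terraform apply, individual outputs can be read for scripting or documentation:

terraform output instance_private_ip
terraform output -json            # All outputs as JSON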

Security & Operational Checklist

  • ✓ No public SSH access
  • ✓ Key-based authentication only
  • ✓ Private networking with NAT for outbound
  • ✓ Explicit NSG rules (no default allow)
  • ✓ Persistent storage with separate lifecycle
  • ✓ Infrastructure fully defined in code
  • ✓ Boot volume preservation enabled

What to Add Next

  • Bastion Service – OCI’s managed bastion for secure SSH access without VPN
  • Site-to-Site VPN – Connect to on-premises networks
  • OCI Load Balancer – For multi-instance deployments
  • Monitoring and Alerting – OCI Monitoring service with custom alarms
  • Dynamic Groups and IAM policies – Instance principals for secure API access
  • Cloud-init or Ansible – OS hardening and application deployment (a user_data sketch follows this list)
  • CI/CD pipelines – GitOps workflow for Terraform changes
  • Volume backups – Scheduled backup policies for data protection
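
As a starting point for the cloud-init item above, OCI instances accept a base64-encoded user_data key alongside ssh_authorized_keys in the instance metadata; a minimal sketch (the file name is illustrative and its contents are a normal #cloud-config document):

metadata = {
  ssh_authorized_keys = file(var.ssh_public_key)
  user_data           = base64encode(file("${path.module}/cloud-init.yaml"))
}

Cloud-init only runs on first boot, so ongoing configuration changes are better handled by Ansible or a deployment pipeline.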

Example terraform.tfvars


tenancy_ocid     = "ocid1.tenancy.oc1..aaaaaaaaexample"
user_ocid        = "ocid1.user.oc1..aaaaaaaaexample"
fingerprint      = "aa:bb:cc:dd:ee:ff:00:11:22:33:44:55:66:77:88:99"
private_key_path = "~/.oci/oci_api_key.pem"
region           = "eu-frankfurt-1"
compartment_ocid = "ocid1.compartment.oc1..aaaaaaaaexample"
image_ocid       = "ocid1.image.oc1.eu-frankfurt-1.aaaaaaaaexample"
ssh_public_key   = "~/.ssh/id_rsa.pub"
allowed_cidr     = "10.0.0.0/16"

Nick Tailor’s Thoughts

Production infrastructure is not about clicking faster. It is about repeatability, security, and recovery. OCI combined with Terraform provides an extremely strong foundation when engineered correctly from day one.

If you treat infrastructure as software, production becomes predictable.

The complete code from this guide is available as a ready-to-use Terraform module. Clone it, update your variables, and run terraform apply to deploy.
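
A typical workflow from a fresh clone looks like this (the directory name is illustrative):

cd oci-prod-infra
terraform init               # Downloads the oracle/oci provider
terraform plan -out=tfplan   # Review the planned changes
terraform apply tfplan       # Create the resources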