Automating Rocky Linux VM Creation with Packer + VirtualBox

If you’ve ever needed to spin up a clean, minimal Linux VM for testing or local automation — and got tired of clicking through the VirtualBox GUI — this guide is for you.

We’ll walk through how to use HashiCorp Packer and VirtualBox to automatically create a Rocky Linux 8.10 image, ready to boot and use — no Vagrant, no fluff.

What You’ll Need

  • Packer installed
  • VirtualBox installed
  • Rocky Linux 8.10 minimal ISO (Packer downloads it from the URL in the template)
  • Basic understanding of Linux + VirtualBox

Project Structure

packer-rocky/
├── http/
│   └── ks.cfg       # Kickstart file for unattended install
├── rocky.pkr.hcl    # Main Packer config

Step 1: Create the Kickstart File (http/ks.cfg)

install
cdrom
lang en_US.UTF-8
keyboard us
network --bootproto=dhcp
rootpw packer
firewall --disabled
selinux --permissive
timezone UTC
bootloader --location=mbr
text
skipx
zerombr

# Partition disk
clearpart --all --initlabel
part /boot --fstype="xfs" --size=1024
part pv.01 --fstype="lvmpv" --grow
volgroup vg0 pv.01
logvol / --vgname=vg0 --fstype="xfs" --size=10240 --name=root
logvol swap --vgname=vg0 --size=4096 --name=swap

reboot

%packages --ignoremissing
@core
@base
%end

%post
# Post-install steps can be added here
%end

Step 2: Create the Packer HCL Template (rocky.pkr.hcl)

packer {
  required_plugins {
    virtualbox = {
      version = ">= 1.0.5"
      source  = "github.com/hashicorp/virtualbox"
    }
  }
}

source "virtualbox-iso" "rocky" {
  iso_url                 = "https://download.rockylinux.org/pub/rocky/8/isos/x86_64/Rocky-8.10-x86_64-minimal.iso"
  iso_checksum            = "2c735d3b0de921bd671a0e2d08461e3593ac84f64cdaef32e3ed56ba01f74f4b"
  guest_os_type           = "RedHat_64"
  memory                  = 2048
  cpus                    = 2
  disk_size               = 40000
  vm_name                 = "rocky-8"
  headless                = false
  guest_additions_mode    = "disable"
  boot_command            = ["<up><wait><tab> inst.text inst.ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ks.cfg<enter>"]
  http_directory          = "http"
  ssh_username            = "root"
  ssh_password            = "packer"
  ssh_timeout             = "20m"
  shutdown_command        = "shutdown -P now"
  vboxmanage = [
    ["modifyvm", "{{.Name}}", "--vram", "32"],
    ["modifyvm", "{{.Name}}", "--vrde", "off"],
    ["modifyvm", "{{.Name}}", "--ioapic", "on"],   # required for SMP guests (cpus > 1)
    ["modifyvm", "{{.Name}}", "--pae", "off"],
    ["modifyvm", "{{.Name}}", "--nested-hw-virt", "on"]
  ]
}

build {
  sources = ["source.virtualbox-iso.rocky"]
}

Step 3: Run the Build

cd packer-rocky
packer init .
packer build .

Packer will:

  1. Download and boot the ISO in VirtualBox
  2. Serve the ks.cfg file over HTTP
  3. Automatically install Rocky Linux
  4. Power off the machine once complete

Result

You now have a fully installed Rocky Linux 8.10 image in VirtualBox — no manual setup required.

OpenShift Architecture & Migration Design: Building Secure, Scalable Enterprise Platforms

Designing and migrating to OpenShift is not about installing a cluster. It is about controlling failure domains, aligning schedulers, and avoiding hidden infrastructure bottlenecks that only surface under load or during outages.

This post walks through concrete implementation patterns using Terraform and Ansible, explains why they exist, and highlights what will break if you get them wrong.


Migration Strategy: Phased Approach

Every failed migration I have seen skipped or compressed one of these phases. The pressure to “just move it” creates technical debt that surfaces as production incidents.

Phase 1: Discovery and Assessment

Before touching infrastructure, you need a complete inventory of what exists and how it behaves.


# VMware dependency discovery script
# Export VM metadata, network connections, storage mappings

$vms = Get-VM | Select-Object Name, PowerState, NumCpu, MemoryGB,
  @{N='Datastore';E={(Get-Datastore -VM $_).Name}},
  @{N='Network';E={(Get-NetworkAdapter -VM $_).NetworkName}},
  @{N='VMHost';E={$_.VMHost.Name}}

$vms | Export-Csv -Path "vm-inventory.csv" -NoTypeInformation

# Capture network flows for dependency mapping
# Run for minimum 2 weeks to capture batch jobs and monthly processes

What you are looking for:

  • Hard-coded IPs in application configs
  • NFS mounts and shared storage dependencies
  • Inter-VM communication patterns (what talks to what)
  • Authentication integrations (LDAP, AD, service accounts)
  • Scheduled jobs and their timing dependencies

Assessment deliverables:

  • Application dependency map
  • Containerization readiness score per workload
  • Risk register with mitigation strategies
  • Estimated effort per application (T-shirt sizing)
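
Readiness scoring and T-shirt sizing become repeatable when the rubric is written down. A minimal sketch of one possible weighted rubric (the factor names, weights, and thresholds are illustrative assumptions, not a standard model):

```python
# Illustrative containerization-readiness rubric. Factors and weights
# are assumptions for demonstration; calibrate against your own estate.

WEIGHTS = {
    "stateless": 3,        # no local state to untangle
    "config_external": 2,  # reads config from env/files, not baked in
    "logs_to_stdout": 1,
    "no_hardcoded_ips": 2,
    "health_endpoint": 1,
}

def readiness_score(workload: dict) -> float:
    """Return a 0-100 readiness score from boolean assessment answers."""
    earned = sum(w for k, w in WEIGHTS.items() if workload.get(k))
    return round(100 * earned / sum(WEIGHTS.values()), 1)

def tshirt_size(score: float) -> str:
    """Map readiness to migration effort: higher readiness, smaller effort."""
    if score >= 80:
        return "S"
    if score >= 50:
        return "M"
    return "L"

app = {"stateless": True, "config_external": True,
       "logs_to_stdout": True, "no_hardcoded_ips": False,
       "health_endpoint": True}
print(readiness_score(app), tshirt_size(readiness_score(app)))
```

The point is not the specific weights but that two assessors scoring the same workload get the same answer.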

Phase 2: Target Architecture Design

Design the OpenShift environment before migration begins. This includes cluster topology, namespace strategy, and resource quotas.


# Namespace strategy example
# Environments separated by namespace, not cluster

apiVersion: v1
kind: Namespace
metadata:
  name: app-prod
  labels:
    environment: production
    cost-center: "12345"
    data-classification: confidential
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-prod-quota
  namespace: app-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    persistentvolumeclaims: "20"

Phase 3: Pilot Migration

Select two to three non-critical applications that exercise different patterns:

  • One stateless web application
  • One application with persistent storage
  • One application with external integrations

The pilot validates your tooling, processes, and assumptions before you scale.

Phase 4: Wave Migration

Group applications into waves based on dependencies and risk. Each wave should be independently deployable and rollback-capable.


# Wave planning structure
wave_1:
  applications:
    - name: static-website
      risk: low
      dependencies: none
      estimated_downtime: 0
  success_criteria:
    - all pods healthy for 24 hours
    - response times within 10% of baseline
    - zero error rate increase

wave_2:
  applications:
    - name: api-gateway
      risk: medium
      dependencies:
        - static-website
      estimated_downtime: 5 minutes
  gate:
    - wave_1 success criteria met
    - stakeholder sign-off
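
Gates like these only work if they are enforced mechanically. A sketch of a wave-gate check that refuses to start a wave until every gate condition has been explicitly recorded as satisfied (structure mirrors the YAML above; field names are assumptions):

```python
# Minimal wave-gate check: a wave may start only when every one of its
# gate conditions has been explicitly marked satisfied.

waves = {
    "wave_1": {"gate": [], "status": "complete"},
    "wave_2": {"gate": ["wave_1 success criteria met", "stakeholder sign-off"],
               "status": "pending"},
}

satisfied = {"wave_1 success criteria met"}  # approvals recorded so far

def can_start(wave_name: str) -> bool:
    missing = [g for g in waves[wave_name]["gate"] if g not in satisfied]
    if missing:
        print(f"{wave_name} blocked, missing: {missing}")
        return False
    return True

print(can_start("wave_2"))       # blocked until sign-off is recorded
satisfied.add("stakeholder sign-off")
print(can_start("wave_2"))       # now allowed
```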

Phase 5: Cutover and Decommission

Final traffic switch and legacy teardown. This is where DNS TTL planning matters.

Common cutover failures:

  • DNS TTLs not reduced in advance (reduce to 60 seconds, 48 hours before cutover)
  • Client-side caching ignoring TTL
  • Hardcoded IPs in partner systems
  • Certificate mismatches after DNS change
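
The TTL-reduction deadline is just arithmetic on the old TTL: every resolver must have expired the long-TTL record before cutover. A sketch, assuming a typical 24-hour original TTL plus a safety buffer for resolvers that ignore TTLs briefly:

```python
from datetime import datetime, timedelta

# The TTL must be lowered no later than cutover minus the OLD TTL,
# plus a buffer; otherwise some resolvers still hold the long answer.
old_ttl = timedelta(hours=24)      # assumed original record TTL
new_ttl = timedelta(seconds=60)
safety = timedelta(hours=24)       # buffer for misbehaving resolvers

cutover = datetime(2024, 6, 1, 2, 0)
lower_ttl_by = cutover - old_ttl - safety

print("lower TTL no later than:", lower_ttl_by)
print("worst-case propagation after cutover:",
      new_ttl.total_seconds(), "seconds")
```

This is why the 48-hour figure appears above: old TTL plus buffer, not a magic number.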

VMware to OpenShift: Migration Patterns

Not every VM becomes a container. The migration pattern depends on application architecture, not convenience.

Pattern 1: Lift and Containerize

For applications that are already 12-factor compliant or close to it. Package existing binaries into containers with minimal modification.


# Dockerfile for legacy Java application
FROM registry.access.redhat.com/ubi8/openjdk-11-runtime

COPY target/app.jar /deployments/app.jar

ENV JAVA_OPTS="-Xms512m -Xmx2048m"

EXPOSE 8080
CMD ["java", "-jar", "/deployments/app.jar"]

When to use: Application reads config from environment variables, logs to stdout, and has no local state.

Pattern 2: Replatform with Refactoring

Application requires changes to run in containers but core logic remains. Typical changes include externalizing configuration and adding health endpoints.


# Spring Boot health endpoint addition
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      probes:
        enabled: true
      show-details: always

When to use: Application has some container-unfriendly patterns (file-based config, local logging) but is otherwise sound.

Pattern 3: Retain on VM

Some workloads should not be containerized:

  • Legacy applications with kernel dependencies
  • Workloads requiring specific hardware (GPU passthrough, SR-IOV)
  • Applications with licensing tied to VM or physical host
  • Databases with extreme I/O requirements (evaluate case by case)

OpenShift Virtualization (KubeVirt) can run VMs alongside containers when needed.


apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: legacy-app-vm
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: legacy-app-pvc

Pattern 4: Rebuild or Replace

Application is fundamentally incompatible and would require complete rewrite. Evaluate whether a commercial off-the-shelf replacement makes more sense.

Decision matrix:

Factor                 Containerize      Keep on VM        Replace
Strategic value        High              Low/Legacy        Medium
Maintenance cost       Acceptable        High but stable   Unsustainable
12-factor compliance   Partial or full   None              N/A
Vendor support         Available         Legacy only       N/A

Infrastructure Provisioning with Terraform

Why Terraform First

OpenShift installation assumes the underlying infrastructure is deterministic. If VM placement, CPU topology, or networking varies between environments, the cluster will behave differently under identical workloads. Terraform is used to lock infrastructure intent before OpenShift ever runs.


Example: vSphere Control Plane and Worker Provisioning


provider "vsphere" {
  user           = var.vsphere_user
  password       = var.vsphere_password
  vsphere_server = var.vsphere_server
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "dc" {
  name = var.datacenter
}

data "vsphere_compute_cluster" "cluster" {
  name          = var.cluster
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_datastore" "datastore" {
  name          = var.datastore
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "network" {
  name          = var.network
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "control_plane" {
  count            = 3
  name             = "ocp-master-${count.index}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.datastore.id
  folder           = var.vm_folder

  num_cpus = 8
  memory   = 32768
  guest_id = "rhel8_64Guest"

  # Critical: Reservations prevent resource contention
  cpu_reservation    = 8000
  memory_reservation = 32768

  # Anti-affinity rule reference
  depends_on = [vsphere_compute_cluster_vm_anti_affinity_rule.control_plane_anti_affinity]

  network_interface {
    network_id = data.vsphere_network.network.id
  }

  disk {
    label            = "root"
    size             = 120
    thin_provisioned = false  # Thick provisioning for control plane
  }

  disk {
    label            = "etcd"
    size             = 100
    thin_provisioned = false
    unit_number      = 1
  }
}

# Anti-affinity ensures control plane nodes run on different hosts
resource "vsphere_compute_cluster_vm_anti_affinity_rule" "control_plane_anti_affinity" {
  name               = "ocp-control-plane-anti-affinity"
  compute_cluster_id = data.vsphere_compute_cluster.cluster.id
  virtual_machine_ids = [for vm in vsphere_virtual_machine.control_plane : vm.id]
}

CPU and memory reservations are not optional. Without them, vSphere ballooning and scheduling delays will surface as random etcd latency and API instability.

What usually breaks:

  • etcd timeouts under load (etcd requires consistent sub-10ms disk latency)
  • API server flapping during node pressure
  • Unexplained cluster degradation after vMotion events
  • Split-brain scenarios when anti-affinity is not enforced

Worker Node Pools by Workload Type


resource "vsphere_virtual_machine" "workers_general" {
  count = 6
  name  = "ocp-worker-general-${count.index}"

  num_cpus = 8
  memory   = 32768

  # General workers can use thin provisioning
  disk {
    label            = "root"
    size             = 120
    thin_provisioned = true
  }
}

resource "vsphere_virtual_machine" "workers_stateful" {
  count = 3
  name  = "ocp-worker-stateful-${count.index}"

  num_cpus = 16
  memory   = 65536

  # Stateful workers need guaranteed resources
  cpu_reservation    = 16000
  memory_reservation = 65536

  disk {
    label            = "root"
    size             = 120
    thin_provisioned = false
  }
}

resource "vsphere_virtual_machine" "workers_infra" {
  count = 3
  name  = "ocp-worker-infra-${count.index}"

  num_cpus = 8
  memory   = 32768

  # Infrastructure nodes for routers, monitoring, logging
  disk {
    label            = "root"
    size             = 200
    thin_provisioned = false
  }
}

Different workloads require different failure and performance characteristics. Trying to “let Kubernetes figure it out” leads to noisy neighbors and unpredictable latency.


Post-Provision Configuration with Ansible

Why Ansible Is Still Required

Terraform stops at infrastructure. OpenShift nodes require OS-level hardening, kernel tuning, and configuration consistency before installation. Ignoring this step leads to subtle instability that manifests weeks later.


Example: Node OS Hardening


---
- name: Prepare OpenShift nodes
  hosts: openshift_nodes
  become: true
  tasks:

    - name: Disable swap
      command: swapoff -a
      changed_when: false

    - name: Remove swap from fstab
      replace:
        path: /etc/fstab
        regexp: '^([^#].*swap.*)$'
        replace: '# \1'

    - name: Set kernel parameters for OpenShift
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        sysctl_file: /etc/sysctl.d/99-openshift.conf
      loop:
        - { key: net.ipv4.ip_forward, value: 1 }
        - { key: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { key: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { key: vm.max_map_count, value: 262144 }
        - { key: fs.inotify.max_user_watches, value: 1048576 }
        - { key: fs.inotify.max_user_instances, value: 8192 }
        - { key: net.core.somaxconn, value: 32768 }
        - { key: net.ipv4.tcp_max_syn_backlog, value: 32768 }

    - name: Load required kernel modules
      modprobe:
        name: "{{ item }}"
        state: present
      loop:
        - br_netfilter
        - overlay
        - ip_vs
        - ip_vs_rr
        - ip_vs_wrr
        - ip_vs_sh

    - name: Ensure kernel modules load on boot
      copy:
        dest: /etc/modules-load.d/openshift.conf
        content: |
          br_netfilter
          overlay
          ip_vs
          ip_vs_rr
          ip_vs_wrr
          ip_vs_sh

These values are not arbitrary. OpenShift components and container runtimes will fail silently or degrade under load if kernel defaults are used.
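
Consistency matters as much as the values themselves: drift on a single node surfaces as node-specific flakiness. A small verification sketch that diffs desired settings against what a node reports (on a real node the observed values would come from /proc/sys; here a sample dict keeps the check demonstrable):

```python
# Compare desired kernel parameters against observed values.
# Desired values mirror the playbook above.

DESIRED = {
    "net.ipv4.ip_forward": "1",
    "vm.max_map_count": "262144",
    "fs.inotify.max_user_watches": "1048576",
    "net.core.somaxconn": "32768",
}

def sysctl_drift(observed: dict) -> dict:
    """Return {param: (want, have)} for every mismatched setting."""
    return {k: (v, observed.get(k))
            for k, v in DESIRED.items() if observed.get(k) != v}

# Simulate one node that kept the distro default for somaxconn
observed = dict(DESIRED, **{"net.core.somaxconn": "128"})
print(sysctl_drift(observed))
```

Running a check like this per node (or via a read-only Ansible pass with --check) catches drift before it becomes an incident.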


Container Runtime Configuration


- name: Configure CRI-O
  copy:
    dest: /etc/crio/crio.conf.d/99-custom.conf
    content: |
      [crio.runtime]
      default_ulimits = [ "nofile=1048576:1048576" ]
      pids_limit = 4096

      [crio.image]
      pause_image = "registry.redhat.io/openshift4/ose-pod:latest"

- name: Configure container storage
  copy:
    dest: /etc/containers/storage.conf
    content: |
      [storage]
      driver = "overlay"
      runroot = "/run/containers/storage"
      graphroot = "/var/lib/containers/storage"

      [storage.options.overlay]
      mountopt = "nodev,metacopy=on"

Default ulimits are insufficient for high-density clusters. You will hit file descriptor exhaustion before CPU or memory limits.
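
The exhaustion point is easy to estimate before it happens. A back-of-envelope sketch (pod density and per-container descriptor counts are assumptions; profile real workloads before trusting the numbers):

```python
# Rough file-descriptor budget per node. All figures are illustrative
# assumptions for the arithmetic, not measured values.

pods_per_node = 250
containers_per_pod = 2
fds_per_container = 1500   # sockets, log files, inotify watches under load

node_total = pods_per_node * containers_per_pod * fds_per_container
print(f"estimated open files per node: {node_total:,}")

# A single busy container also blows through the classic 1024
# per-process soft limit; hence the nofile=1048576 ulimit above.
print("exceeds 1024 default per process:", fds_per_container > 1024)
```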


Security Architecture

RBAC Design Principles

Role-based access control should follow least privilege. Avoid cluster-admin grants; use namespace-scoped roles.


# Developer role - namespace scoped
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: app-dev
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "configmaps", "secrets", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"]
  # Note: RBAC is purely additive; there is no deny rule. Cluster-scoped
  # resources such as nodes and persistentvolumes are simply never granted
  # by this Role, so requests against them are refused by default.
---
# Operations role - read-only cluster wide, write in specific namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operations-readonly
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces", "persistentvolumes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]

Network Policies

Default deny with explicit allows. Every namespace should have a baseline policy.


# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow ingress from same namespace only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: app-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
---
# Allow ingress from OpenShift router
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-router
  namespace: app-prod
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress

Image Security and Supply Chain


# Image policy to restrict registries
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    allowedRegistries:
      - registry.redhat.io
      - registry.access.redhat.com
      - quay.io
      - ghcr.io
      - registry.internal.example.com
    blockedRegistries:
      - docker.io  # Block Docker Hub for compliance
---
# Require signed images in production
apiVersion: policy.sigstore.dev/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: require-signatures
spec:
  images:
    - glob: "registry.internal.example.com/prod/**"
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          - issuer: https://accounts.google.com
            subject: release-team@example.com

Pod Security Standards


# Enforce restricted security context
apiVersion: v1
kind: Namespace
metadata:
  name: app-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Security Context Constraints for OpenShift
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: app-restricted
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
  - ALL
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim

Networking: Where Most Migrations Fail

Ingress and Load Balancer Alignment

External load balancers must align with OpenShift router expectations. Health checks should target readiness endpoints, not TCP ports.


# HAProxy configuration for OpenShift routers
frontend openshift_router_https
    bind *:443
    mode tcp
    option tcplog
    default_backend openshift_router_https_backend

backend openshift_router_https_backend
    mode tcp
    balance source
    option httpchk GET /healthz/ready HTTP/1.1\r\nHost:\ router-health
    http-check expect status 200
    server router-0 192.168.1.10:443 check port 1936 inter 5s fall 3 rise 2
    server router-1 192.168.1.11:443 check port 1936 inter 5s fall 3 rise 2
    server router-2 192.168.1.12:443 check port 1936 inter 5s fall 3 rise 2

Common failure: Load balancer marks routers healthy while the application is unavailable. TCP health checks pass even when the router pod is terminating.
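
The difference is easy to demonstrate locally: a TCP connect succeeds as long as something is listening, while an HTTP readiness check reflects actual state. A self-contained sketch using Python's standard library (the local server is a stand-in for the router's /healthz/ready endpoint, not the real thing):

```python
import http.server
import socket
import threading
import urllib.error
import urllib.request

READY = False  # flip to simulate the router becoming ready

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 only when actually ready, like /healthz/ready on port 1936
        self.send_response(200 if READY else 503)
        self.end_headers()
    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def tcp_check() -> bool:
    # "Healthy" merely because the port accepts a connection
    with socket.create_connection(("127.0.0.1", port), timeout=2):
        return True

def http_check() -> bool:
    try:
        with urllib.request.urlopen(
                f"http://127.0.0.1:{port}/healthz/ready") as r:
            return r.status == 200
    except urllib.error.HTTPError:
        return False

before_ready = (tcp_check(), http_check())
READY = True
after_ready = (tcp_check(), http_check())
server.shutdown()

print(before_ready, after_ready)  # TCP passes both times; HTTP does not
```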


MTU and Overlay Networking

MTU mismatches between underlay, NSX, and OpenShift overlays cause:

  • Intermittent pod-to-pod packet loss
  • gRPC failures (large payloads fragment incorrectly)
  • Random CI/CD pipeline timeouts
  • TLS handshake failures

# Verify MTU across the path
# Physical network: 9000 (jumbo frames)
# NSX overlay: 8900 (100 byte overhead)
# OpenShift OVN: 8800 (additional 100 byte overhead)

# Test from inside a pod
kubectl exec -it debug-pod -- ping -M do -s 8772 target-service

# If this fails, reduce MTU until it works
# Then configure cluster network appropriately

# OpenShift cluster network configuration
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
  serviceNetwork:
    - 172.30.0.0/16
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      mtu: 8800
      genevePort: 6081

This is almost never diagnosed correctly on first pass. Symptoms look like application bugs.
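
The numbers quoted in the comments above follow from straightforward header arithmetic; a sketch:

```python
# Effective MTU at each layer, using the per-layer overheads quoted above.
physical_mtu = 9000        # jumbo frames on the underlay
nsx_overhead = 100         # NSX encapsulation overhead (per the example)
ovn_overhead = 100         # OVN-Kubernetes Geneve overhead (per the example)

nsx_mtu = physical_mtu - nsx_overhead      # overlay MTU under NSX
cluster_mtu = nsx_mtu - ovn_overhead       # what OVN-Kubernetes should use

# ping -M do adds IP (20 bytes) + ICMP (8 bytes) headers, so the largest
# payload that fits without fragmenting is MTU - 28.
max_ping_payload = cluster_mtu - 20 - 8

print(nsx_mtu, cluster_mtu, max_ping_payload)
```

Hence the 8772-byte ping in the test above: it is exactly the cluster MTU minus IP and ICMP headers.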


DNS Configuration for Migration


# CoreDNS custom configuration for migration
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-custom
  namespace: openshift-dns
data:
  legacy.server: |
    legacy.example.com:53 {
      forward . 10.0.0.53 10.0.0.54
      cache 30
    }

During migration, pods may need to resolve legacy DNS names. Configure forwarding rules before cutting over applications.


Storage: Persistent Volumes and CSI Reality

StorageClass Design


# Pure Storage FlashArray - Fast tier
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: pure-csi
parameters:
  backend: flasharray
  csi.storage.k8s.io/fstype: xfs
  createoptions: -q
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Pure Storage FlashBlade - Shared/NFS tier
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-shared
provisioner: pure-csi
parameters:
  backend: flashblade
  exportRules: "*(rw,no_root_squash)"
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
# Standard tier for non-critical workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  datastore: vsanDatastore
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

WaitForFirstConsumer is critical for block storage. Without it, volumes are bound before pod placement, breaking topology-aware scheduling.

What breaks if ignored:

  • Pods stuck in Pending state
  • Volumes attached to unreachable nodes
  • Zone-aware deployments fail silently

Stateful Application Migration


# Database migration pattern using PVC cloning
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-migrated
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: pure-fast
  resources:
    requests:
      storage: 500Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: db-data-legacy

Observability and Migration Validation

Baseline Metrics Before Migration

You cannot validate a migration without knowing what normal looks like. Capture baselines for at least two weeks before migration.


# Key metrics to baseline
# Application metrics
- request_duration_seconds (p50, p95, p99)
- request_total (rate)
- error_total (rate)
- active_connections

# Infrastructure metrics
- cpu_usage_percent
- memory_usage_bytes
- disk_io_seconds
- network_bytes_transmitted
- network_bytes_received

# Business metrics
- transactions_per_second
- successful_checkouts
- user_sessions_active

Prometheus Rules for Migration


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: migration-validation
  namespace: openshift-monitoring
spec:
  groups:
    - name: migration.rules
      rules:
        # Alert if latency increases more than 20% post-migration
        - alert: MigrationLatencyRegression
          expr: |
            (
              histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{migrated="true"}[5m])) by (le, service))
              /
              histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{migrated="false"}[5m])) by (le, service))
            ) > 1.2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Latency regression detected post-migration"
            description: "Service {{ $labels.service }} p95 latency increased by more than 20%"

        # Alert on error rate increase
        - alert: MigrationErrorRateIncrease
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", migrated="true"}[5m])) by (service)
              /
              sum(rate(http_requests_total{migrated="true"}[5m])) by (service)
            ) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate exceeded 1% post-migration"
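
The alert expressions reduce to simple ratios; the same checks in plain code make the thresholds explicit (values mirror the rules above):

```python
def latency_regressed(p95_migrated: float, p95_baseline: float,
                      threshold: float = 1.2) -> bool:
    """True when migrated p95 exceeds baseline by more than 20%."""
    return p95_migrated / p95_baseline > threshold

def error_rate_exceeded(errors_5xx: float, total: float,
                        limit: float = 0.01) -> bool:
    """True when 5xx responses exceed 1% of all requests."""
    return errors_5xx / total > limit

print(latency_regressed(0.40, 0.30))   # ~1.33x baseline: alert
print(latency_regressed(0.33, 0.30))   # ~1.1x baseline: within tolerance
print(error_rate_exceeded(15, 1000))   # 1.5% errors: alert
```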

Grafana Dashboard for Migration


# Dashboard JSON snippet for migration comparison
{
  "panels": [
    {
      "title": "Request Latency Comparison",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{env='legacy'}[5m])) by (le))",
          "legendFormat": "Legacy p95"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{env='openshift'}[5m])) by (le))",
          "legendFormat": "OpenShift p95"
        }
      ]
    },
    {
      "title": "Error Rate Comparison",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~'5..', env='openshift'}[5m])) / sum(rate(http_requests_total{env='openshift'}[5m]))",
          "legendFormat": "OpenShift Error Rate"
        }
      ]
    }
  ]
}

Log Aggregation for Troubleshooting


# Loki configuration for migration logs
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small
  storage:
    schemas:
      - version: v12
        effectiveDate: "2024-01-01"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: pure-fast
  tenants:
    mode: openshift-logging

CI/CD and GitOps: What Actually Works

Immutable Image Promotion

Do not rebuild images per environment. Build once, scan once, promote through environments.


# Tekton pipeline for build-once promotion
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-promote
spec:
  params:
    - name: git-revision
      type: string
  tasks:
    - name: build
      taskRef:
        name: buildah
      params:
        - name: IMAGE
          value: "registry.internal/app:$(params.git-revision)"

    - name: scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build

    - name: sign
      taskRef:
        name: cosign-sign
      runAfter:
        - scan

    - name: promote-to-dev
      taskRef:
        name: skopeo-copy
      runAfter:
        - sign
      params:
        - name: srcImage
          value: "registry.internal/app:$(params.git-revision)"
        - name: destImage
          value: "registry.internal/app:dev"

If you rebuild per environment:

  • Debugging becomes impossible (which build has the bug?)
  • Security attestation is meaningless
  • Promotion is not promotion, it is a new deployment
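
Build-once promotion is also verifiable: every environment tag of an application must resolve to the same image digest. A sketch of that invariant check (registry lookups are stubbed with a dict here; in practice skopeo inspect or the registry API supplies the digests):

```python
# Promotion invariant: dev, stage and prod tags must all point at one
# digest. The sha256 values below are fabricated examples.

def resolve_digests(tags: dict) -> set:
    """Stub: in reality each tag would be resolved via the registry."""
    return set(tags.values())

def promotion_is_clean(tags: dict) -> bool:
    # One unique digest across environments means the image was
    # promoted, not rebuilt.
    return len(resolve_digests(tags)) == 1

good = {"dev": "sha256:aaa", "stage": "sha256:aaa", "prod": "sha256:aaa"}
bad  = {"dev": "sha256:aaa", "stage": "sha256:bbb", "prod": "sha256:bbb"}

print(promotion_is_clean(good), promotion_is_clean(bad))
```

A check like this belongs in the pipeline itself, failing the promotion step whenever digests diverge.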

ArgoCD Application Example


apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-prod
  namespace: openshift-gitops
spec:
  project: production
  destination:
    namespace: app-prod
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/org/app-config
    targetRevision: main
    path: overlays/prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=false
      - PrunePropagationPolicy=foreground
      - PruneLast=true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Allow HPA to control replicas

Self-heal is not optional in regulated or audited environments. Manual drift is operational debt that compounds.

Environment Promotion with Kustomize


# Base kustomization
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

# Production overlay
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - patch: |
      - op: replace
        path: /spec/replicas
        value: 5
    target:
      kind: Deployment
      name: app
images:
  - name: app
    newName: registry.internal/app
    newTag: v1.2.3  # Pinned version, updated by CI

Rollback Strategy

Application-Level Rollback


# ArgoCD rollback to previous version
argocd app history app-prod
argocd app rollback app-prod <revision>

# Or using kubectl
kubectl rollout undo deployment/app -n app-prod
kubectl rollout status deployment/app -n app-prod

Traffic-Based Rollback


# OpenShift route for blue-green deployment
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: app
  namespace: app-prod
spec:
  to:
    kind: Service
    name: app-green
    weight: 100
  alternateBackends:
    - kind: Service
      name: app-blue
      weight: 0
---
# To rollback, shift traffic back to blue
# oc patch route app -p '{"spec":{"to":{"weight":0},"alternateBackends":[{"kind":"Service","name":"app-blue","weight":100}]}}'

Full Migration Rollback

For critical systems, maintain the ability to roll back the entire migration for a defined period.


# Rollback checklist
rollback_criteria:
  - error_rate > 5% for 15 minutes
  - p99_latency > 2x baseline for 30 minutes
  - data_integrity_check_failed
  - critical_integration_broken

rollback_procedure:
  - "1. Announce rollback decision"
  - "2. Stop writes to new system (if applicable)"
  - "3. Verify data sync to legacy is current"
  - "4. Switch DNS/load balancer to legacy"
  - "5. Verify legacy system health"
  - "6. Communicate rollback complete"
  - "7. Schedule post-mortem"

rollback_window: 14 days  # Maintain legacy systems for 2 weeks post-migration
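
Rollback criteria only work if they are evaluated mechanically rather than argued about mid-incident. A sketch that encodes the thresholds above (metric names and the simplified duration tracking are assumptions):

```python
# Encode the rollback criteria so the decision is a lookup, not a debate.
# Duration tracking is simplified to "minutes the condition has held".

def should_rollback(metrics: dict) -> list:
    """Return the list of tripped criteria; non-empty means roll back."""
    tripped = []
    if metrics["error_rate"] > 0.05 and metrics["error_minutes"] >= 15:
        tripped.append("error rate > 5% for 15 minutes")
    if (metrics["p99_latency"] > 2 * metrics["baseline_p99"]
            and metrics["latency_minutes"] >= 30):
        tripped.append("p99 latency > 2x baseline for 30 minutes")
    if not metrics["data_integrity_ok"]:
        tripped.append("data integrity check failed")
    return tripped

sample = {"error_rate": 0.08, "error_minutes": 20,
          "p99_latency": 0.9, "baseline_p99": 0.5,
          "latency_minutes": 10, "data_integrity_ok": True}
print(should_rollback(sample))
```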

Data Rollback Considerations


# Continuous data sync for rollback capability
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-sync-to-legacy
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sync
              image: registry.internal/data-sync:latest
              env:
                - name: SOURCE_DB
                  value: "postgresql://new-db:5432/app"
                - name: TARGET_DB
                  value: "postgresql://legacy-db:5432/app"
                - name: SYNC_MODE
                  value: "incremental"
          restartPolicy: OnFailure

Key principle: Never decommission legacy systems until the rollback window has passed and stakeholders have signed off.


Migration Execution: What People Underestimate

State and Cutover

Databases and stateful services require parallel runs and controlled traffic switching. DNS TTLs must be reduced days in advance, not minutes.

Most outages during migration are caused by:

  • Hidden hard-coded IPs in application configs, scripts, and cron jobs
  • Legacy authentication dependencies (service accounts with IP-based trust)
  • Assumed local storage paths that do not exist in containers
  • Timezone differences between legacy VMs and containers (UTC default)
  • Environment variables that were set manually and never documented

Communication Plan


# Migration communication template
stakeholders:
  - business_owners
  - development_teams
  - operations
  - security
  - support

communications:
  - timing: T-14 days
    message: "Migration scheduled, review runbook"
    audience: all

  - timing: T-2 days
    message: "DNS TTL reduced, final validation"
    audience: operations, development

  - timing: T-0 (cutover)
    message: "Migration in progress, reduced SLA"
    audience: all

  - timing: T+1 hour
    message: "Initial validation complete"
    audience: all

  - timing: T+24 hours
    message: "Migration successful, monitoring continues"
    audience: all

Operational Testing (Non-Negotiable)

Before production:

  • Kill a control plane node and verify automatic recovery
  • Force etcd leader re-election during load
  • Simulate storage controller failure
  • Drain workers during peak load
  • Test certificate rotation
  • Verify backup and restore procedures
  • Run security scan and penetration test

# Chaos testing example with Litmus
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: control-plane-chaos
  namespace: litmus
spec:
  engineState: active
  appinfo:
    appns: openshift-etcd
    applabel: app=etcd
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"

If the platform team is afraid to do this, the cluster is not ready.


In short…

OpenShift migration is not a technology project. It is an operational transformation that happens to involve technology.

The patterns in this post exist because I have seen the alternatives fail. Every shortcut, whether skipping reservations, ignoring kernel tuning, or compressing testing phases, creates debt that surfaces as production incidents.

Key principles:

  • Infrastructure must be deterministic before OpenShift installation
  • Security is architecture, not an afterthought
  • Migration strategy matters more than migration speed
  • Observability validates success; without baselines, you are guessing
  • Rollback capability is not optional for production systems
  • Test failure modes before they test you

The goal is not to move workloads. The goal is to move workloads without moving your problems with them and without creating new ones.

How to Deploy Kubernetes on AWS the Scalable Way

Kubernetes has become the de facto standard for orchestrating containerized workloads—but deploying it correctly on AWS requires more than just spinning up an EKS cluster. You need to think about scalability, cost-efficiency, security, and high availability from day one.

In this guide, we’ll walk you through how to deploy a scalable, production-grade Kubernetes environment on AWS—step by step.

Why Kubernetes on AWS?

Amazon Web Services offers powerful tools to run Kubernetes at scale, including:

  • Amazon EKS – Fully managed control plane
  • EC2 Auto Scaling Groups – Dynamic compute scaling
  • Elastic Load Balancer (ELB) – Handles incoming traffic
  • IAM Roles for Service Accounts – Fine-grained access control
  • Fargate (Optional) – Run pods without managing servers

Step-by-Step Deployment Plan

1. Plan the Architecture

Your Kubernetes architecture should be:

  • Highly Available (Multi-AZ)
  • Scalable (Auto-scaling groups)
  • Secure (Private networking, IAM roles)
  • Observable (Monitoring, logging)

+---------------------+
|   Route 53 / ALB    |
+----------+----------+
           |
   +-------v--------+
   |  EKS Control   |
   |  Plane         |  <- Managed by AWS
   +-------+--------+
           |
+----------v-----------+
|   EC2 Worker Nodes   |  <- Auto-scaling
|  (in Private Subnet) |
+----------+-----------+
           |
   +-------v--------+
   |  Kubernetes    |
   |  Workloads     |
   +----------------+

2. Provision Infrastructure with IaC (Terraform)

Use Terraform to define your VPC, subnets, security groups, and EKS cluster:

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "my-cluster"
  cluster_version = "1.29"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id
  manage_aws_auth = true

  node_groups = {
    default = {
      desired_capacity = 3
      max_capacity     = 6
      min_capacity     = 1
      instance_type    = "t3.medium"
    }
  }
}

Security Tip: Keep worker nodes in private subnets and expose only your load balancer to the public internet.

3. Set Up Cluster Autoscaler

Install the Kubernetes Cluster Autoscaler to automatically scale your EC2 nodes:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Ensure the autoscaler has IAM permissions via IRSA (IAM Roles for Service Accounts).

4. Use Horizontal Pod Autoscaler

Use HPA to scale pods based on resource usage:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

5. Implement CI/CD Pipelines

Use tools like Argo CD, Flux, or GitHub Actions:

# Example GitHub Actions steps (the role secret name is illustrative)
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.DEPLOY_ROLE_ARN }}
    aws-region: us-east-1

- name: Deploy to EKS
  run: |
    aws eks update-kubeconfig --name my-cluster
    kubectl apply -f k8s/

6. Set Up Observability

Install:

  • Prometheus + Grafana for metrics
  • Fluent Bit or Loki for logging
  • Kube-State-Metrics for cluster state
  • AWS CloudTrail and GuardDuty for security monitoring

7. Optimize Costs

  • Use Spot Instances with on-demand fallback
  • Use EC2 Mixed Instance Policies
  • Try Graviton (ARM) nodes for better cost-performance ratio

Bonus: Fargate Profiles for Microservices

For small or bursty workloads, use AWS Fargate to run pods serverlessly:

eksctl create fargateprofile \
  --cluster my-cluster \
  --name fp-default \
  --namespace default

Recap Checklist

  • Multi-AZ VPC with private subnets
  • Terraform-managed EKS cluster
  • Cluster and pod auto-scaling enabled
  • CI/CD pipeline in place
  • Observability stack (metrics/logs/security)
  • Spot instances or Fargate to save costs

Deploying Kubernetes on AWS at scale doesn’t have to be complex—but it does need a solid foundation. Use managed services where possible, automate everything, and focus on observability and security from the start.

If you’re looking for a production-grade, scalable deployment, Terraform + EKS + autoscaling is your winning combo.

Fixing Read-Only Mode on eLux Thin Clients


If your eLux device boots into a read-only filesystem or prevents saving changes, it’s usually due to the write filter or system protection settings. Here’s how to identify and fix the issue.

Common Causes

  • Write Filter is enabled (RAM overlay by default)
  • System partition is locked as part of image protection
  • Corrupted overlay from improper shutdown

Fix 1: Temporarily Remount as Read/Write

sudo mount -o remount,rw /

This allows you to make temporary changes. They will be lost after reboot unless you adjust the image or profile settings.

Fix 2: Enable Persistent Mode via the EIS Tool

  1. Open your image project in the EIS Tool
  2. Go to the Settings tab
  3. Locate the write filter or storage persistence section
  4. Set it to Persistent Storage
  5. Export the updated image and redeploy

Fix 3: Enable Persistence via Scout Configuration Profile

  1. Open Scout Enterprise Console
  2. Go to Configuration > Profiles
  3. Edit the assigned profile
  4. Enable options like:
    • Persistent user data
    • Persistent certificate storage
    • Persistent logging
  5. Save and reassign the profile

Fix 4: Reimage the Device

  • If the system is damaged or stuck in read-only permanently, use a USB stick or PXE deployment to reflash the device.
  • Ensure the new image has persistence enabled in the EIS Tool before deploying.

Check Filesystem Mount Status

mount | grep ' / '

If you see (ro) in the output, the system is in read-only mode.
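That check can be scripted for fleet-wide diagnostics. The helper below (name and approach are illustrative) extracts the first mount flag from a mount(8)-style line so you can alert on read-only roots programmatically:

```shell
#!/bin/sh
# mount_mode LINE: print the first mount flag (ro or rw) from mount(8) output
mount_mode() {
  printf '%s\n' "$1" | sed -n 's/.*(\([^,)]*\).*/\1/p'
}

# Demo on a sample line; on a live device use: mount_mode "$(mount | grep ' / ')"
mount_mode "/dev/sda1 on / type ext4 (ro,relatime)"
```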

Final Notes

  • eLux protects system partitions by design — use Scout and EIS Tool to make lasting changes
  • Remounting manually is fine for diagnostics but not a long-term fix
  • Always test changes on a test device before rolling out to production

Elux Image Deployment

How to Create and Deploy a Custom eLux Image at Scale

This guide is intended for Linux/VDI system administrators managing eLux thin clients across enterprise environments. It covers:

  • Part 1: Creating a fresh, customized eLux image
  • Part 2: Deploying the image at scale using Scout Enterprise

Part 1: Creating a Custom eLux Image with Tailored Settings

Step 1: Download Required Files

  1. Go to https://www.myelux.com and log in.
  2. Download the following:
    • Base OS image (e.g., elux-RP6-base.ufi)
    • Module files (.ulc) – Citrix, VMware, Firefox, etc.
    • EIS Tool (eLux Image Stick Tool) for your admin OS

Step 2: Install and Open the EIS Tool

  1. Install the EIS Tool on a Windows or Linux system.
  2. Launch the tool and click New Project.
  3. Select the downloaded .ufi base image.
  4. Name your project (e.g., elux-custom-v1) and confirm.

Step 3: Add or Remove Modules

  1. Go to the Modules tab inside the EIS Tool.
  2. Click Add and import the required .ulc files.
  3. Deselect any modules you don’t need.
  4. Click Apply to save module selections.

Step 4: Modify System Settings (Optional)

  • Set default screen resolution
  • Enable or disable write protection
  • Choose RAM overlay or persistent storage
  • Enable shell access if needed for support
  • Disable unneeded services

Step 5: Export the Image

  • To USB stick:
    Click "Write to USB Stick"
    Select your USB target drive
  • To file for network deployment:
    Click "Export Image"
    Save your customized .ufi (e.g., elux-custom-v1.ufi)

Part 2: Deploying the Custom Image at Scale Using Scout Enterprise

Step 1: Import the Image into Scout

  1. Open Scout Enterprise Console
  2. Navigate to Repository > Images
  3. Right-click → Import Image
  4. Select the .ufi file created earlier

Step 2: Create and Configure a Profile

  1. Go to Configuration > Profiles
  2. Click New Profile
  3. Configure network, session, and UI settings
  4. Save and name the profile (e.g., Citrix-Kiosk-Profile)

Step 3: Assign Image and Profile to Devices or Groups

  1. Navigate to Devices or Groups
  2. Right-click → Assign OS Image
  3. Select your custom .ufi
  4. Right-click → Assign Profile
  5. Select your configuration profile

Step 4: Deploy the Image

Option A: PXE Network Deployment

  • Enable PXE boot on client devices (via BIOS)
  • Ensure PXE services are running (Scout or custom)
  • On reboot, clients auto-deploy image and config

Option B: USB Stick Installation

  • Boot client device from prepared USB stick
  • Follow on-screen instructions to install
  • Device registers and pulls config from Scout

Step 5: Monitor Deployment

  • Use Logs > Job Queue to track installations
  • Search for devices to confirm version and status

Optional Commands

Inspect or Write Images

# Mount .ufi image (read-only)
sudo mount -o loop elux-custom.ufi /mnt/elux

# Write image to USB on Linux
sudo dd if=elux-custom.ufi of=/dev/sdX bs=4M status=progress
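Before writing, it is worth verifying the exported image against a recorded checksum so a truncated copy is not flashed onto dozens of sticks. The helper below is a generic sketch (file names are illustrative); record the sums once when you export from the EIS Tool:

```shell
#!/bin/sh
# verify_image DIR SUMFILE: check sha256 sums recorded in SUMFILE from inside DIR
verify_image() {
  (cd "$1" && sha256sum -c "$2")
}

# Demo with a throwaway file standing in for the exported .ufi
d=$(mktemp -d)
printf 'image-bytes' > "$d/elux-custom.ufi"
(cd "$d" && sha256sum elux-custom.ufi > sums.txt)
verify_image "$d" sums.txt
rm -rf "$d"
```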

Manual PXE Server Setup (Linux)

sudo apt install tftpd-hpa dnsmasq

# Example dnsmasq.conf
port=0
interface=eth0
dhcp-range=192.168.1.100,192.168.1.200,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp

sudo systemctl restart tftpd-hpa
sudo systemctl restart dnsmasq
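A common PXE failure mode is a missing bootloader in the TFTP root. A quick pre-reboot sanity check (paths taken from the dnsmasq example above, helper name illustrative) looks like:

```shell
#!/bin/sh
# check_pxe_root DIR: confirm the bootloader named by dhcp-boot exists in the TFTP root
check_pxe_root() {
  if [ -f "$1/pxelinux.0" ]; then echo ok; else echo "missing pxelinux.0"; fi
}

# Demo against a throwaway tree; on the real server run: check_pxe_root /srv/tftp
d=$(mktemp -d)
touch "$d/pxelinux.0"
check_pxe_root "$d"
rm -rf "$d"
```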

Commands on eLux Device Shell

# Switch to shell (Ctrl+Alt+F1), then:
uname -a
df -h
scout showconfig
scout pullconfig

Summary

Task Tool
Build custom image EIS Tool
Add/remove software modules .ulc files + EIS Tool
Customize settings EIS Tool + Scout Profile
Deploy to all clients PXE boot or USB + Scout
Manage and monitor at scale Scout Enterprise Console

How to Deploy Lustre with ZFS Backend (RDMA, ACLs, Nodemaps, Clients)

This step-by-step guide walks you through deploying a production-ready Lustre filesystem backed by ZFS, including RDMA networking, MDT/OST setup, nodemaps, ACL configuration, and client mounting. This guide assumes:

  • MGS + MDS on one node
  • One or more OSS nodes
  • Clients mounting over RDMA (o2ib)
  • ZFS as the backend filesystem

0. Architecture & Assumptions

  • Filesystem name: lustrefs
  • MGS/MDS RDMA IP: 172.16.0.10
  • OSS RDMA IP: 172.16.0.20
  • Client RDMA IP: 172.16.0.30
  • RDMA interface: ib0
  • Network type: o2ib

1. Manager Server Setup (MGS + MDS with ZFS)

1.1 Install ZFS and Lustre MDS packages

sudo apt update
sudo apt install -y zfsutils-linux
sudo apt install -y lustre-osd-zfs-mount lustre-utils

1.2 Create a ZFS pool for MDT

sudo zpool create -o ashift=12 mdtpool mirror /dev/nvme0n1 /dev/nvme1n1
sudo zfs create -o recordsize=4K -o primarycache=metadata mdtpool/mdt0

1.3 Format MDT & enable MGS

sudo mkfs.lustre \
  --fsname=lustrefs \
  --mgs \
  --mdt \
  --index=0 \
  --backfstype=zfs mdtpool/mdt0

1.4 Mount MDT

sudo mkdir -p /mnt/mdt0
sudo mount -t lustre mdtpool/mdt0 /mnt/mdt0

2. RDMA + LNET Configuration (All Nodes)

2.1 Install RDMA core utilities

sudo apt install -y rdma-core

2.2 Bring up the RDMA interface

sudo ip addr add 172.16.0.10/24 dev ib0
sudo ip link set ib0 up

2.3 Configure LNET to use o2ib

Create /etc/modprobe.d/lustre.conf:

options lnet networks="o2ib(ib0)"

Load and enable LNET

sudo modprobe lnet
sudo systemctl enable lnet
sudo systemctl start lnet
sudo lctl list_nids

3. OFED / Mellanox Optional Performance Tuning

These settings are optional but recommended for high-performance Lustre deployments using Mellanox or OFED-based InfiniBand hardware.

3.1 Relevant config locations

  • /etc/infiniband/*
  • /etc/modprobe.d/mlx5.conf
  • /etc/security/limits.d/rdma.conf
  • /etc/sysctl.conf (MTU, hugepages, buffers)
  • /etc/rdma/modules/

3.2 Increase RDMA MTU (InfiniBand)

sudo ip link set ib0 mtu 65520

3.3 Increase RDMA network buffers

echo 262144 | sudo tee /proc/sys/net/core/rmem_max
echo 262144 | sudo tee /proc/sys/net/core/wmem_max

These settings improve performance when using high-speed links (56Gb, 100Gb, HDR100, etc.).
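Values written under /proc/sys are lost on reboot. To make them persistent, the same settings can go into a sysctl drop-in (the file name here is a conventional choice, not from the original setup):

```
# /etc/sysctl.d/90-rdma.conf
net.core.rmem_max = 262144
net.core.wmem_max = 262144
```

Apply it without rebooting via `sudo sysctl --system`.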


4. OSS Node Setup (ZFS + OSTs)

4.1 Install ZFS + Lustre OSS components

sudo apt update
sudo apt install -y zfsutils-linux lustre-osd-zfs-mount lustre-utils

4.2 Create an OST ZFS pool

sudo zpool create -o ashift=12 ostpool raidz2 \
    /dev/sdc /dev/sdd /dev/sde /dev/sdf

sudo zfs create -o recordsize=1M ostpool/ost0

4.3 Format OST using RDMA to MGS

sudo mkfs.lustre \
  --fsname=lustrefs \
  --ost \
  --index=0 \
  --mgsnode=172.16.0.10@o2ib \
  --backfstype=zfs ostpool/ost0

4.4 Mount OST

sudo mkdir -p /mnt/ost0
sudo mount -t lustre ostpool/ost0 /mnt/ost0

5. Client Node Setup

5.1 Install Lustre client packages

sudo apt update
sudo apt install -y lustre-client-modules-$(uname -r) lustre-utils

If the MGS, OSS, and client are all configured correctly, the filesystem will mount cleanly from the client (see section 9 below).

5.2 Configure RDMA + LNET (same as above)

sudo ip addr add 172.16.0.30/24 dev ib0
sudo ip link set ib0 up

echo 'options lnet networks="o2ib(ib0)"' | sudo tee /etc/modprobe.d/lustre.conf

sudo modprobe lnet
sudo systemctl start lnet
sudo lctl list_nids

6. How to Get Lustre Target Names

List OSTs

lfs osts

List MDTs

lfs mdts

List all targets and connections

lctl dl

Check space and OST availability

lfs df -h

7. Nodemap Configuration (Access Control)

7.1 Activate nodemaps and configure the default nodemap

The default nodemap exists automatically; activate the feature and allow identity (trusted) mapping:

sudo lctl nodemap_activate 1
sudo lctl nodemap_modify --name default --property trusted --value 1
sudo lctl nodemap_modify --name default --property admin --value 1

7.2 Restrict a nodemap to the RDMA subnet

NID ranges attach to named nodemaps (the default nodemap catches anything unmatched), so create one for the subnet:

sudo lctl nodemap_add rdma_clients
sudo lctl nodemap_add_range --name rdma_clients --range '172.16.0.[0-255]@o2ib'

7.3 Make that subnet read-only (optional, Lustre 2.13+)

sudo lctl nodemap_modify --name rdma_clients --property readonly_mount --value 1

8. ACL Configuration (ZFS + Lustre)

8.1 Enable ACL support in ZFS (MDT)

sudo zfs set acltype=posixacl mdtpool/mdt0
sudo zfs set xattr=sa mdtpool/mdt0
sudo zfs set compression=off mdtpool/mdt0

8.2 Enable ACLs in Lustre

sudo lctl set_param mdt.*.enable_acls=1

8.3 Use ACLs from clients

sudo setfacl -m u:alice:rwx /mnt/lustre/data
getfacl /mnt/lustre/data

9. Mounting Lustre on Clients (Over RDMA)

9.1 Mount command

sudo mkdir -p /mnt/lustre

sudo mount -t lustre \
  172.16.0.10@o2ib:/lustrefs \
  /mnt/lustre

Example without an InfiniBand network (plain TCP):
[root@vbox ~]# mount -t lustre 192.168.50.5@tcp:/lustre /mnt/lustre-client
[root@vbox ~]# 
[root@vbox ~]# # Verify the mount worked
[root@vbox ~]# df -h /mnt/lustre-client
Filesystem                Size  Used Avail Use% Mounted on
192.168.50.5@tcp:/lustre   12G  2.5M   11G   1% /mnt/lustre-client
[root@vbox ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         4.5G        1.9M        4.1G   1% /mnt/lustre-client[MDT:0]
lustre-OST0000_UUID         7.5G        1.2M        7.0G   1% /mnt/lustre-client[OST:0]
lustre-OST0001_UUID         3.9G        1.2M        3.7G   1% /mnt/lustre-client[OST:1]
filesystem_summary:        11.4G        2.4M       10.7G   1% /mnt/lustre-client

9.2 Verify the mount

df -h /mnt/lustre
lfs df -h

9.3 Persistent fstab entry

172.16.0.10@o2ib:/lustrefs  /mnt/lustre  lustre  _netdev,defaults  0 0

10. Summary of the Correct Order

  1. Install ZFS + Lustre on MGS/MDS
  2. Create MDT ZFS dataset & format MDT+MGS
  3. Configure RDMA + LNET
  4. Apply optional OFED/Mellanox tuning
  5. Install ZFS + Lustre on OSS, create OSTs
  6. Format and mount OSTs
  7. Install Lustre client packages
  8. Mount client via RDMA
  9. Retrieve target names (OST/MDT)
  10. Configure nodemaps
  11. Configure ACLs

Final Notes

You now have a complete ZFS-backed Lustre filesystem with RDMA transport, OFED/Mellanox tunings, ACLs, and nodemaps. This layout delivers high-grade parallel filesystem performance and clean scalability.

Note: I have also created an Ansible playbook that can deploy this across clients and test everything. It is not currently a public repo; email me at support@nicktailor.com if you would like to hire me to set it up.

├── inventory/
│   └── hosts.yml              # Inventory file with host definitions
├── group_vars/
│   └── all.yml                # Global variables
├── roles/
│   ├── infiniband/            # InfiniBand/RDMA setup
│   ├── zfs/                   # ZFS installation and configuration
│   ├── lustre_mgs_mds/        # MGS/MDS server setup
│   ├── lustre_oss/            # OSS server setup
│   ├── lustre_client/         # Client setup
│   ├── lustre_nodemaps/       # Nodemap configuration
│   └── lustre_acls/           # ACL configuration
├── site.yml                   # Main deployment playbook
├── test_connectivity.yml      # Connectivity testing playbook
└── README.md                  # This file

Key Components for Setting Up an HPC Cluster

Head Node (Controller)

 Manages job scheduling and resource allocation.
 Runs Slurm Controller Daemon (`slurmctld`).

Compute Nodes

 Execute computational tasks.
 Run Slurm Node Daemon (`slurmd`).
 Configured for CPUs, GPUs, or specialized hardware.

Networking

 High-speed interconnect like Infiniband or Ethernet.
 Ensures fast communication between nodes.

Storage

 Centralized storage like NFS, Lustre, or BeeGFS.
 Provides shared file access for all nodes.

Authentication

 Use Munge for secure communication between Slurm components.

Scheduler

 Slurm for job scheduling and resource management.
 Configured with partitions and node definitions.

Resource Management

 Use cgroups to control CPU, memory, and GPU usage.
 Optional: ProctrackType=cgroup in Slurm.

Parallel File System (Optional)

 High-performance shared storage for parallel workloads.
 Examples: Lustre, GPFS.

Interconnect Libraries

 MPI (Message Passing Interface) for distributed computing.
 Install libraries like OpenMPI or MPICH.

Monitoring and Debugging Tools

 Tools like Prometheus, Grafana, or Ganglia for resource monitoring.
 Enable verbose logging in Slurm for debugging.

How to configure Slurm Controller Node on Ubuntu 22.04

How to setup HPC-Slurm Controller Node

Refer to Key Components for HPC Cluster Setup; for which pieces you need to setup.

This guide provides step-by-step instructions for setting up the Slurm controller daemon (`slurmctld`) on Ubuntu 22.04. It also includes common errors encountered during the setup process and how to resolve them.

Step 1: Install Prerequisites

To begin, install the required dependencies for Slurm and its components:

sudo apt update && sudo apt upgrade -y
sudo apt install -y munge libmunge-dev libmunge2 build-essential man-db mariadb-server mariadb-client libmariadb-dev python3 python3-pip chrony

Step 2: Configure Munge (Authentication for slurm)

Munge is required for authentication within the Slurm cluster.

1. Generate a Munge key on the controller node:
sudo create-munge-key

2. Copy the key to all compute nodes:
scp /etc/munge/munge.key user@node:/etc/munge/

3. Start the Munge service:
sudo systemctl enable --now munge

Step 3: Install Slurm

1. Download and compile Slurm:
wget https://download.schedmd.com/slurm/slurm-23.02.4.tar.bz2
tar -xvjf slurm-23.02.4.tar.bz2
cd slurm-23.02.4
./configure --prefix=/usr/local/slurm --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install

2. Add the Slurm user:
sudo useradd -m slurm

3. Create the necessary directories and set permissions:
sudo mkdir -p /etc/slurm /var/spool/slurm /var/log/slurm
sudo chown slurm: /var/spool/slurm /var/log/slurm

Step 4: Configure Slurm (for more complex configs, contact Nick Tailor)

1. Generate a basic `slurm.conf` using the configurator tool at
https://slurm.schedmd.com/configurator.html. Save the configuration to `/etc/slurm/slurm.conf`.

# Basic Slurm Configuration
ClusterName=my_cluster
ControlMachine=slurmctld             # Replace with your control node's hostname
# BackupController=backup-slurmctld  # Uncomment and replace if you have a backup controller

# Authentication
AuthType=auth/munge
CryptoType=crypto/munge

# Logging
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=info
SlurmdDebug=info

# Slurm User
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd

# Scheduler
SchedulerType=sched/backfill
SchedulerParameters=bf_continue

# Accounting
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/linux

# Compute Nodes
NodeName=node[1-2] CPUs=4 RealMemory=8192 State=UNKNOWN
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

2. Distribute `slurm.conf` to all compute nodes:
scp /etc/slurm/slurm.conf user@node:/etc/slurm/

3. Restart Slurm services:
sudo systemctl restart slurmctld
sudo systemctl restart slurmd

Troubleshooting Common Errors


root@slrmcltd:~# tail /var/log/slurm/slurmctld.log
[2024-12-06T11:57:25.428] error: High latency for 1000 calls to gettimeofday(): 20012 microseconds
[2024-12-06T11:57:25.431] fatal: mkdir(/var/spool/slurm): Permission denied
[2024-12-06T11:58:34.862] error: High latency for 1000 calls to gettimeofday(): 20029 microseconds
[2024-12-06T11:58:34.864] fatal: mkdir(/var/spool/slurm): Permission denied
[2024-12-06T11:59:38.843] error: High latency for 1000 calls to gettimeofday(): 18842 microseconds
[2024-12-06T11:59:38.847] fatal: mkdir(/var/spool/slurm): Permission denied

Error: Permission Denied for /var/spool/slurm

This error occurs when the `slurm` user does not have the correct permissions to access the directory.

Fix:
sudo mkdir -p /var/spool/slurm
sudo chown -R slurm: /var/spool/slurm
sudo chmod -R 755 /var/spool/slurm
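Since the fatal mkdir error repeats on every restart, it is worth asserting that the ownership fix actually took. A small check helper (illustrative, not part of Slurm) can be dropped into a pre-start script:

```shell
#!/bin/sh
# verify_spool DIR USER: confirm DIR exists and is owned by USER
verify_spool() {
  if [ -d "$1" ] && [ "$(stat -c %U "$1")" = "$2" ]; then
    echo ok
  else
    echo "bad spool dir"
  fi
}

# Demo against a throwaway dir owned by the current user;
# on the controller run: verify_spool /var/spool/slurm slurm
d=$(mktemp -d)
verify_spool "$d" "$(id -un)"
rm -rf "$d"
```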

Error: Temporary Failure in Name Resolution

Slurm could not resolve the hostname `slurmctld`. This can be fixed by updating `/etc/hosts`:

1. Edit `/etc/hosts` and add the following:
127.0.0.1     slurmctld
192.168.20.8  slurmctld

2. Verify the hostname matches `ControlMachine` in `/etc/slurm/slurm.conf`.

3. Restart networking and test hostname resolution:
sudo systemctl restart systemd-networkd
ping slurmctld

Error: High Latency for gettimeofday()

Dec 06 11:57:25 slrmcltd.home systemd[1]: Started Slurm controller daemon.
Dec 06 11:57:25 slrmcltd.home slurmctld[2619]: slurmctld: error: High latency for 1000 calls to gettimeofday(): 20012 microseconds
Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 06 11:57:25 slrmcltd.home systemd[1]: slurmctld.service: Failed with result 'exit-code'.

This warning typically indicates timing issues in the system.

Fixes:
1. Install and configure `chrony` for time synchronization:
sudo apt install chrony
sudo systemctl enable --now chrony
chronyc tracking
timedatectl
2. For virtualized environments, optimize the clocksource:
echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource

3. Disable high-precision timing in `slurm.conf` (optional):
HighPrecisionTimer=NO
sudo systemctl restart slurmctld

Step 5: Verify and Test the Setup

1. Validate the configuration:
scontrol reconfigure
No errors means it is working. If this fails, check the connectivity between nodes and
update /etc/hosts so that all controller and compute hosts are listed on every machine.

2. Check node and partition status:
sinfo

root@slrmcltd:/etc/slurm# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug*    up    infinite  1     idle* node1

3. Monitor logs for errors:
sudo tail -f /var/log/slurm/slurmctld.log


Written By: Nick Tailor

Kubernetes Cheat Sheet

kubectl Context and Configuration

Manage which Kubernetes cluster kubectl communicates with, and configure authentication and namespace defaults.

kubectl config view                               # View merged kubeconfig

# Use multiple kubeconfig files simultaneously
export KUBECONFIG=~/.kube/config:~/.kube/kubconfig2
kubectl config view

# Extract a specific user's password
kubectl config view -o jsonpath='{.users[?(@.name == "e2e")].user.password}'

# List users
kubectl config view -o jsonpath='{.users[*].name}'

# Context management
kubectl config get-contexts                        # List contexts
kubectl config current-context                     # Show active context
kubectl config use-context my-cluster              # Switch context

# Add a cluster entry
kubectl config set-cluster my-cluster

# Set proxy URL for cluster entry
kubectl config set-cluster my-cluster --proxy-url=my-proxy-url

# Add a user with basic authentication
kubectl config set-credentials kubeuser/foo.kubernetes.com \
  --username=kubeuser --password=kubepassword

# Set default namespace for current context
kubectl config set-context --current --namespace=production

# Set a new context with specific namespace and user
kubectl config set-context gce --user=cluster-admin --namespace=foo \
  && kubectl config use-context gce

# Delete a user
kubectl config unset users.foo

Helpful aliases:

# Quickly switch or show context
alias kx='f() { [ "$1" ] && kubectl config use-context $1 || kubectl config current-context ; } ; f'

# Quickly switch or show namespace
alias kn='f() { [ "$1" ] && kubectl config set-context --current --namespace $1 \
  || kubectl config view --minify | grep namespace | cut -d" " -f6 ; } ; f'

kubectl apply (Declarative Management)

kubectl apply is the recommended method for managing resources in production. It creates or updates resources by applying a desired state.

kubectl apply -f ./app.yaml                         # Apply single file
kubectl apply -f ./manifests/                       # Apply directory
kubectl apply -f https://example.com/app.yaml       # Apply from URL

kubectl create deployment nginx --image=nginx       # Quick one-shot deployment

Create multiple manifests via stdin:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-one
spec:
  containers:
  - name: c
    image: busybox
    args: ["sleep", "1000"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-two
spec:
  containers:
  - name: c
    image: busybox
    args: ["sleep", "2000"]
EOF

Create a secret:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  username: $(echo -n "jane" | base64 -w0)
  password: $(echo -n "s33msi4" | base64 -w0)
EOF

Viewing and Finding Resources

kubectl get pods                                   # Pods in namespace
kubectl get pods -A                                # All namespaces
kubectl get pods -o wide                           # Pod node placement
kubectl get deployments                            # Deployments
kubectl get svc                                     # Services
kubectl describe pod my-pod                         # Detailed pod info
kubectl describe node my-node                       # Node details

Sorting:

kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'
kubectl get pv --sort-by=.spec.capacity.storage

Field and label selectors:

kubectl get pods --field-selector=status.phase=Running
kubectl get pods -l app=web
kubectl get nodes --selector='!node-role.kubernetes.io/control-plane'

Retrieve specific fields:

kubectl get configmap myconfig -o jsonpath='{.data.ca\.crt}'
kubectl get secret my-secret -o jsonpath='{.data.username}' | base64 --decode
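Keep in mind that Secret data is base64-encoded, not encrypted; the decode step above is a plain round-trip, as this standalone demo shows:

```shell
#!/bin/sh
# base64 round-trip: what the Secret stores vs. what you read back
enc=$(printf '%s' 's33msi4' | base64)
echo "$enc"                                   # encoded form stored in the Secret
printf '%s' "$enc" | base64 --decode; echo    # original value
```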

Updating Resources and Rolling Updates

kubectl set image deployment/web web=nginx:1.25          # Update image
kubectl rollout history deployment/web                    # View history
kubectl rollout undo deployment/web                       # Roll back
kubectl rollout restart deployment/web                    # Rolling restart
kubectl rollout status deployment/web                     # Watch rollout

Patching Resources

kubectl patch node node1 -p '{"spec": {"unschedulable": true}}'

# Strategic merge patch
kubectl patch pod app-pod -p '{
  "spec": {"containers":[{"name":"app","image":"new-image"}]}
}'

# JSON patch
kubectl patch pod app-pod --type=json -p='[
  {"op":"replace","path":"/spec/containers/0/image","value":"new-image"}
]'

Editing Resources

kubectl edit svc/web-service
KUBE_EDITOR="nano" kubectl edit deployment/web

Common Service fields to change while editing include the type (ClusterIP, NodePort, LoadBalancer, ExternalName) and the port mapping (port, targetPort, nodePort, protocol).

Scaling Resources

kubectl scale deployment/web --replicas=5
kubectl scale -f deployment.yaml --replicas=4

Deleting Resources

kubectl delete -f ./app.yaml
kubectl delete pod my-pod --now
kubectl delete pods,svc -l app=web
kubectl delete pod,svc --all -n test

Interacting With Running Pods

kubectl logs my-pod
kubectl logs -f my-pod
kubectl exec my-pod -- ls /
kubectl exec -it my-pod -- sh
kubectl port-forward svc/web 8080:80

Copying Files to and from Containers

kubectl cp /tmp/localfile my-pod:/tmp/remote
kubectl cp my-pod:/tmp/remote /tmp/localfile

Advanced (using tar):

tar cf - . | kubectl exec -i my-pod -- tar xf - -C /tmp

Interacting With Nodes and Cluster

kubectl cordon node1
kubectl drain node1
kubectl uncordon node1

kubectl top node
kubectl top pod

kubectl cluster-info
kubectl cluster-info dump

Discovering API Resources

kubectl api-resources
kubectl api-resources --namespaced=true
kubectl api-resources -o wide
kubectl api-resources --verbs=list,get

Kubectl Output Formatting

kubectl get pods -o json
kubectl get pods -o yaml
kubectl get pods -o wide
kubectl get pods -o name
kubectl get pods -o jsonpath='{.items[*].metadata.name}'

Custom columns:

kubectl get pods -A -o=custom-columns='IMAGE:spec.containers[*].image'

Kubectl Verbosity and Debugging

  • --v=0 Minimal logs
  • --v=2 Recommended default
  • --v=4 Debug level
  • --v=6+ Full HTTP request inspection

Production-Ready Deployment YAML (Corrected)

Below is a cleaned-up, production-ready Deployment YAML based on the earlier examples.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: my-namespace
  labels:
    app: nginx
spec:
  replicas: 3
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "300m"
              memory: "256Mi"
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 20
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false

Conclusion

This Kubernetes cheat sheet provides a complete quick-reference for daily cluster operations, including context switching, applying manifests, rolling updates, patching, scaling, and debugging. With the included production-ready Deployment YAML and working examples, you can confidently operate Kubernetes clusters and deploy applications using the recommended declarative approach.

Deploying Lustre File System with RDMA, Node Maps, and ACLs

Lustre is the de facto parallel file system for high-performance computing (HPC) clusters, providing extreme scalability, high throughput, and low-latency access across thousands of nodes. This guide walks through a complete deployment of Lustre using RDMA over InfiniBand for performance, along with Node Maps for client access control and ACLs for fine-grained permissions.


1. Understanding the Lustre Architecture

Lustre separates metadata and data services into distinct roles:

  • MGS (Management Server) – Manages Lustre configuration and coordinates cluster services.
  • MDT (Metadata Target) – Stores file system metadata (names, permissions, directories).
  • OST (Object Storage Target) – Stores file data blocks.
  • Clients – Mount and access the Lustre file system for I/O.

The typical architecture looks like this:

+-------------+        +-------------+
|   Client 1  |        |   Client 2  |
| /mnt/lustre |        | /mnt/lustre |
+------+------+        +------+------+
       |                        |
       +--------o2ib RDMA-------+
                |
        +-------+-------+
        |     OSS/OST    |
        |   (Data I/O)   |
        +-------+-------+
                |
        +-------+-------+
        |     MGS/MDT    |
        |  (Metadata)    |
        +---------------+

2. Prerequisites and Environment

Component        Requirements
OS               RHEL / Rocky / AlmaLinux 8.x or higher
Kernel           Built with Lustre and OFED RDMA modules
Network          InfiniBand fabric (Mellanox or compatible)
Lustre version   2.14 or later
Devices          Separate block devices for MDT, OST(s), and client mount

3. Install Lustre Packages

On MGS, MDT, and OSS Nodes:

dnf install -y lustre kmod-lustre lustre-osd-ldiskfs

On Client Nodes:

dnf install -y lustre-client kmod-lustre-client

4. Configure InfiniBand and RDMA (o2ib)

InfiniBand provides the lowest latency for Lustre communication via RDMA. Configure the o2ib network type for Lustre.

1. Install and verify InfiniBand stack

dnf install -y rdma-core infiniband-diags perftest libibverbs-utils
systemctl enable --now rdma
ibstat

2. Configure IB network

nmcli con add type infiniband ifname ib0 con-name ib0 ip4 10.0.0.1/24
nmcli con up ib0

3. Verify RDMA link

ibv_devinfo
ibv_rc_pingpong -d mlx5_0

4. Configure LNET for o2ib

Create /etc/modprobe.d/lustre.conf with:

options lnet networks="o2ib(ib0)"

Then load and configure LNET:

modprobe lnet
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
lnetctl net show

Expected output:

net:
  - net type: o2ib
    interfaces:
      0: ib0
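
LNET settings made with lnetctl are not persistent across reboots; they can be saved with `lnetctl export > /etc/lnet.conf`, which the lnet service replays at boot. A sketch of the matching fragment (the exact YAML schema may vary by Lustre version):

```yaml
net:
  - net type: o2ib
    local NI(s):
      - interfaces:
          0: ib0
```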

5. Format and Mount Lustre Targets

Metadata Server (MGS + MDT)

mkfs.lustre --fsname=lustrefs --mgs --mdt --index=0 /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt

Object Storage Server (OSS)

mkfs.lustre --fsname=lustrefs --ost --index=0 --mgsnode=<MGS>@o2ib /dev/sdc
mount -t lustre /dev/sdc /mnt/ost

Client Node

sudo mkdir -p /mnt/lustre
sudo mount -t lustre <MGS>@o2ib:/lustrefs /mnt/lustre

For example, with an MGS at 172.16.0.10:

sudo mount -t lustre \
  172.16.0.10@o2ib:/lustrefs \
  /mnt/lustre

Example without an InfiniBand network (TCP transport):

[root@vbox ~]# mount -t lustre 172.16.0.10@tcp:/lustre /mnt/lustre-client
[root@vbox ~]# # Verify the mount worked
[root@vbox ~]# df -h /mnt/lustre-client
Filesystem                Size  Used Avail Use% Mounted on
172.16.0.10@tcp:/lustre   12G  2.5M   11G   1% /mnt/lustre-client
[root@vbox ~]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         4.5G        1.9M        4.1G   1% /mnt/lustre-client[MDT:0]
lustre-OST0000_UUID         7.5G        1.2M        7.0G   1% /mnt/lustre-client[OST:0]
lustre-OST0001_UUID         3.9G        1.2M        3.7G   1% /mnt/lustre-client[OST:1]
filesystem_summary:        11.4G        2.4M       10.7G   1% /mnt/lustre-client
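
The `lfs df -h` output above lends itself to scripting; a small sketch that pulls each OST's use percentage out of that sample with awk:

```shell
# Sample data inlined so the pipeline can be tried without a cluster.
lfs_df_output='UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         4.5G        1.9M        4.1G   1% /mnt/lustre-client[MDT:0]
lustre-OST0000_UUID         7.5G        1.2M        7.0G   1% /mnt/lustre-client[OST:0]
lustre-OST0001_UUID         3.9G        1.2M        3.7G   1% /mnt/lustre-client[OST:1]'

# Print each OST UUID with its use percentage (columns 1 and 5).
printf '%s\n' "$lfs_df_output" | awk '/OST[0-9]+_UUID/ {print $1, $5}'
```

On a live client, replace the inlined sample with `lfs df -h | awk ...` to watch for unevenly filled OSTs.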

6. Configuring Node Maps (Access Control)

Node maps allow administrators to restrict Lustre client access based on network or host identity.

1. View current node maps

lctl nodemap_list

2. Create a new node map for trusted clients

lctl nodemap_add trusted_clients

3. Add allowed network range or host

lctl nodemap_add_range --name trusted_clients --range 10.0.0.[0-255]@o2ib

4. Enable enforcement

lctl nodemap_activate 1
lctl nodemap_modify --name trusted_clients --property admin --value 1
lctl nodemap_modify --name trusted_clients --property trusted --value 1

5. Restrict default map

lctl nodemap_modify --name default --property deny_unknown --value 1

This ensures only IPs in 10.0.0.0/24 can mount and access the Lustre filesystem.


7. Configuring Access Control Lists (ACLs)

Lustre supports standard POSIX ACLs for fine-grained directory and file permissions.

1. Enable ACL support on mount

mount -t lustre -o acl <MGS>@o2ib:/lustrefs /mnt/lustre

2. Verify ACL support

mount | grep lustre

Should show:

<MGS>@o2ib:/lustrefs on /mnt/lustre type lustre (rw,acl)

3. Set ACLs on directories

setfacl -m u:researcher:rwx /mnt/lustre/projects
setfacl -m g:analysts:rx /mnt/lustre/reports

4. View ACLs

getfacl /mnt/lustre/projects

Sample output:

# file: projects
# owner: root
# group: root
user::rwx
user:researcher:rwx
group::r-x
group:analysts:r-x
mask::rwx
other::---
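
In the output above, the mask entry caps the effective rights of named users and groups. A small local sketch of that POSIX ACL rule (pure bit-math, no filesystem required):

```shell
# Effective ACL permission = entry permission AND mask (POSIX ACL semantics).
python3 - <<'EOF'
def to_bits(s):
    return (("r" in s) << 2) | (("w" in s) << 1) | ("x" in s)

def to_str(b):
    return ("r" if b & 4 else "-") + ("w" if b & 2 else "-") + ("x" if b & 1 else "-")

mask = to_bits("rwx")
for entry, perms in [("user:researcher", "rwx"), ("group:analysts", "r-x")]:
    print(entry, "effective:", to_str(to_bits(perms) & mask))
EOF
```

If the mask were tightened to r-x, the researcher's effective write access would disappear even though the user entry still reads rwx, which is why `getfacl` flags such entries with `#effective:` comments.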

8. Verifying Cluster Health

On all nodes:

lctl ping <MGS>@o2ib
lctl dl
lctl get_param -n net.*.state

Check RDMA performance:

lctl get_param -n o2iblnd.*.stats

Check file system mount from client:

df -h /mnt/lustre

Optional: Check node map enforcement

Try mounting from an unauthorized IP — it should fail:

mount -t lustre <MGS>@o2ib:/lustrefs /mnt/test
mount.lustre: mount <MGS>@o2ib:/lustrefs at /mnt/test failed: Permission denied

9. Common Issues and Troubleshooting

Issue                            Possible Cause                              Resolution
Mount failed: no route to host   IB subnet mismatch or LNET not configured   Verify lnetctl net show and ping -I ib0 between nodes.
Permission denied                Node map restriction active                 Check lctl nodemap_list and ensure the client IP range is allowed.
Slow performance                 RDMA disabled or fallback to TCP            Verify lctl list_nids shows the @o2ib transport.

10. Final Validation Checklist

  • InfiniBand RDMA verified with ibv_rc_pingpong
  • LNET configured for o2ib(ib0)
  • MGS, MDT, and OST mounted successfully
  • Clients connected via @o2ib
  • Node maps restricting unauthorized hosts
  • ACLs correctly enforcing directory-level access

Summary

With RDMA transport, Lustre achieves near line-rate performance while node maps and ACLs enforce robust security and access control. This combination provides a scalable, high-performance, and policy-driven storage environment ideal for AI, HPC, and research workloads.