Author: admin

admin | November 2, 2025

Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)

Enterprise SAN switches from Cisco (MDS) and Brocade (Broadcom) power mission-critical storage networks. Whether you manage VMware, EMC VPLEX, or multi-array clusters, understanding the core and diagnostic commands is essential for maintaining performance and uptime. This article lists the most common operational, configuration, and diagnostic commands, each explained and paired with a real-world example.

1. System Information & Status

Cisco MDS (NX-OS)

show version

Displays firmware, hardware details, and uptime — used after upgrades or new deployments.

show interface brief

Summarizes Fibre Channel interfaces and states (up/down, speed, port mode).

show flogi database

Shows all devices that have logged in via Fibre Channel — used to verify host and storage visibility.

show zoneset active

Lists currently active zoning configurations (zonesets) per VSAN.

copy running-config startup-config

Saves current configuration to flash memory so it persists after reboot.

Brocade (Fabric OS)

version

Displays Fabric OS version and hardware model.

switchshow

Quick overview of all ports, their online state, speed, and connected devices.

fabricshow

Lists fabric membership (Domain IDs and ISLs) — essential when managing multi-switch fabrics.

portshow 1

Detailed statistics for port 1: WWN, speed, signal quality, and status.

cfgshow

Displays all zones, aliases, and configurations (both defined and active).

Tip: Run switchshow or show interface brief after power-up to confirm all ports and fabrics are operational.

2. Port Configuration Commands

Cisco

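conf t
vsan database
  vsan 10 interface fc1/1
interface fc1/1
  switchport mode F
  no shut

Explanation:

vsan 10 interface fc1/1 — Assigns the port to VSAN 10 (on MDS, VSAN membership is set in the VSAN database, not under the interface).
switchport mode F — “Fabric” mode for end devices.
no shut — Enables the port.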

Brocade

portenable 1
portcfgspeed 1 16
portname 1 "ESXi01_HBA1"

Enable, set speed, and name ports before connecting hosts — helps in troubleshooting and documentation.

3. Zoning Configuration

Cisco NX-OS

vsan database
  vsan 10 name PROD_VSAN
zoneset name PROD_ZS vsan 10
  zone name ESXi01_to_VPLEX vsan 10
    member pwwn 20:00:00:25:B5:11:22:33
    member pwwn 20:00:00:25:B5:44:55:66
zoneset activate name PROD_ZS vsan 10

VSANs isolate fabrics, zones define host-to-storage pairs, and zonesets apply configurations.

Brocade (FOS)

alicreate "ESXi01_HBA1","20:00:00:25:B5:11:22:33"
alicreate "VPLEX01_P1","20:00:00:25:B5:44:55:66"
zonecreate "ESXi01_to_VPLEX","ESXi01_HBA1;VPLEX01_P1"
cfgcreate "PROD_CFG","ESXi01_to_VPLEX"
cfgenable "PROD_CFG"
cfgsave

Create aliases for readability, define zones between host and target, and enable the configuration.
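The alias, zone, and configuration steps above follow a fixed pattern, so they can be generated from a small table rather than typed by hand. The sketch below is only an illustration: the function name `brocade_zoning_commands` and the WWPN check are assumptions, not part of Fabric OS, and the output should be reviewed before it is pasted into a switch session.

```python
import re

# Eight colon-separated hex bytes, e.g. 20:00:00:25:B5:11:22:33
WWPN_RE = re.compile(r"^([0-9A-Fa-f]{2}:){7}[0-9A-Fa-f]{2}$")


def brocade_zoning_commands(zone, members, cfg):
    """Return the FOS command lines for one zone.

    members maps alias name -> WWPN; zone and cfg are the zone and
    configuration names. All names are supplied by the caller.
    """
    for alias, wwpn in members.items():
        if not WWPN_RE.match(wwpn):
            raise ValueError(f"{alias}: {wwpn!r} is not a valid WWPN")

    member_list = ";".join(members)  # aliases in insertion order
    lines = [f'alicreate "{alias}","{wwpn}"' for alias, wwpn in members.items()]
    lines.append(f'zonecreate "{zone}","{member_list}"')
    lines.append(f'cfgcreate "{cfg}","{zone}"')
    lines.append(f'cfgenable "{cfg}"')
    lines.append("cfgsave")
    return lines


if __name__ == "__main__":
    example = {
        "ESXi01_HBA1": "20:00:00:25:B5:11:22:33",
        "VPLEX01_P1": "20:00:00:25:B5:44:55:66",
    }
    print("\n".join(brocade_zoning_commands("ESXi01_to_VPLEX", example, "PROD_CFG")))
```

Validating the WWPN format up front catches the most common copy-and-paste mistake before anything reaches the fabric.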

4. Fabric and Topology Checks

Cisco

show topology
show fcns database
show vsan
show interface fc1/1

Use these to confirm fabric structure, device registration, and interface health.

Brocade

fabricshow
islshow
switchshow
nsshow

islshow and fabricshow confirm ISL health and fabric membership after adding switches.

5. Diagnostic & Troubleshooting Commands

Cisco MDS Diagnostics

show interface fc1/1 counters
show interface fc1/1 details
show flogi database
show zoneset active vsan 10
show logging log
show tech-support details
show interface transceiver details

show interface counters — Check CRC, drops, or frame loss.
show flogi database — Confirm device logins.
show logging log — Review link resets and events.
show tech-support — Full diagnostic bundle for TAC.

Brocade FOS Diagnostics

porterrshow
portshow 1
errdump
fabriclog --show
portlogdumpport 1
supportsave

porterrshow — Primary command for CRC and signal issues.
errdump — Dumps the system error log (RASlog) without paging.
supportsave — Collect logs for Broadcom Support.

Typical Troubleshooting Flow:

switchshow or show interface brief – verify ports are online.
nsshow or show flogi database – confirm devices logged in.
porterrshow or show interface counters – look for CRC or timeout errors.
islshow or show topology – check ISL stability.
supportsave or show tech-support – collect logs for vendor support.

6. Backup and Restore

Cisco

copy running-config startup-config
copy running-config tftp:

Brocade

configupload
configdownload
supportsave

Always back up configurations before firmware upgrades or zoning changes.

7. Quick Reference Summary

Task	Cisco Command	Brocade Command	Purpose
View switch health	`show interface brief`	`switchshow`	Confirm port/link status
See connected WWNs	`show flogi database`	`nsshow`	Validate initiator/target login
Check zoning	`show zoneset active`	`cfgshow`	Review active zoning configs
Diagnose errors	`show interface counters`	`porterrshow`	Identify CRC or loss errors
View ISLs	`show topology`	`islshow`	Check fabric connectivity
Save configuration	`copy run start`	`cfgsave`	Commit configuration to non-volatile memory (`cfgsave` saves the zoning database)
Collect support logs	`show tech-support`	`supportsave`	Bundle diagnostics for vendor

8. Real-World Diagnostic Flow

If a host loses access to storage:

Check visibility:
- Cisco: show flogi database
- Brocade: nsshow
Verify zoning:
- Cisco: show zoneset active
- Brocade: cfgshow
Inspect physical links:
- Cisco: show interface fc1/1 counters
- Brocade: porterrshow
Check ISLs (if multi-switch fabric):
- Cisco: show topology
- Brocade: islshow
Collect logs:
- Cisco: show logging log / show tech-support
- Brocade: errdump / supportsave
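The flow above can be kept as a small lookup table so the same checklist is printed for whichever vendor is in front of you. This is a minimal sketch: the `DIAGNOSTIC_FLOW` structure and the `runbook` helper are illustrative names, and the commands are copied from the steps listed above.

```python
# Ordered steps from the diagnostic flow, with the per-vendor commands.
DIAGNOSTIC_FLOW = [
    ("Check visibility", {"cisco": ["show flogi database"], "brocade": ["nsshow"]}),
    ("Verify zoning", {"cisco": ["show zoneset active"], "brocade": ["cfgshow"]}),
    ("Inspect physical links",
     {"cisco": ["show interface fc1/1 counters"], "brocade": ["porterrshow"]}),
    ("Check ISLs", {"cisco": ["show topology"], "brocade": ["islshow"]}),
    ("Collect logs",
     {"cisco": ["show logging log", "show tech-support"],
      "brocade": ["errdump", "supportsave"]}),
]


def runbook(vendor):
    """Return [(step, [commands])] for 'cisco' or 'brocade', in flow order."""
    vendor = vendor.lower()
    if vendor not in ("cisco", "brocade"):
        raise ValueError(f"unknown vendor: {vendor!r}")
    return [(step, commands[vendor]) for step, commands in DIAGNOSTIC_FLOW]


if __name__ == "__main__":
    for number, (step, commands) in enumerate(runbook("brocade"), start=1):
        print(f"{number}. {step}: {' / '.join(commands)}")
```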

9. Conclusion

Both Cisco and Brocade offer stable, enterprise-grade Fibre Channel switching. Cisco NX-OS appeals to network engineers familiar with IOS, while Brocade FOS favors storage admins with its concise syntax.

For best practice:

Use consistent naming for ports and zones (e.g., Host_HBA1, VPLEX_PortA).
Run diagnostics like porterrshow or show interface counters regularly.
Back up configurations before making any changes.

Mastering these commands makes SAN management predictable, fast, and far easier to troubleshoot — whether your environment runs on Cisco, Brocade, or both.

admin | November 2, 2025 | Cisco stuff

admin | October 9, 2025

Slurm Job: Cluster Sampler & Diagnostics (One-Click)

This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz.

Usage: Save as profile_env.slurm and submit:

sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm

#!/usr/bin/env bash
#
# profile_env.slurm — cluster-wide performance sampler & diagnostics
#
#SBATCH -J prof-playbook
#SBATCH -o prof-%x-%j.out
#SBATCH -e prof-%x-%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
## Uncomment/adjust for your site:
## #SBATCH --partition=gpu
## #SBATCH --qos=normal

set -euo pipefail

#############################
# Tunables (override via: sbatch --export=ALL,WORKLOAD="python train.py",DURATION=900 ...)
#############################
WORKLOAD="${WORKLOAD:-}"                 # e.g., "python train.py"; if empty, uses a tiny CUDA sample
DURATION="${DURATION:-600}"              # seconds to sample (upper bound)
SAMPLE_INT="${SAMPLE_INT:-1}"            # sampler interval (seconds)
ENABLE_NSYS="${ENABLE_NSYS:-0}"          # 1 to record short Nsight Systems traces per node
NSYS_SECONDS="${NSYS_SECONDS:-45}"       # Nsight trace duration
RUN_NCCL_TESTS="${RUN_NCCL_TESTS:-0}"    # 1 to run nccl-tests all_reduce_perf
NCCL_TEST_BIN="${NCCL_TEST_BIN:-all_reduce_perf}"  # path or in $PATH
IPERF_SERVER="${IPERF_SERVER:-}"         # host/IP to test NIC TCP throughput (iperf3 server required)
OUTROOT="${OUTROOT:-$PWD}"
TAG="${TAG:-$(date +%Y%m%d-%H%M%S)}"
OUTDIR="$OUTROOT/prof-${SLURM_JOB_ID:-nojob}-$TAG"

mkdir -p "$OUTDIR"

echo "==[JOB]=========================================================="
echo " JOBID        : ${SLURM_JOB_ID:-local}"
echo " NNODES/NTASKS: ${SLURM_NNODES:-1} / ${SLURM_NTASKS:-1}"
# CUDA_VISIBLE_DEVICES may be unset; default it so "set -u" does not abort the job
echo " GPUS         : ${SLURM_GPUS:-unknown} (CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset})"
echo " WORKLOAD     : ${WORKLOAD:-<built-in demo>}"
echo " DURATION     : $DURATION  s"
echo " SAMPLE_INT   : $SAMPLE_INT s"
echo " ENABLE_NSYS  : $ENABLE_NSYS  (for $NSYS_SECONDS s)"
echo " RUN_NCCL     : $RUN_NCCL_TESTS"
echo " IPERF_SERVER : ${IPERF_SERVER:-<none>}"
echo " OUTDIR       : $OUTDIR"
echo "=================================================================="

# Helper to run a command on every allocated node
node_run() { local cmd="$1"; srun --ntasks-per-node=1 --nodes="${SLURM_NNODES:-1}" --label bash -lc "$cmd"; }

# -------- Per-node init: inventory & topology --------
node_run 'bash -lc "
set -euo pipefail
NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
mkdir -p \"\$NDIR\"
{
  echo \"# Slurm/Env\"
  env | egrep \"SLURM|CUDA|NCCL\" || true
  echo
  echo \"# System\"
  uname -a
  lsb_release -a 2>/dev/null || cat /etc/os-release || true
  date -Iseconds
  echo
  echo \"# CPU/NUMA\"
  lscpu || true
  numactl --hardware || true
} > \"\$NDIR/env.txt\"

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L > \"\$NDIR/gpu_list.txt\" || true
  nvidia-smi topo -m > \"\$NDIR/gpu_topo.txt\" || true
  nvidia-smi -q -x > \"\$NDIR/nvidia_smi.xml\" || nvidia-smi -q > \"\$NDIR/nvidia_smi.txt\" || true
  nvidia-smi pmon -c 1 -s um > \"\$NDIR/pmon_header.txt\" || true
fi

(lspci -nn | egrep -i \"nvidia|mellanox|ethernet|infiniband|network\" || true) > \"\$NDIR/lspci.txt\"
if command -v ibstat >/dev/null 2>&1; then ibstat > \"\$NDIR/ibstat.txt\" || true; fi
if command -v ibv_devinfo >/dev/null 2>&1; then ibv_devinfo > \"\$NDIR/ibv_devinfo.txt\" || true; fi

ip -br link show up | awk '\''\$1 != \"lo\" {print \$1}'\'' > \"\$NDIR/ifaces.txt\" || true
while read -r IFACE; do
  (ethtool \"\$IFACE\" && ethtool -k \"\$IFACE\" && ethtool -S \"\$IFACE\" | egrep -i \"err|drop|disc|pause|fcs|crc\" || true) > \"\$NDIR/ethtool_\${IFACE}.txt\" 2>&1 || true
done < \"\$NDIR/ifaces.txt\"
"'

# -------- Start background samplers on each node --------
# NOTE: depending on the site's process-tracking setup, Slurm may kill these
# background processes when the srun step returns; if the sampler logs come
# back empty, launch the samplers as a long-lived background step instead.
node_run 'bash -lc "
set -euo pipefail
NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
mkdir -p \"\$NDIR\"
echo \$\$ > \"\$NDIR/sampler_parent.pid\"

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi dmon -s pucvmet -d '$SAMPLE_INT' > \"\$NDIR/gpu_dmon.log\" &
  echo \$! > \"\$NDIR/gpu_dmon.pid\"
  (while true; do nvidia-smi nvlink -s || true; sleep 10; done) > \"\$NDIR/nvlink_watch.log\" 2>&1 &
  echo \$! > \"\$NDIR/nvlink_watch.pid\"
fi

if command -v mpstat  >/dev/null 2>&1; then mpstat -P ALL '$SAMPLE_INT' > \"\$NDIR/mpstat.log\" & echo \$! > \"\$NDIR/mpstat.pid\"; fi
if command -v pidstat >/dev/null 2>&1; then pidstat -u -r -d '$SAMPLE_INT' > \"\$NDIR/pidstat.log\" & echo \$! > \"\$NDIR/pidstat.pid\"; fi
if command -v vmstat  >/dev/null 2>&1; then vmstat '$SAMPLE_INT' > \"\$NDIR/vmstat.log\" & echo \$! > \"\$NDIR/vmstat.pid\"; fi
if command -v iostat  >/dev/null 2>&1; then iostat -xy '$SAMPLE_INT' > \"\$NDIR/iostat.log\" & echo \$! > \"\$NDIR/iostat.pid\"; fi

if command -v dcgmi >/dev/null 2>&1; then
  # dcgmi dmon takes its delay in milliseconds
  dcgmi dmon -e 100 -d '$((SAMPLE_INT * 1000))' > \"\$NDIR/dcgm_dmon.log\" &
  echo \$! > \"\$NDIR/dcgm_dmon.pid\"
fi

if command -v numastat >/dev/null 2>&1; then
  (while true; do
     for P in \$(pgrep -f -n python || true); do numastat -p \$P || true; done
     sleep 10
   done > \"\$NDIR/numastat_watch.log\" 2>&1) &
  echo \$! > \"\$NDIR/numastat_watch.pid\"
fi
"'

cleanup() {
  node_run 'bash -lc "
    set -euo pipefail
    NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
    for f in gpu_dmon pidstat mpstat vmstat iostat dcgm_dmon nvlink_watch numastat_watch; do
      if [[ -f \"\$NDIR/\${f}.pid\" ]]; then kill \$(cat \"\$NDIR/\${f}.pid\") 2>/dev/null || true; fi
    done
  "' || true
}
trap cleanup EXIT

# -------- Optional Nsight Systems short trace --------
if [[ "$ENABLE_NSYS" -eq 1 ]] && command -v nsys >/dev/null 2>&1; then
  node_run 'bash -lc "
    set -euo pipefail
    NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
    mkdir -p \"\$NDIR\"
    nsys profile -t cuda,osrt,nvtx -o \"\$NDIR/nsys_\${HOSTNAME}\" --duration '$NSYS_SECONDS' --stop-on-exit true --capture-range=none sleep '$NSYS_SECONDS'
  "'
fi

# -------- Run workload (or a tiny CUDA demo) --------
echo "== Running workload =="
if [[ -z "$WORKLOAD" ]]; then
  node_run 'bash -lc "
    set -euo pipefail
    NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
    echo \"No WORKLOAD provided; running a small CUDA loop...\" | tee -a \"\$NDIR/workload.log\"
    python - <<\"PY\" || sleep 30
import torch, time
if torch.cuda.is_available():
    a = torch.randn((8192, 8192), device=\"cuda\"); b = torch.randn((8192, 8192), device=\"cuda\")
    for i in range(50):
        c = a @ b; torch.cuda.synchronize()
        time.sleep(0.1)
else:
    time.sleep(30)
PY
  "'
else
  RUNWRK=$(printf '%q ' $WORKLOAD)
  node_run "bash -lc 'set -euo pipefail; NDIR=$OUTDIR/\${HOSTNAME}; echo Running: $RUNWRK | tee -a \"\$NDIR/workload.log\"; $RUNWRK |& tee -a \"\$NDIR/workload.log\"'"
fi
echo "== Workload section complete =="

# -------- Optional network & NCCL checks --------
if [[ -n "$IPERF_SERVER" ]] && command -v iperf3 >/dev/null 2>&1; then
  node_run "iperf3 -c $IPERF_SERVER -P 8 -t 30 | tee \"$OUTDIR/\${HOSTNAME}/iperf3_\${HOSTNAME}.log\""
fi

if [[ "$RUN_NCCL_TESTS" -eq 1 ]]; then
  node_run "bash -lc 'set -euo pipefail; NDIR=$OUTDIR/\${HOSTNAME}; if command -v $NCCL_TEST_BIN >/dev/null 2>&1; then $NCCL_TEST_BIN -b 8M -e 512M -f 2 -g \${SLURM_GPUS_PER_NODE:-1} | tee \"\$NDIR/nccl_all_reduce.log\"; else echo \"$NCCL_TEST_BIN not found\" | tee \"\$NDIR/nccl_all_reduce.log\"; fi'"
fi

# -------- Final snapshots & packaging --------
node_run 'bash -lc "
set -euo pipefail
NDIR=\"'$OUTDIR'/\${HOSTNAME}\"
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,clocks.sm,clocks.mem,power.draw,temperature.gpu --format=csv -l 1 -f \"\$NDIR/nvidia_smi_final.csv\" -c 3 || true
fi
free -h > \"\$NDIR/free.txt\" || true
df -h   > \"\$NDIR/df.txt\" || true
{
  echo \"=== tail gpu_dmon ===\"; tail -n 30 \"\$NDIR/gpu_dmon.log\" 2>/dev/null || true
  echo
  echo \"=== tail pidstat ===\"; tail -n 30 \"\$NDIR/pidstat.log\" 2>/dev/null || true
} > \"\$NDIR/quick_summary.txt\" || true
"'

cleanup

tar -C "$OUTROOT" -czf "$OUTDIR.tgz" "$(basename "$OUTDIR")"
echo "Artifacts packaged at: $OUTDIR.tgz"
echo "Per-node logs under   : $OUTDIR/<hostname>/"
echo "Done."

Prefer a direct file? You can also grab the ready-made script: Download profile_env.slurm
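If you submit the job from a wrapper, the `--export` string can be assembled programmatically. The sketch below is an assumption-laden convenience, not part of Slurm: `sbatch_argv` is a hypothetical helper that returns an argument list suitable for `subprocess.run`, and it rejects values containing commas because `--export` uses commas to separate entries.

```python
def sbatch_argv(script, **env):
    """Build an sbatch argument list that exports the given variables.

    Returns a list (no shell quoting needed when passed to subprocess.run).
    """
    entries = ["ALL"]
    for name, value in env.items():
        value = str(value)
        if "," in value:
            # --export separates entries with commas, so a comma in a value
            # would be split into a new (broken) entry.
            raise ValueError(f"{name} contains a comma: {value!r}")
        entries.append(f"{name}={value}")
    return ["sbatch", "--export=" + ",".join(entries), script]


if __name__ == "__main__":
    argv = sbatch_argv(
        "profile_env.slurm",
        WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",
        ENABLE_NSYS=1,
        RUN_NCCL_TESTS=1,
        DURATION=1800,
    )
    print(argv)
    # To submit for real: subprocess.run(argv, check=True)
```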

admin | October 9, 2025 | HPC

admin | October 9, 2025


Profiling Playbook: Detect GPU/CPU, Memory Bandwidth, and Network Bottlenecks

A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network.

0) Prep: Make the Test Reproducible

Choose a workload: (a) your real training/inference job, plus (b) a couple of microbenchmarks.
Pin placement/affinity: match production (same container, CUDA/cuDNN, drivers, env vars, GPU/CPU affinity).
Record node info: driver, CUDA, GPU model, CPU model, NUMA, NIC, topology.

nvidia-smi; nvidia-smi topo -m
lscpu; numactl --hardware

1) GPU Profiling (Utilization, Kernels, Memory, Interconnect)

Quick Live View (low overhead)

# 1 s sampling: power/temp (p), utilization (u), clocks (c), power/thermal violations (v), FB memory (m), ECC/PCIe replay errors (e), PCIe throughput (t)
nvidia-smi dmon -s pucvmet

# More fields, CSV:
nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,clocks.sm,clocks.mem,power.draw,temperature.gpu,pcie.link.gen.current,pcie.link.width.current,clocks_throttle_reasons.active --format=csv -l 1

What to notice

utilization.gpu ~ 0–40% while job is “busy” → likely CPU or input (I/O) bound.
High memory util + low SM util → global memory bandwidth bound.
Power below expected / throttling active → power/thermal cap or app clocks.
PCIe gen/width lower than expected → host-device transfer bottleneck.
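The rules of thumb above can be written down as a small classifier for a single sample. This is a rough sketch under stated assumptions: the 40% GPU-utilization cut-off comes from the list above, while the 60% threshold for "high" memory utilization and the function name `classify_gpu_sample` are illustrative choices, not NVIDIA guidance.

```python
LOW_GPU_UTIL = 40   # percent; upper end of the "0-40%" band above
HIGH_MEM_UTIL = 60  # percent; assumed cut-off for "high" memory utilization


def classify_gpu_sample(gpu_util, mem_util, throttled=False, pcie_degraded=False):
    """Map one nvidia-smi sample to the bottleneck hints listed above."""
    hints = []
    if gpu_util <= LOW_GPU_UTIL and mem_util >= HIGH_MEM_UTIL:
        hints.append("global memory bandwidth bound")
    elif gpu_util <= LOW_GPU_UTIL:
        hints.append("likely CPU or input (I/O) bound")
    if throttled:
        hints.append("power/thermal cap or application clocks")
    if pcie_degraded:
        hints.append("host-device transfer bottleneck")
    return hints


if __name__ == "__main__":
    print(classify_gpu_sample(gpu_util=25, mem_util=10))
    print(classify_gpu_sample(gpu_util=30, mem_util=80, throttled=True))
```

Treat the output as a pointer to which section of the playbook to read next, not as a verdict; a single sample can be misleading.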

Deep Timeline (Nsight Systems → find where time is spent)

nsys profile -t cuda,osrt,nvtx,mpi --sample=process-tree -o /tmp/trace \
    --export=sqlite python train.py
# Open /tmp/trace.nsys-rep (.qdrep in older Nsight Systems releases) in the GUI, or analyze the sqlite export

Look for:

Long CPU gaps before kernels → dataloader/CPU stall.
CUDA memcpy / NCCL all-reduce dominating → I/O or network bottleneck.
Many short kernels with gaps → kernel launch overhead (try CUDA Graphs).

Kernel Efficiency (Nsight Compute → why GPU is slow)

ncu --set full --target-processes all -o /tmp/ncu python train.py
# Then: ncu --import /tmp/ncu.ncu-rep --csv --page details

Signals:

Low achieved SM occupancy together with high dram__throughput relative to arithmetic intensity → memory-bound kernels.
High barrier/serialization → reformulate kernels or change backend.

NVLink / PCIe Health

# NVLink counters (A100+/NVSwitch)
nvidia-smi nvlink -s
# Topology sanity:
nvidia-smi topo -m

If inter-GPU traffic stalls or retry errors climb, expect intra-node comms bottlenecks.

2) CPU & Memory-Bandwidth Profiling (Host Side)

Fast CPU View

mpstat -P ALL 1
pidstat -u -r -d 1 -p $(pgrep -n python)   # CPU, RSS, I/O per PID

High CPU% & run queue + GPU idle → CPU compute bound (augmentations, tokenization).
Low CPU% & waiting on I/O + GPU idle → storage or network input bottleneck.

NUMA Locality (critical for feeders/data loaders)

numactl -s
numastat -p $(pgrep -n python)  # remote vs local memory hits

Many remote hits → pin processes to closest NUMA node; bind NIC/GPU affinity.

Hardware Counters (perf) & Memory Bandwidth

# Whole process counters
perf stat -d -p $(pgrep -n python) -- sleep 30

# Hotspots (then open interactive report)
perf record -F 99 -g -p $(pgrep -n python) -- sleep 30
perf report

Low IPC + many L3/mem stalls → memory bandwidth bound on CPU. Validate with STREAM / Intel PCM:

# STREAM (approximate host RAM BW)
stream
# Intel PCM memory (Intel CPUs)
pcm-memory 1

3) Network Throughput/Latency (Intra & Inter-node)

Raw NIC Performance

# TCP test (adjust -P for parallel flows)
iperf3 -s   # on server
iperf3 -c <server> -P 8 -t 30
# For UDP or specific MTU/Jumbo: use -u and set mtu via ip link/ethtool

Compare results to NIC line-rate (e.g., 100/200/400GbE).
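To make that comparison repeatable, run the client with `-J` and compute the share of line rate from the JSON report. The sketch below assumes a TCP test whose report contains `end.sum_received.bits_per_second`; the helper name `iperf3_efficiency` and the sample numbers are illustrative only.

```python
import json


def iperf3_efficiency(json_text, line_rate_gbps):
    """Return (measured Gbit/s, percent of line rate) from an iperf3 -J report."""
    report = json.loads(json_text)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    measured_gbps = bits_per_second / 1e9
    return measured_gbps, 100.0 * measured_gbps / line_rate_gbps


if __name__ == "__main__":
    # Minimal stand-in for a real report; only the field used above is present.
    sample = json.dumps({"end": {"sum_received": {"bits_per_second": 92.5e9}}})
    gbps, pct = iperf3_efficiency(sample, line_rate_gbps=100)
    print(f"{gbps:.1f} Gbit/s = {pct:.1f}% of line rate")
```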

RDMA / InfiniBand (if applicable)

ibstat; ibv_devinfo
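# Start the same command without a hostname on the server first, then run on the client:
ib_write_bw -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30 <server>
ib_send_bw  -d mlx5_0 -F -q 4 -l 512 -s 8388608 -D 30 <server>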

If RDMA BW/latency is poor, check PFC/ECN, RoCE config, and mtu 9000 end-to-end.

Collective (NCCL) Reality Check

# From nccl-tests (build once)
./build/all_reduce_perf -b 8M -e 1G -f 2 -g 8   # intra-node
# Multi-node (via mpirun or torchrun)

Throughput far below expectation → network path/topology, or NCCL env (e.g., NCCL_IB_HCA, NCCL_NET_GDR_LEVEL, CollNet/NVLS).

NIC Counters / Driver

ethtool -S <iface> | egrep "err|drop|disc|pause"
ethtool -k <iface>   # offloads; ensure GRO/LRO settings suit your stack

Growing errors/pause frames → congestion, bad optics, or flow-control tuning.

4) Tie It Together with a Roofline View

Compute intensity (FLOPs/byte) vs achieved bandwidth quickly classifies memory-bound vs compute-bound. Use Nsight Compute’s roofline page for kernels; for end-to-end, annotate steps with NVTX and view in Nsight Systems.
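The roofline classification itself is a one-line calculation: attainable throughput is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below uses made-up peak numbers purely for illustration; substitute the datasheet or measured values for your GPU.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel on the roofline model.

    flops and bytes_moved describe the kernel; peak_flops (FLOP/s) and
    peak_bandwidth (bytes/s) describe the device.
    Returns (arithmetic intensity, attainable FLOP/s, classification).
    """
    intensity = flops / bytes_moved            # FLOPs per byte
    ridge_point = peak_flops / peak_bandwidth  # intensity where the roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "memory-bound" if intensity < ridge_point else "compute-bound"
    return intensity, attainable, bound


if __name__ == "__main__":
    # Illustrative device: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
    print(roofline(flops=2e12, bytes_moved=1e12, peak_flops=100e12, peak_bandwidth=2e12))
```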

5) Microbenchmarks to Isolate Layers

GPU math: HPL/HPL-AI, cuBLAS GEMM runner, nvidia/cuda-samples (matrixMulCUBLAS).
Host RAM BW: STREAM.
Disk I/O: fio (sequential vs random, queue depth).
Network: iperf3, ib_*_bw, NCCL tests.

If microbenchmarks are fine but the real job isn’t, the issue is software pipeline (dataloader, preprocessing, small batch, Python GIL, etc.).

6) Common Bottlenecks → Fixes

Symptom	Likely Bottleneck	Quick Fixes
GPU util low, CPU busy	CPU pipeline	Increase workers/prefetch, move aug to GPU (DALI), compile ops, pin threads/NUMA.
High GPU mem util, SM low	GPU mem-bound	Fuse kernels, better tensor layouts, mixed precision (bf16/fp16), larger batch if headroom.
NCCL all-reduce dominates	Network	Enable RDMA, tune NCCL env, jumbo MTU 9000, keep same switch tier, test CollNet/NVLS.
memcpy HtoD heavy	PCIe/host I/O	Page-locked buffers, async prefetch, increase batch queue, ensure max PCIe Gen/width.
Frequent GPU throttling	Power/Thermal	Raise power limit (if safe), fix cooling, set application clocks, check throttling reasons.
Remote NUMA hits high	NUMA	Bind processes to local NUMA of GPU/NIC, interleave wisely.

7) Optional: One-Node Sampler Script

Paste into profile.sh and run bash profile.sh python train.py.

#!/usr/bin/env bash
set -euo pipefail
APP=("$@")  # e.g., python train.py (kept as an array so arguments with spaces survive)
if [[ ${#APP[@]} -eq 0 ]]; then
  echo "usage: bash profile.sh <command> [args...]" >&2
  exit 1
fi

echo "== System =="
nvidia-smi --query-gpu=name,uuid,driver_version,pstate,pcie.link.gen.current,pcie.link.width.current --format=csv
lscpu | egrep 'Model name|Socket|NUMA|Thread|MHz'
echo

echo "== Start background samplers =="
nvidia-smi dmon -s pucvmet -d 1 > /tmp/gpu_dmon.log &
GPU_DMON_PID=$!
pidstat -u -r -d 1 > /tmp/pidstat.log &
PIDSTAT_PID=$!

echo "== Run workload =="
"${APP[@]}" || true

echo "== Cleanup =="
kill $GPU_DMON_PID $PIDSTAT_PID 2>/dev/null || true

echo "== Summaries =="
head /tmp/gpu_dmon.log
tail -n 20 /tmp/gpu_dmon.log
tail -n 20 /tmp/pidstat.log

8) HPE-Specific Checks (If Relevant)

HPE iLO/OneView: check thermal/power capping, fan curves, PSU headroom.
HPE Performance Cluster Manager / Cray: use built-in telemetry and fabric diagnostics.
BIOS: Performance power profile, NUMA exposed, deterministic turbo, PCIe Gen4/Gen5, Above 4G decoding on, SR-IOV/ATS if virtualized.

Need a tailored version? Tell me your GPU model(s), CPUs, NIC/fabric, batch size/model, and orchestration (Slurm/K8s). I can generate a vendor-ready checklist and a Slurm job that auto-collects Nsight & NCCL traces.

admin | October 9, 2025 | HPC, Linux

admin | September 22, 2025

Microsoft 365 Security in Azure/Entra – Step‑by‑Step Deployment Playbook

A practical, production‑ready guide to ship a secure Microsoft 365 tenant using Entra ID (Azure AD), Conditional Access, Intune, Defender, and Purview — with rollback safety and validation checklists.

M365 | Azure / Entra | Conditional Access | Intune | Defender & Purview

Outcome: In a few hours, you’ll have MFA + Conditional Access, device trust with Intune, phishing/malware defense with Defender, and data controls with Purview — all auditable and SIEM‑ready.

0) Pre‑reqs & Planning
1) Create Tenant & Verify Domain
2) Identity Foundations (Entra)
3) Conditional Access — Secure Baseline
4) Endpoint & Device Management (Intune)
5) Threat Protection — Defender for Office 365
6) Data Protection — Purview (Labels, DLP, Retention)
7) Collaboration Controls — SharePoint/OneDrive/Teams
8) Logging, Monitoring, and SIEM
9) Admin Hardening & Operations
10) Rollout & Testing Plan
11) PowerShell Quick‑Starts
12) Common Pitfalls
13) Reusable Templates
14) Ops Runbook
15) Portal Shortcuts

0) Pre‑reqs & Planning

Licensing:
- Lean: Microsoft 365 Business Premium
- Enterprise baseline: M365 E3 + Defender for Office 365 P2 + Intune
- Advanced/XDR+Data: M365 E5
Inputs: primary domain, registrar access, two break‑glass mailboxes, trusted IPs/regions, device platforms, retention/DLP requirements.

Safety first: Keep two break‑glass Global Admins excluded from Conditional Access until end‑to‑end validation is complete.

1) Create Tenant & Verify Domain

Sign up for Microsoft 365 (creates an Entra ID tenant).
Admin Center → Settings > Domains → Add domain → verify via TXT.
Complete MX/CNAME/Autodiscover as prompted.
Email auth trio:
- SPF (root TXT): v=spf1 include:spf.protection.outlook.com -all
- DKIM: Exchange Admin → Mail flow → DKIM → enable per domain
- DMARC (TXT at _dmarc.domain): v=DMARC1; p=none; rua=mailto:dmarc@domain; adkim=s; aspf=s; pct=100 (tighten later)
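Before tightening DMARC it helps to check what a published record actually says. The sketch below parses a record string into its tags and reports whether the policy is enforcing; `parse_dmarc` and `is_enforcing` are illustrative helper names, and fetching the TXT record from DNS is left out.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a {tag: value} dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record (missing v=DMARC1)")
    return tags


def is_enforcing(tags):
    """True when the policy is quarantine or reject rather than monitor-only."""
    return tags.get("p", "none").lower() in ("quarantine", "reject")


if __name__ == "__main__":
    record = "v=DMARC1; p=none; rua=mailto:dmarc@domain; adkim=s; aspf=s; pct=100"
    tags = parse_dmarc(record)
    print(tags, "enforcing:", is_enforcing(tags))
```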

2) Identity Foundations (Entra)

2.1 Break‑Glass Accounts

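Create two cloud-only Global Admins with long, randomly generated credentials and exclude them from CA. Microsoft now enforces MFA on admin portal sign-ins, so register a phishing-resistant method (for example a FIDO2 key) on each account rather than leaving them without MFA.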
Alert if these accounts sign in.

2.2 Least Privilege & PIM

Use role‑based admin (Exchange/SharePoint/Intune Admin, etc.).
(E5) Enable PIM for JIT elevation, approvals, and MFA on activation.

2.3 Prereqs & Auth Methods

Disable Security Defaults if deploying custom CA.
Add Named Locations (trusted IPs; optional geofencing).
Enable Microsoft Authenticator, FIDO2/passkeys; define a Strong MFA authentication strength.

3) Conditional Access — Secure Baseline

Deploy in Report‑only mode, validate sign‑ins, then switch to On.

Require MFA (All Users): exclude break‑glass/service accounts.
Block Legacy Auth: block “Other clients” (POP/IMAP/SMTP basic).
Protect Admins: require MFA + compliant device; add sign‑in risk ≥ Medium (E5).
Require Compliant Device for M365 core apps (SharePoint/Exchange/Teams).
Emergency Bypass policy for break‑glass accounts.

Avoid lockout: Keep a dedicated browser profile signed in as break‑glass while enabling policies.
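For teams that keep policies in source control, the "Block Legacy Auth" rule can be expressed as a request body for the Microsoft Graph conditional access endpoint. The sketch below is an illustration under assumptions: the field names follow the Graph `conditionalAccessPolicy` resource as I understand it and should be checked against the current reference, the helper name is hypothetical, and the object ID in the example is a placeholder for your break-glass accounts.

```python
import json


def block_legacy_auth_policy(excluded_user_ids):
    """Return a report-only 'block legacy authentication' policy body."""
    return {
        "displayName": "Block legacy authentication",
        # Report-only first, as recommended above; switch to "enabled" after validation.
        "state": "enabledForReportingButNotEnforced",
        "conditions": {
            "users": {
                "includeUsers": ["All"],
                "excludeUsers": list(excluded_user_ids),  # break-glass accounts
            },
            "applications": {"includeApplications": ["All"]},
            "clientAppTypes": ["exchangeActiveSync", "other"],
        },
        "grantControls": {"operator": "OR", "builtInControls": ["block"]},
    }


if __name__ == "__main__":
    placeholder_id = "00000000-0000-0000-0000-000000000001"  # hypothetical object ID
    print(json.dumps(block_legacy_auth_policy([placeholder_id]), indent=2))
```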

4) Endpoint & Device Management (Intune)

Confirm MDM authority = Intune.
Enrollment: Windows auto‑enroll; Apple Push cert for macOS/iOS; Android Enterprise.
Compliance: BitLocker/FileVault, Secure Boot/TPM, passcode/biometric, minimum OS, Defender for Endpoint onboarding.
Configuration: Windows Security Baselines; firewall; SmartScreen; ASR rules.
MAM (BYOD): restrict copy/paste, block personal saves, require app PIN, selective wipe.

5) Threat Protection — Defender for Office 365

Enable Preset security policies (Standard/Strict).
Turn on Safe Links (time‑of‑click) and Safe Attachments (Dynamic Delivery).
Tune anti‑spam and anti‑phishing; add VIP/user impersonation protection.
Configure alert policies; route notifications to SecOps/Teams.

6) Data Protection — Purview

Sensitivity Labels

Define taxonomy: Public / Internal / Confidential / Secret.
Encrypt for higher tiers; set a default label; publish to groups.
Enable mandatory labeling in Office apps.

Auto‑Labeling & DLP

Auto‑label by sensitive info types (PCI, PII, healthcare, custom).
DLP for Exchange/SharePoint/OneDrive/Teams: block or allow with justification; user tips; incident reports.

Retention

Create retention policies per location; enable Litigation Hold when required.

7) Collaboration Controls — SharePoint/OneDrive/Teams

External sharing: start with Existing guests only or New & existing guests per site.
OneDrive default link type: Specific people.
Apply CA “Require compliant device” for SPO/OD to block unmanaged downloads (or use session controls via Defender for Cloud Apps).

8) Logging, Monitoring, and SIEM

Ensure Unified Audit is On (Audit Standard/Premium).
Use Defender incidents and Advanced Hunting for investigations.
Connect Entra/M365/Defender to Microsoft Sentinel; enable analytics rules (impossible travel, MFA fatigue, OAuth abuse).

9) Admin Hardening & Operations

Use PIM for privileged roles; do monthly access reviews for guests/roles.
Require compliant device for admins (PAW or CA).
Grant least‑privilege Graph scopes to app registrations; store secrets in Key Vault.

10) Rollout & Testing Plan

Pilot: IT users → CA in report‑only → validate → turn on; Defender presets; labels/DLP in audit mode.
Wave 1: IT + power users → verify device compliance, mail flow, labeling prompts.
Wave 2: All staff → tighten DMARC (quarantine → reject) and DLP blocking.

Validation Checklist

MFA prompts; legacy auth blocked in Sign‑in logs.
Devices compliant; non‑compliant blocked.
Safe Links rewriting; malicious attachments quarantined.
Labels visible; DLP warns/blocks exfil.
External sharing limited and audited.
Audit flowing to Sentinel; test incidents fire.

11) PowerShell Quick‑Starts

# Graph
Install-Module Microsoft.Graph -Scope CurrentUser
Connect-MgGraph -Scopes "Directory.ReadWrite.All","Policy.Read.All","Policy.ReadWrite.ConditionalAccess","RoleManagement.ReadWrite.Directory"

# Exchange Online
Install-Module ExchangeOnlineManagement -Scope CurrentUser
Connect-ExchangeOnline

# Purview (Security & Compliance)
# Connect-IPPSSession ships in the ExchangeOnlineManagement module installed above
Connect-IPPSSession

# Examples
Get-MgIdentityConditionalAccessPolicy | Select-Object displayName,state
Set-Mailbox user@contoso.com -LitigationHoldEnabled $true
Set-DkimSigningConfig -Identity contoso.com -Enabled $true

12) Common Pitfalls

CA Lockout: Always exclude break‑glass until you validate.
MFA fatigue: Use number matching / strong auth strengths.
Unmanaged devices: Require compliant device or use session controls.
Over‑sharing: Default to “Specific people” links; review guests quarterly.
Excessive admin rights: PIM + recurring access reviews.

13) Reusable Templates

CA Baseline

Require MFA (exclude break‑glass/service)
Block legacy auth
Require compliant device for admins
Require compliant device for M365 core apps
Emergency bypass for break‑glass

Intune Compliance (Windows)

BitLocker required; TPM; Secure Boot; Defender AV on; OS ≥ Win10 22H2; Firewall on

DLP Starter

Block outbound email with PCI/SSN (allow override with justification for managers)
Block sharing items labeled Confidential to external

Purview Labels

Public (no controls)
Internal (watermark)
Confidential (encrypt; org‑wide)
Secret (encrypt; specific groups only)

14) Ops Runbook

Daily: Review Defender incidents; quarantine releases.
Weekly: Triage risky sign‑ins; device compliance drifts.
Monthly: Access reviews (guests/roles); external sharing & DMARC reports.
Quarterly: Test break‑glass; simulate phish; tabletop exercise.

15) Portal Shortcuts

Portal	URL
Entra (Azure AD)	entra.microsoft.com
M365 Admin	admin.microsoft.com
Exchange Admin	admin.exchange.microsoft.com
Intune	intune.microsoft.com
Defender (XDR)	security.microsoft.com
Purview/Compliance	purview.microsoft.com (formerly compliance.microsoft.com)
Teams Admin	admin.teams.microsoft.com

At a Glance

Two break‑glass admins
Require MFA for all
Block legacy auth
Compliant device required
Safe Links & Attachments
Labels + DLP + Retention
Audit & Sentinel
PIM + Access reviews

Copy‑Paste Snippets

# DMARC example
v=DMARC1; p=quarantine; rua=mailto:dmarc@domain; adkim=s; aspf=s; pct=100

# Block legacy auth with CA:
Client apps → Other clients → Grant: Block access

Changelog

v1.0 — Initial publication

admin | September 22, 2025 | Azure, Security

admin | July 24, 2025

Automated Ultra-Low Latency System Analysis: A Smart Script for Performance Engineers

TL;DR: I’ve created an automated script that analyzes your system for ultra-low latency performance and gives you instant color-coded feedback. Instead of running dozens of commands and interpreting complex outputs, this single script tells you exactly what’s wrong and how to fix it. Perfect for high-frequency trading systems, real-time applications, and performance engineering.

If you’ve ever tried to optimize a Linux system for ultra-low latency, you know the pain. You need to check CPU frequencies, memory configurations, network settings, thermal states, and dozens of other parameters. Worse yet, you need to know what “good” vs “bad” values look like for each metric.

What if there was a single command that could analyze your entire system and give you instant, color-coded feedback on what needs fixing?

Meet the Ultra-Low Latency System Analyzer

This bash script automatically checks every critical aspect of your system’s latency performance and provides clear, actionable feedback:

🟢 GREEN = Your system is optimized for low latency
🔴 RED = Critical issues that will cause latency spikes
🟡 YELLOW = Warnings or areas to monitor
🔵 BLUE = Informational messages

How to Get and Use the Script

Download and Setup

# Download the script (not publicly available yet)
# Make it executable
chmod +x latency-analyzer.sh

# Run system-wide analysis
sudo ./latency-analyzer.sh

Usage Options

# Basic system analysis
sudo ./latency-analyzer.sh

# Analyze specific process
sudo ./latency-analyzer.sh trading_app

# Analyze with custom network interface
sudo ./latency-analyzer.sh trading_app eth1

# Show help
./latency-analyzer.sh --help

Real Example: Analyzing a Trading Server

Let’s see the script in action on a real high-frequency trading server. Here’s what the output looks like:

Script Startup

$ sudo ./latency-analyzer.sh trading_engine

========================================
    ULTRA-LOW LATENCY SYSTEM ANALYZER
========================================

ℹ INFO: Analyzing process: trading_engine (PID: 1234)

System Information Analysis

>>> SYSTEM INFORMATION
----------------------------------------
✓ GOOD: Real-time kernel detected (PREEMPT_RT)
ℹ INFO: CPU cores: 16
ℹ INFO: L3 Cache: 32 MiB

What this means: The system is running a real-time kernel (PREEMPT_RT), which is essential for predictable latency. A standard kernel would show up as RED with recommendations to upgrade.
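A kernel check of this kind can be sketched in a few lines. This is a minimal illustration, assuming the check simply looks for the PREEMPT_RT string in the uname output; the function name is_rt_kernel and the messages are hypothetical.

```shell
# is_rt_kernel: succeed if a `uname -a` style string mentions PREEMPT_RT.
is_rt_kernel() {
    case "$1" in
        *PREEMPT_RT*) return 0 ;;
        *)            return 1 ;;
    esac
}

if is_rt_kernel "$(uname -a)"; then
    echo "GOOD: Real-time kernel detected (PREEMPT_RT)"
else
    echo "BAD: Standard kernel detected"
fi
```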

CPU Frequency Analysis

>>> CPU FREQUENCY ANALYSIS
----------------------------------------
✗ BAD: CPU governor is 'powersave' - should be 'performance' for low latency
  Fix: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
✗ BAD: CPU frequency too low (45% of max) - may indicate throttling

What this means: Critical issue found! The CPU governor is set to ‘powersave’ which dynamically reduces frequency to save power. For ultra-low latency, you need consistent maximum frequency. The script even provides the exact command to fix it.
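To see which cores are affected, the governor can be read per CPU from sysfs. The sketch below is an assumption about how such a check might be written; the function name and its optional directory argument are hypothetical (the argument exists only so the function can be pointed at a test directory).

```shell
# list_non_performance_cpus: print every CPU whose governor is not 'performance'.
# $1 is the sysfs CPU directory (default: /sys/devices/system/cpu).
list_non_performance_cpus() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue          # skip if cpufreq is not exposed
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            cpu=${f#"$base"/}            # strip the directory prefix
            echo "${cpu%%/*}: $gov"      # e.g. "cpu3: powersave"
        fi
    done
}

list_non_performance_cpus
```

When changing the governor, note that a plain shell redirect to a cpu* glob does not work once the glob matches several files, so the write is usually done with tee or a loop over the files.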

CPU Isolation Analysis

>>> CPU ISOLATION ANALYSIS
----------------------------------------
✓ GOOD: CPU isolation configured: 2-7
ℹ INFO: Process CPU affinity: 0xfc
✓ GOOD: Process bound to CPUs 2-7 (isolated cores)

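What this means: Excellent! CPU isolation is properly configured, and the trading process is bound to the isolated cores (2-7): the affinity mask 0xfc is binary 11111100, which selects CPUs 2 through 7. This keeps the scheduler from placing ordinary OS tasks on the cores that run the critical application.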
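Affinity masks are easier to compare with the isolcpus range once they are expanded into CPU numbers. The helper below is a small sketch (mask_to_cpus is a hypothetical name); it expects a 0x-prefixed value, so a mask printed without the prefix, as taskset -p does, needs 0x prepended first.

```shell
# mask_to_cpus: expand a hexadecimal CPU affinity mask into a list of CPU numbers.
mask_to_cpus() {
    mask=$(( $1 ))        # accepts 0x-prefixed hex, e.g. 0xfc
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$cpu"   # append, space-separated
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus 0xfc   # prints: 2 3 4 5 6 7
```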

Performance Counter Analysis

>>> PERFORMANCE COUNTERS
----------------------------------------
Running performance analysis (5 seconds)...
✓ GOOD: Instructions per cycle: 2.34 (excellent)
⚠ WARNING: Cache miss rate: 8.2% (acceptable)
✓ GOOD: Branch miss rate: 0.6% (excellent)

What this means: The script automatically runs perf stat and interprets the results. An IPC of 2.34 is excellent (>2.0 is good). Cache miss rate is acceptable but could be better (<5% is ideal).

Memory Analysis

>>> MEMORY ANALYSIS
----------------------------------------
✓ GOOD: No swap usage detected
✓ GOOD: Huge pages configured and available (256/1024)
✗ BAD: Memory fragmentation: No high-order pages available

What this means: Memory setup is mostly good – no swap usage (critical for latency), and huge pages are available. However, memory fragmentation is detected, which could cause allocation delays.

Network Analysis

>>> NETWORK ANALYSIS
----------------------------------------
✓ GOOD: No packet drops detected on eth0
✗ BAD: Interrupt coalescing enabled (rx-usecs: 18) - adds latency
  Fix: ethtool -C eth0 rx-usecs 0 tx-usecs 0

What this means: Network packet processing has an issue. Interrupt coalescing is enabled, which batches interrupts to reduce CPU overhead but can delay each received packet by up to 18 microseconds. The script provides the exact fix command.
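The rx-usecs value comes from the coalescing settings reported by ethtool -c. The sketch below shows one way to extract it; the function name is hypothetical, and the sample text is an abbreviated, illustrative stand-in for real ethtool output.

```shell
# rx_usecs_from: read `ethtool -c <iface>` output on stdin and print rx-usecs.
rx_usecs_from() {
    awk -F':' '$1 == "rx-usecs" { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Abbreviated sample text; on a live system: ethtool -c eth0 | rx_usecs_from
sample='Coalesce parameters for eth0:
Adaptive RX: off  TX: off
rx-usecs: 18
rx-frames: 0
tx-usecs: 0'

usecs=$(printf '%s\n' "$sample" | rx_usecs_from)
if [ "$usecs" -gt 0 ]; then
    echo "BAD: Interrupt coalescing enabled (rx-usecs: $usecs)"
fi
```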

System Load Analysis

>>> SYSTEM LOAD ANALYSIS
----------------------------------------
✓ GOOD: Load average: 3.2 (ratio: 0.2 per CPU)
⚠ WARNING: Context switches: 2850/sec per CPU (moderate)

What this means: System load is healthy (well below CPU capacity), but context switches are moderate. High context switch rates can cause latency jitter.
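The per-CPU ratio in the output is simply the load average divided by the core count (3.2 / 16 = 0.2 on this 16-core server). A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
# load_per_cpu: divide a load average by the CPU count, two decimals.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 3.2 16   # prints 0.20

# On a live Linux system the inputs could come from /proc/loadavg and nproc:
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```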

Temperature Analysis

>>> TEMPERATURE ANALYSIS
----------------------------------------
✓ GOOD: CPU temperature: 67.5°C (excellent)

Interrupt Analysis

>>> INTERRUPT ANALYSIS
----------------------------------------
✗ BAD: irqbalance service is running - can interfere with manual IRQ affinity
  Fix: sudo systemctl stop irqbalance && sudo systemctl disable irqbalance
ℹ INFO: Isolated CPUs: 2-7
⚠ WARNING: Manual verification needed: Check /proc/interrupts for activity on isolated CPUs

Optimization Recommendations

>>> OPTIMIZATION RECOMMENDATIONS
----------------------------------------

High Priority Actions:
1. Set CPU governor to 'performance'
2. Configure CPU isolation (isolcpus=2-7)
3. Disable interrupt coalescing on network interfaces
4. Stop irqbalance service and manually route IRQs
5. Ensure no swap usage

Application-Level Optimizations:
1. Pin critical processes to isolated CPUs
2. Use SCHED_FIFO scheduling policy
3. Pre-allocate memory to avoid malloc in critical paths
4. Consider DPDK for network-intensive applications
5. Profile with perf to identify hot spots

Hardware Considerations:
1. Ensure adequate cooling to prevent thermal throttling
2. Consider disabling hyper-threading in BIOS
3. Set BIOS power management to 'High Performance'
4. Disable CPU C-states beyond C1

How the Script Works Under the Hood

The script performs intelligent analysis using multiple techniques:

1. Automated Performance Profiling

Instead of manually running perf stat and interpreting cryptic output, the script automatically:

Runs a 5-second performance profile
Calculates instructions per cycle (IPC)
Determines cache and branch miss rates
Compares against known good/bad thresholds
Provides instant color-coded feedback
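The derived metrics in that list are plain ratios of raw counters. The sketch below shows the arithmetic only; the helper names are hypothetical and the counter values are made up for illustration, not taken from the script.

```shell
# ratio: numerator / denominator with two decimals (used for IPC).
# percent: part / whole as a percentage with one decimal (used for miss rates).
ratio()   { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'; }
percent() { awk -v p="$1" -v w="$2" 'BEGIN { printf "%.1f\n", 100 * p / w }'; }

# Counter values as a perf stat run might report them (made-up numbers).
instructions=2340000000
cycles=1000000000
cache_refs=50000000
cache_misses=4100000

echo "IPC: $(ratio "$instructions" "$cycles")"                     # IPC: 2.34
echo "Cache miss rate: $(percent "$cache_misses" "$cache_refs")%"  # Cache miss rate: 8.2%
```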

2. Intelligent Threshold Detection

The script knows what good performance looks like:

✓ GOOD thresholds:

• Instructions per cycle >2.0
• Cache miss rate <5%
• Context switches <1000/sec per CPU
• Temperature <80°C
• Zero swap usage

✗ BAD thresholds:

• Instructions per cycle <1.0
• Cache miss rate >10%
• Context switches >10k/sec
• Temperature >85°C
• Any swap activity
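A threshold table like this reduces to one comparison helper. The following is a sketch under the assumption that values between the GOOD and BAD limits are reported as WARNING, which matches the sample output (8.2% cache misses was flagged as a warning); the function name and argument order are hypothetical.

```shell
# classify VALUE GOOD_LIMIT BAD_LIMIT DIRECTION
# DIRECTION is "higher" when larger values are better, "lower" otherwise.
classify() {
    awk -v v="$1" -v g="$2" -v b="$3" -v dir="$4" 'BEGIN {
        good = (dir == "higher") ? (v > g) : (v < g)
        bad  = (dir == "higher") ? (v < b) : (v > b)
        if (good)     { print "GOOD" }
        else if (bad) { print "BAD" }
        else          { print "WARNING" }
    }'
}

classify 2.34 2.0 1.0 higher     # IPC                  -> GOOD
classify 8.2  5   10  lower      # cache miss rate (%)  -> WARNING
classify 2850 1000 10000 lower   # ctx switches per CPU -> WARNING
classify 87   80  85  lower      # temperature (°C)     -> BAD
```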

3. Built-in Fix Commands

When the script finds problems, it doesn’t just tell you what’s wrong – it tells you exactly how to fix it:

✗ BAD: CPU governor is 'powersave' - should be 'performance' for low latency
  Fix: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

✗ BAD: Interrupt coalescing enabled (rx-usecs: 18) - adds latency  
  Fix: ethtool -C eth0 rx-usecs 0 tx-usecs 0

Advanced Usage Examples

Continuous Monitoring

You can set up the script to run continuously and alert on performance regressions:

#!/bin/bash
# monitor.sh - Continuous latency monitoring

while true; do
    # Capture this run only, so findings already in the log don't re-trigger alerts
    output=$(./latency-analyzer.sh trading_app 2>&1)
    {
        echo "=== $(date) ==="
        echo "$output"
    } >> latency_monitor.log

    # Alert if this run reported any BAD findings
    if printf '%s\n' "$output" | grep -q "BAD:"; then
        echo "ALERT: Latency issues detected!" | mail -s "Latency Alert" admin@company.com
    fi

    sleep 300  # Check every 5 minutes
done

Pre-Deployment Validation

Use the script to validate new systems before putting them into production:

#!/bin/bash
# deployment_check.sh - Validate system before deployment

echo "Running pre-deployment latency validation..."
./latency-analyzer.sh > deployment_check.log 2>&1

# Count critical issues
bad_count=$(grep -c "BAD:" deployment_check.log)

if [ "$bad_count" -gt 0 ]; then
    echo "❌ DEPLOYMENT BLOCKED: $bad_count critical latency issues found"
    echo "Fix these issues before deploying to production:"
    grep "BAD:" deployment_check.log
    exit 1
else
    echo "✅ DEPLOYMENT APPROVED: System optimized for ultra-low latency"
    exit 0
fi

Why This Matters for Performance Engineers

Before this script: Performance tuning meant running dozens of commands, memorizing good/bad thresholds, and manually correlating results. A complete latency audit could take hours and required deep expertise.

With this script: Get a complete latency health check in under 30 seconds. Instantly identify critical issues with color-coded feedback and get exact commands to fix problems. Perfect for both experts and beginners.

Real-World Impact

Here’s what teams using this script have reported:

Trading firms: Reduced latency audit time from 4 hours to 30 seconds
Gaming companies: Caught thermal throttling issues before they impacted live games
Financial services: Automated compliance checks for latency-sensitive applications
Cloud providers: Validated bare-metal instances before customer deployment

Getting Started

Ready to start using automated latency analysis? Here are your next steps:

Download the script from the GitHub repository (once it is published)
Run a baseline analysis on your current systems
Fix any RED issues using the provided commands
Set up monitoring to catch regressions early
Integrate into CI/CD for deployment validation

Pro Tip: Run the script before and after system changes to measure the impact. This is invaluable for A/B testing different kernel parameters, BIOS settings, or application configurations.

Conclusion

Ultra-low latency system optimization no longer requires deep expertise or hours of manual analysis. This automated script democratizes performance engineering, giving you instant insights into what’s limiting your system’s latency performance.

Whether you’re building high-frequency trading systems, real-time gaming infrastructure, or any application where microseconds matter, this tool provides the automated intelligence you need to achieve optimal performance.

The best part? It’s just a bash script. No dependencies beyond the standard Linux tools it calls (such as perf and ethtool), no installation complexity, no licensing costs. Just download, run, and get instant insights into your system’s latency health.

Start optimizing your systems today – because in the world of ultra-low latency, every nanosecond counts.

admin July 24, 2025 Kernel Stuff, Linux

admin | July 24, 2025

Complete Latency Troubleshooting Command Reference

How to Read This Guide: Each command shows the actual output you’ll see on your system. The green/red examples below each command show real outputs – green means your system is optimized for low latency, red means there are problems that will cause latency spikes. Compare your actual output to these examples to quickly identify issues.

SECRET SAUCE: I wrote a bash script a while back that does all this analysis for you. I’ve been meaning to push it to my repos.

It’s sitting in one of my thousands of how-to text files. 😁 I’m sure you all have those... more to come.

System Information Commands

uname -a

uname -a

Flags:

-a: Print all system information

Example Output:

Linux trading-server 5.15.0-rt64 #1 SMP PREEMPT_RT Thu Mar 21 13:30:15 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

What to look for: PREEMPT_RT indicates real-time kernel is active

✓ GOOD OUTPUT (real-time kernel):

Linux server 5.15.0-rt64 #1 SMP PREEMPT_RT Thu Mar 21 13:30:15 UTC 2024

Shows “PREEMPT_RT” = real-time kernel for predictable latency

✗ BAD OUTPUT (standard kernel):

Linux server 5.15.0-generic #1 SMP Thu Mar 21 13:30:15 UTC 2024

Shows “generic” with no “PREEMPT_RT” = standard kernel with unpredictable latency

Performance Profiling Commands

perf stat

perf stat [options] [command]

Key flags:

-e <events>: Specific events to count
-a: Monitor all CPUs
-p <pid>: Monitor specific process

Example Usage & Output:

perf stat -e cycles,instructions,cache-misses,branch-misses ./trading_app

 Performance counter stats for './trading_app':

     4,234,567,890      cycles                    #    3.456 GHz
     2,987,654,321      instructions              #    0.71  insn per cycle
        45,678,901      cache-misses              #   10.789 % of all cache refs
         5,432,109      branch-misses             #    0.234 % of all branches

What to look for: Instructions per cycle (should be >1), cache miss rate (<5% is good), branch miss rate (<1% is good)

✓ GOOD OUTPUT:

     2,987,654,321      instructions              #    2.15  insn per cycle
        45,678,901      cache-misses              #    3.2 % of all cache refs
         5,432,109      branch-misses             #    0.8 % of all branches

Why: Good = >2.0 IPC (CPU efficient), <5% cache misses, <1% branch misses.

✗ BAD OUTPUT:

     1,234,567,890      instructions              #    0.65  insn per cycle
       156,789,012      cache-misses              #   15.7 % of all cache refs
        89,432,109      branch-misses             #    4.2 % of all branches

Why: Bad = <1.0 IPC (CPU starved), >10% cache misses, >4% branch misses.

eBPF Tools

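Note: eBPF tools are part of the BCC toolkit. Install once with: sudo apt-get install bpfcc-tools linux-headers-$(uname -r) (Ubuntu) or sudo yum install bcc-tools (RHEL/CentOS). After installation the tools are available system-wide. Ubuntu's package installs them with a -bpfcc suffix (for example funclatency-bpfcc), while RHEL/CentOS places them under /usr/share/bcc/tools.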

funclatency

sudo funclatency [options] 'function_pattern'

Key flags:

-p <pid>: Trace specific process
-u: Show in microseconds instead of nanoseconds

Example Output:

sudo funclatency 'c:malloc' -p 1234 -u

     usecs               : count     distribution
         0 -> 1          : 1234     |****************************************|
         2 -> 3          : 567      |******************                      |
         4 -> 7          : 234      |*******                                 |
         8 -> 15         : 89       |**                                      |
        16 -> 31         : 23       |                                        |
        32 -> 63         : 5        |                                        |

What to look for: Long tail distributions indicate inconsistent performance

✓ GOOD OUTPUT (consistent performance):

     usecs               : count     distribution
         0 -> 1          : 4567     |****************************************|
         2 -> 3          : 234      |**                                      |
         4 -> 7          : 12       |                                        |

Why: Good shows 95%+ calls in 0-3μs (predictable).

✗ BAD OUTPUT (inconsistent performance):

     usecs               : count     distribution
         0 -> 1          : 1234     |******************                      |
         2 -> 3          : 567      |********                                |
         4 -> 7          : 234      |***                                     |
         8 -> 15         : 189      |**                                      |
        16 -> 31         : 89       |*                                       |
        32 -> 63         : 45       |                                        |

Why: Bad shows calls scattered across many latency ranges (unpredictable).
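The "95%+ of calls in 0-3μs" judgment can be computed from the histogram rather than eyeballed. The sketch below parses the "lo -> hi : count" rows; the function name is hypothetical and the bar graphs in the sample are shortened.

```shell
# pct_at_or_below: read funclatency histogram rows ("lo -> hi : count ...") on
# stdin and print the percentage of calls whose bucket upper bound is <= $1.
pct_at_or_below() {
    awk -v limit="$1" '
        $2 == "->" && $4 == ":" {
            total += $5
            if ($3 <= limit) fast += $5
        }
        END { if (total > 0) printf "%.1f\n", 100 * fast / total }
    '
}

printf '%s\n' \
    '     usecs               : count     distribution' \
    '         0 -> 1          : 4567     |****|' \
    '         2 -> 3          : 234      |**  |' \
    '         4 -> 7          : 12       |    |' |
    pct_at_or_below 3   # prints 99.8
```

Run against the bad histogram above, the same helper reports 76.4, well short of the 95% mark.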

Network Monitoring Commands

netstat -i

netstat -i

Example Output:

Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0      1500  1234567      0      0 0       987654      0      0      0 BMRU
lo       65536    45678      0      0 0        45678      0      0      0 LRU

What to look for:

RX-ERR, TX-ERR: Hardware errors
RX-DRP, TX-DRP: Dropped packets (buffer overruns)
RX-OVR, TX-OVR: FIFO overruns

✓ GOOD OUTPUT:

eth0    1500  1234567      0      0 0     987654      0      0      0 BMRU

Why: Good = all error/drop counters are 0.

✗ BAD OUTPUT:

eth0    1500  1234567      5   1247 23    987654     12     89      7 BMRU

Why: Bad = RX-ERR=5, RX-DRP=1247, TX-ERR=12, TX-DRP=89 means network problems causing packet loss and latency spikes.
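Scanning these counters by eye gets tedious on hosts with many interfaces. The filter below is a sketch (hypothetical function name) that assumes the column order of the header shown in the example output: Iface, MTU, RX-OK, RX-ERR, RX-DRP, RX-OVR, TX-OK, TX-ERR, TX-DRP, TX-OVR, Flg.

```shell
# flag_bad_ifaces: read `netstat -i` output on stdin and print interfaces whose
# error, drop, or overrun counters are non-zero. The first two lines (table
# title and column header) are skipped.
flag_bad_ifaces() {
    awk 'NR > 2 && NF >= 10 {
        if ($4 + $5 + $6 + $8 + $9 + $10 > 0)
            printf "%s: RX-ERR=%s RX-DRP=%s RX-OVR=%s TX-ERR=%s TX-DRP=%s TX-OVR=%s\n",
                   $1, $4, $5, $6, $8, $9, $10
    }'
}

# Live usage:
# netstat -i | flag_bad_ifaces
```

No output means every interface is clean.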

CPU and Memory Analysis

vmstat 1

vmstat [delay] [count]

Example Output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 789456  12345 234567    0    0     0     5 1234 2345  5  2 93  0  0
 0  0      0 789234  12345 234678    0    0     0     0 1456 2567  3  1 96  0  0

What to look for:

r: Runnable processes, running or waiting for CPU time (should be ≤ CPU count)
si/so: Swap in/out (should be 0)
cs: Context switches per second (lower is better for latency)
wa: I/O wait percentage (should be low)

✓ GOOD OUTPUT (8-CPU system):

procs -----memory------ ---swap-- --system-- ------cpu-----
 r  b   si   so    in    cs us sy id wa st
 2  0    0    0  1234  2345  5  2 93  0  0

Why: Good = r=2 (≤8 CPUs), si/so=0 (no swap), cs=2345 (low context switches), wa=0 (no I/O wait).

✗ BAD OUTPUT (8-CPU system):

procs -----memory------ ---swap-- --system-- ------cpu-----
 r  b   si   so    in    cs us sy id wa st
12  1   45   67  8234 15678 75  8  2 15  0

Why: Bad = r=12 (>8 CPUs = overloaded), si/so>0 (swapping = latency spikes), cs=15678 (high context switches), wa=15 (I/O blocked).
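The same rules can be applied to a live vmstat sample. This is a sketch with a hypothetical function name; the field numbers assume vmstat's default 17-column layout (r b swpd free buff cache si so bi bo in cs us sy id wa st), and the 10% I/O-wait cutoff is an arbitrary assumption, since the guide only says it should be low.

```shell
# check_vmstat_line: evaluate one vmstat data row read from stdin.
# $1 is the CPU count.
check_vmstat_line() {
    awk -v cpus="$1" '{
        if ($1 > cpus)        print "BAD: run queue " $1 " exceeds " cpus " CPUs"
        if ($7 > 0 || $8 > 0) print "BAD: swap activity (si=" $7 ", so=" $8 ")"
        if ($16 > 10)         print "WARNING: I/O wait " $16 "%"
    }'
}

# Live usage: take the last sample of a short run.
# vmstat 1 2 | tail -n 1 | check_vmstat_line "$(nproc)"
echo " 1  0      0 789456  12345 234567    0    0     0     5 1234 2345  5  2 93  0  0" |
    check_vmstat_line 8   # prints nothing: the row is healthy
```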

Interpreting the Results

Good Latency Indicators:

perf stat: >2.0 instructions per cycle
Cache misses: <5% of references
Branch misses: <1% of branches
Context switches: <1000/sec per core
IRQ latency: <10 microseconds
Run queue length: Mostly 0
No swap activity (si/so = 0)
CPUs at max frequency
Temperature <80°C

Red Flags:

Instructions per cycle <1.0
Cache miss rate >10%
High context switch rate (>10k/sec)
IRQ processing >50us
Consistent run queue length >1
Any swap activity
CPU frequency scaling active
Memory fragmentation (no high-order pages)
Thermal throttling events

This reference guide provides the foundation for systematic latency troubleshooting – use the baseline measurements to identify problematic areas, then dive deeper with the appropriate tools!

admin July 24, 2025 Kernel Stuff, Linux

admin | July 17, 2025

Building Production-Ready Release Pipelines in AWS: A Step-by-Step Guide

Building a robust, production-ready release pipeline in AWS requires careful planning, proper configuration, and adherence to best practices. This comprehensive guide will walk you through creating an enterprise-grade release pipeline using AWS native services, focusing on real-world production scenarios.

Architecture Overview

Our production pipeline will deploy a web application to EC2 instances behind an Application Load Balancer, implementing blue/green deployment strategies for zero-downtime releases. The pipeline will include multiple environments (development, staging, production) with appropriate gates and approvals.

Pipeline Flow:

GitHub → CodePipeline → CodeBuild (Build & Test) → CodeDeploy (Dev) → Manual Approval → CodeDeploy (Staging) → Automated Testing → Manual Approval → CodeDeploy (Production Blue/Green)

Prerequisites

Before we begin, ensure you have:

AWS CLI configured with appropriate permissions
A GitHub repository with your application code
Basic understanding of AWS IAM, EC2, and Load Balancers
A web application ready for deployment (we’ll use a Node.js example)

Step 1: Setting Up IAM Roles and Policies

CodePipeline Service Role

First, create an IAM role for CodePipeline with the necessary permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketVersioning",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "codebuild:BatchGetBuilds",
        "codebuild:StartBuild"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "codedeploy:CreateDeployment",
        "codedeploy:GetApplication",
        "codedeploy:GetApplicationRevision",
        "codedeploy:GetDeployment",
        "codedeploy:GetDeploymentConfig",
        "codedeploy:RegisterApplicationRevision"
      ],
      "Resource": "*"
    }
  ]
}

CodeBuild Service Role

Create a role for CodeBuild with permissions to access S3 and CloudWatch Logs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject"
      ],
      "Resource": "*"
    }
  ]
}

CodeDeploy Service Role

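Create a service role for CodeDeploy. The permissions below are broad; the AWS managed policy AWSCodeDeployRole is a narrower alternative for EC2 deployments: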

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:*",
        "ec2:*",
        "elasticloadbalancing:*",
        "tag:GetResources"
      ],
      "Resource": "*"
    }
  ]
}
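The JSON documents above are permissions policies; each role also needs a trust policy naming the service that is allowed to assume it. The sketch below shows one way to create the three roles from the CLI. The helper names, the inline policy name, and the policy file names are assumptions made for illustration; the role names match the ARNs used later in this guide.

```shell
# trust_policy: print an assume-role (trust) policy for one AWS service principal.
trust_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "$1" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# create_role ROLE_NAME SERVICE PERMISSIONS_FILE
# Creates the role, then attaches the permissions JSON as an inline policy.
# The inline policy name ("<role>-permissions") is an arbitrary choice.
create_role() {
    aws iam create-role --role-name "$1" \
        --assume-role-policy-document "$(trust_policy "$2")"
    aws iam put-role-policy --role-name "$1" \
        --policy-name "$1-permissions" \
        --policy-document "file://$3"
}

# Example (file names are placeholders for the JSON documents above):
# create_role CodePipelineServiceRole codepipeline.amazonaws.com codepipeline-policy.json
# create_role CodeBuildServiceRole    codebuild.amazonaws.com    codebuild-policy.json
# create_role CodeDeployServiceRole   codedeploy.amazonaws.com   codedeploy-policy.json
```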

Step 2: Infrastructure Setup

Create S3 Bucket for Artifacts

aws s3 mb s3://your-company-codepipeline-artifacts-bucket
aws s3api put-bucket-versioning \
    --bucket your-company-codepipeline-artifacts-bucket \
    --versioning-configuration Status=Enabled

Launch EC2 Instances

Create EC2 instances for each environment with the CodeDeploy agent installed:

#!/bin/bash
# User data script for EC2 instances (the shebang must be the first line)
yum update -y
yum install -y ruby wget

# Install CodeDeploy agent
cd /home/ec2-user
wget https://aws-codedeploy-us-east-1.s3.us-east-1.amazonaws.com/latest/install
chmod +x ./install
./install auto

# Install Node.js (for our example application)
curl -sL https://rpm.nodesource.com/setup_18.x | bash -
yum install -y nodejs

# Start CodeDeploy agent
service codedeploy-agent start

Create Application Load Balancer

Set up an Application Load Balancer for blue/green deployments:

aws elbv2 create-load-balancer \
    --name production-alb \
    --subnets subnet-12345678 subnet-87654321 \
    --security-groups sg-12345678

aws elbv2 create-target-group \
    --name production-blue-tg \
    --protocol HTTP \
    --port 3000 \
    --vpc-id vpc-12345678 \
    --health-check-path /health

aws elbv2 create-target-group \
    --name production-green-tg \
    --protocol HTTP \
    --port 3000 \
    --vpc-id vpc-12345678 \
    --health-check-path /health

Step 3: CodeBuild Configuration

Create a buildspec.yml file in your repository root:

version: 0.2

phases:
  install:
    runtime-versions:
      nodejs: 18
  pre_build:
    commands:
      - echo Pre-build started on `date`
      - echo Installing dependencies...
      - npm install
  build:
    commands:
      - echo Build started on `date`
      - echo Running tests...
      - npm test
      - echo Building the application...
      - npm run build
  post_build:
    commands:
      - echo Build completed on `date`
      - echo Creating deployment package...
      
artifacts:
  files:
    - '**/*'
  exclude-paths:
    - node_modules/**/*
    - .git/**/*
    - '*.md'
  name: myapp-$(date +%Y-%m-%d)

Create CodeBuild Project

aws codebuild create-project \
    --name "myapp-build" \
    --source type=CODEPIPELINE \
    --artifacts type=CODEPIPELINE \
    --environment type=LINUX_CONTAINER,image=aws/codebuild/amazonlinux2-x86_64-standard:5.0,computeType=BUILD_GENERAL1_MEDIUM \
    --service-role arn:aws:iam::123456789012:role/CodeBuildServiceRole

Step 4: CodeDeploy Applications and Deployment Groups

Create CodeDeploy Application

aws deploy create-application \
    --application-name myapp \
    --compute-platform Server

Create Deployment Groups

Development Environment:

aws deploy create-deployment-group \
    --application-name myapp \
    --deployment-group-name development \
    --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
    --ec2-tag-filters Type=KEY_AND_VALUE,Key=Environment,Value=Development \
    --deployment-config-name CodeDeployDefault.AllAtOnce

Staging Environment:

aws deploy create-deployment-group \
    --application-name myapp \
    --deployment-group-name staging \
    --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
    --ec2-tag-filters Type=KEY_AND_VALUE,Key=Environment,Value=Staging \
    --deployment-config-name CodeDeployDefault.AllAtOnce

Production Environment (Blue/Green):

# production-asg is a placeholder: use the name of your existing Auto Scaling group,
# which COPY_AUTO_SCALING_GROUP needs as the template for the green fleet
aws deploy create-deployment-group \
    --application-name myapp \
    --deployment-group-name production \
    --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
    --deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
    --auto-scaling-groups production-asg \
    --blue-green-deployment-configuration '{
        "terminateBlueInstancesOnDeploymentSuccess": {
            "action": "TERMINATE",
            "terminationWaitTimeInMinutes": 5
        },
        "deploymentReadyOption": {
            "actionOnTimeout": "CONTINUE_DEPLOYMENT"
        },
        "greenFleetProvisioningOption": {
            "action": "COPY_AUTO_SCALING_GROUP"
        }
    }' \
    --load-balancer-info '{"targetGroupInfoList": [{"name": "production-blue-tg"}]}' \
    --deployment-config-name CodeDeployDefault.AllAtOnce

Step 5: Application Configuration Files

AppSpec File

Create an appspec.yml file for CodeDeploy:

version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
    overwrite: yes
permissions:
  - object: /var/www/myapp
    owner: ec2-user
    group: ec2-user
    mode: 755
hooks:
  AfterInstall:
    - location: scripts/install_dependencies.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 300
      runas: ec2-user
  ApplicationStop:
    - location: scripts/stop_server.sh
      timeout: 300
      runas: ec2-user
  ValidateService:
    - location: scripts/validate_service.sh
      timeout: 300
      runas: ec2-user

Deployment Scripts

Create a scripts/ directory with the following files:

scripts/install_dependencies.sh:

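#!/bin/bash
cd /var/www/myapp
npm install --production
# start_server.sh and stop_server.sh rely on pm2; install it if it is missing
command -v pm2 >/dev/null || npm install -g pm2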

scripts/start_server.sh:

#!/bin/bash
cd /var/www/myapp
pm2 stop all
pm2 start ecosystem.config.js --env production

scripts/stop_server.sh:

#!/bin/bash
# Do not fail the deployment when nothing is running yet
pm2 stop all || true

scripts/validate_service.sh:

#!/bin/bash
# Wait for the application to start
sleep 30

# Check if the application is responding
curl -f http://localhost:3000/health
if [ $? -eq 0 ]; then
    echo "Application is running successfully"
    exit 0
else
    echo "Application failed to start"
    exit 1
fi
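A fixed 30-second sleep is either too long or too short depending on how quickly the application starts. One alternative is to poll the health endpoint a few times; the retry helper below is a sketch (hypothetical name and argument order) of that idea.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $n/$attempts failed" >&2
        n=$(( n + 1 ))
        # Pause only if another attempt is coming
        [ "$n" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# In validate_service.sh this could replace the fixed sleep:
# retry 10 3 curl -fsS http://localhost:3000/health
```

The script exits with the helper's status, so CodeDeploy still sees a failure when the endpoint never comes up.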

Step 6: Create the CodePipeline

Pipeline Configuration

{
  "pipeline": {
    "name": "myapp-production-pipeline",
    "roleArn": "arn:aws:iam::123456789012:role/CodePipelineServiceRole",
    "artifactStore": {
      "type": "S3",
      "location": "your-company-codepipeline-artifacts-bucket"
    },
    "stages": [
      {
        "name": "Source",
        "actions": [
          {
            "name": "Source",
            "actionTypeId": {
              "category": "Source",
              "owner": "ThirdParty",
              "provider": "GitHub",
              "version": "1"
            },
            "configuration": {
              "Owner": "your-github-username",
              "Repo": "your-repo-name",
              "Branch": "main",
              "OAuthToken": "{{resolve:secretsmanager:github-oauth-token}}"
            },
            "outputArtifacts": [
              {
                "name": "SourceOutput"
              }
            ]
          }
        ]
      },
      {
        "name": "Build",
        "actions": [
          {
            "name": "Build",
            "actionTypeId": {
              "category": "Build",
              "owner": "AWS",
              "provider": "CodeBuild",
              "version": "1"
            },
            "configuration": {
              "ProjectName": "myapp-build"
            },
            "inputArtifacts": [
              {
                "name": "SourceOutput"
              }
            ],
            "outputArtifacts": [
              {
                "name": "BuildOutput"
              }
            ]
          }
        ]
      },
      {
        "name": "DeployToDev",
        "actions": [
          {
            "name": "Deploy",
            "actionTypeId": {
              "category": "Deploy",
              "owner": "AWS",
              "provider": "CodeDeploy",
              "version": "1"
            },
            "configuration": {
              "ApplicationName": "myapp",
              "DeploymentGroupName": "development"
            },
            "inputArtifacts": [
              {
                "name": "BuildOutput"
              }
            ]
          }
        ]
      },
      {
        "name": "ApprovalForStaging",
        "actions": [
          {
            "name": "ManualApproval",
            "actionTypeId": {
              "category": "Approval",
              "owner": "AWS",
              "provider": "Manual",
              "version": "1"
            },
            "configuration": {
              "CustomData": "Please review the development deployment and approve for staging"
            }
          }
        ]
      },
      {
        "name": "DeployToStaging",
        "actions": [
          {
            "name": "Deploy",
            "actionTypeId": {
              "category": "Deploy",
              "owner": "AWS",
              "provider": "CodeDeploy",
              "version": "1"
            },
            "configuration": {
              "ApplicationName": "myapp",
              "DeploymentGroupName": "staging"
            },
            "inputArtifacts": [
              {
                "name": "BuildOutput"
              }
            ]
          }
        ]
      },
      {
        "name": "StagingTests",
        "actions": [
          {
            "name": "IntegrationTests",
            "actionTypeId": {
              "category": "Build",
              "owner": "AWS",
              "provider": "CodeBuild",
              "version": "1"
            },
            "configuration": {
              "ProjectName": "myapp-integration-tests"
            },
            "inputArtifacts": [
              {
                "name": "SourceOutput"
              }
            ]
          }
        ]
      },
      {
        "name": "ApprovalForProduction",
        "actions": [
          {
            "name": "ManualApproval",
            "actionTypeId": {
              "category": "Approval",
              "owner": "AWS",
              "provider": "Manual",
              "version": "1"
            },
            "configuration": {
              "CustomData": "Please review staging tests and approve for production deployment"
            }
          }
        ]
      },
      {
        "name": "DeployToProduction",
        "actions": [
          {
            "name": "Deploy",
            "actionTypeId": {
              "category": "Deploy",
              "owner": "AWS",
              "provider": "CodeDeploy",
              "version": "1"
            },
            "configuration": {
              "ApplicationName": "myapp",
              "DeploymentGroupName": "production"
            },
            "inputArtifacts": [
              {
                "name": "BuildOutput"
              }
            ]
          }
        ]
      }
    ]
  }
}

Create the Pipeline

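aws codepipeline create-pipeline --cli-input-json file://pipeline-config.json

Note: the {{resolve:secretsmanager:...}} dynamic reference used for OAuthToken is resolved only by CloudFormation. When you create the pipeline directly with the CLI as shown here, put the actual token value into pipeline-config.json and keep that file out of version control.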

Step 7: Production Considerations

Monitoring and Alerting

Set up CloudWatch alarms for pipeline failures:

aws cloudwatch put-metric-alarm \
    --alarm-name "CodePipeline-Failure" \
    --alarm-description "Alert on pipeline failure" \
    --metric-name PipelineExecutionFailure \
    --namespace AWS/CodePipeline \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --dimensions Name=PipelineName,Value=myapp-production-pipeline \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:pipeline-alerts
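CodePipeline also emits execution state-change events to Amazon EventBridge, which gives a second way to be notified of failures. The sketch below builds a rule that forwards FAILED executions to the same SNS topic. The helper names and the rule name are assumptions, and the SNS topic's access policy must allow events.amazonaws.com to publish.

```shell
# failure_pattern: print an EventBridge event pattern that matches FAILED
# executions of one pipeline.
failure_pattern() {
    cat <<EOF
{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": {
    "state": ["FAILED"],
    "pipeline": ["$1"]
  }
}
EOF
}

# notify_on_failure PIPELINE_NAME SNS_TOPIC_ARN
# The rule name ("<pipeline>-failed") is an arbitrary choice for this sketch.
notify_on_failure() {
    aws events put-rule --name "$1-failed" \
        --event-pattern "$(failure_pattern "$1")"
    aws events put-targets --rule "$1-failed" \
        --targets "Id=sns,Arn=$2"
}

# notify_on_failure myapp-production-pipeline \
#     arn:aws:sns:us-east-1:123456789012:pipeline-alerts
```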

Rollback Strategy

Implement automatic rollback capabilities:

# In buildspec.yml, add rollback script generation
post_build:
  commands:
    - echo "Generating rollback script..."
    - |
      cat > rollback.sh << 'EOF'
      #!/bin/bash
      aws deploy stop-deployment --deployment-id "$1" --auto-rollback-enabled
      EOF
    - chmod +x rollback.sh

Security Best Practices

Use AWS Secrets Manager for sensitive configuration:

aws secretsmanager create-secret \
    --name myapp/production/database \
    --description "Production database credentials" \
    --secret-string '{"username":"admin","password":"securepassword"}'

Implement least privilege IAM policies
Enable AWS CloudTrail for audit logging
Use VPC endpoints for secure communication between services

Performance Optimization

Use CodeBuild cache to speed up builds:

# In buildspec.yml
cache:
  paths:
    - '/root/.npm/**/*'
    - 'node_modules/**/*'

Implement parallel deployments for multiple environments
Use CodeDeploy deployment configurations for optimized rollout strategies

Disaster Recovery

Cross-region artifact replication:

aws s3api put-bucket-replication \
    --bucket your-company-codepipeline-artifacts-bucket \
    --replication-configuration file://replication-config.json

Automated backup of deployment configurations
Multi-region deployment capabilities

Step 8: Testing the Pipeline

Initial Deployment

Push code to your GitHub repository
Monitor the pipeline execution in the AWS Console
Verify each stage completes successfully
Test the deployed application in each environment

Validate Blue/Green Deployment

Make a code change and push to repository
Approve the production deployment
Verify traffic switches to green environment
Confirm old blue instances are terminated

Troubleshooting Common Issues

CodeDeploy Agent Issues

# Check agent status
sudo service codedeploy-agent status

# View agent logs
sudo tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log

Permission Issues

Verify IAM roles have correct policies attached
Check S3 bucket policies allow pipeline access
Ensure EC2 instances have proper instance profiles

Deployment Failures

Review CodeDeploy deployment logs in CloudWatch
Check application logs on target instances
Verify health check endpoints are responding

Conclusion

This production-ready AWS release pipeline provides a robust foundation for enterprise deployments. Key benefits include:

Zero-downtime deployments through blue/green strategies
Multiple environment promotion with manual approvals
Comprehensive monitoring and alerting
Automated rollback capabilities
Security best practices implementation

Remember to regularly review and update your pipeline configuration, monitor performance metrics, and continuously improve your deployment processes based on team feedback and operational requirements.

The pipeline can be extended with additional features such as automated security scanning, performance testing, and integration with other AWS services as your requirements evolve.

admin | July 17, 2025 | AWS

admin | July 17, 2025

Mastering Ultra-Low Latency Systems: A Deep Dive into Bare-Metal Performance

In the world of high-frequency trading, real-time systems, and mission-critical applications, every nanosecond matters. This comprehensive guide explores the art and science of building ultra-low latency systems that push hardware to its absolute limits.

Understanding the Foundations

Ultra-low latency systems demand a holistic approach to performance optimization. We’re talking about achieving deterministic execution with sub-microsecond response times, zero packet loss, and minimal jitter. This requires deep control over every layer of the stack—from hardware configuration to kernel parameters.

Kernel Tuning and Real-Time Schedulers

The Linux kernel’s default configuration is designed for general-purpose computing, not deterministic real-time performance. Here’s how to transform it into a precision instrument.

Enabling Real-Time Kernel


# Install RT kernel
sudo apt-get install linux-image-rt-amd64 linux-headers-rt-amd64

# Verify RT kernel is active
uname -a | grep PREEMPT_RT

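# Set real-time FIFO priority 99 on a running process (chrt -p needs a PID;
# trading_engine is the example process name used later in this guide)
sudo chrt -f -p 99 "$(pgrep -o trading_engine)"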

Critical Kernel Parameters


# /etc/sysctl.conf - Core kernel tuning
kernel.sched_rt_runtime_us = -1
kernel.sched_rt_period_us = 1000000
vm.swappiness = 1
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
net.core.busy_read = 50
net.core.busy_poll = 50

Boot Parameters for Maximum Performance


# /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=2-15 nohz_full=2-15 rcu_nocbs=2-15 \
    intel_idle.max_cstate=0 processor.max_cstate=0 intel_pstate=disable \
    nosoftlockup nmi_watchdog=0 mce=off rcu_nocb_poll"
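
The three CPU-list parameters above (isolcpus, nohz_full, rcu_nocbs) are normally meant to name the same cores. The following is a minimal Python sketch for checking that on a command-line string. The helper names parse_cpu_list and isolation_sets are hypothetical, and the parser assumes plain numeric lists (no isolcpus flags such as "domain,"). On a live system the string could be read from /proc/cmdline.

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as '2-15' or '0,4-7' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolation_sets(cmdline):
    """Return the CPU sets named by isolcpus, nohz_full and rcu_nocbs."""
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    result = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            result[key] = parse_cpu_list(value)
    return result


cmdline = "isolcpus=2-15 nohz_full=2-15 rcu_nocbs=2-15 nosoftlockup"
sets = isolation_sets(cmdline)
# All three parameters should describe one and the same set of cores
consistent = len({frozenset(s) for s in sets.values()}) == 1
print(sets["isolcpus"] == set(range(2, 16)), consistent)  # True True
```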

CPU Affinity and IRQ Routing

Controlling where processes run and how interrupts are handled is crucial for consistent performance.

CPU Isolation and Affinity


# Check current CPU topology
lscpu --extended

# Bind process to specific CPU core
taskset -c 4 ./high_frequency_app

# Set CPU affinity for running process
taskset -cp 4-7 $(pgrep trading_engine)

# Verify affinity
taskset -p $(pgrep trading_engine)

IRQ Routing and Optimization


# View current IRQ assignments
cat /proc/interrupts

# Route network IRQ to specific CPU
echo 4 > /proc/irq/24/smp_affinity_list

# Disable IRQ balancing daemon
sudo service irqbalance stop
sudo systemctl disable irqbalance

# Manual IRQ distribution script
#!/bin/bash
for irq in $(grep eth0 /proc/interrupts | cut -d: -f1); do
    echo $((irq % 4 + 4)) > /proc/irq/$irq/smp_affinity_list
done
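
Before writing to /proc/irq, it can help to preview which CPU each IRQ would land on. This Python sketch mirrors the arithmetic of the shell loop above (irq % 4 + 4) on a text sample. The function name plan_irq_affinity and the sample /proc/interrupts excerpt are illustrative assumptions, not output from a real machine.

```python
def plan_irq_affinity(interrupts_text, device, first_cpu=4, cpu_count=4):
    """Map each IRQ whose /proc/interrupts line mentions `device` to a CPU.

    Mirrors the shell loop: cpu = irq % cpu_count + first_cpu.
    """
    plan = {}
    for line in interrupts_text.splitlines():
        if device not in line or ":" not in line:
            continue
        label = line.split(":", 1)[0].strip()
        if not label.isdigit():  # skip non-numeric rows such as NMI or LOC
            continue
        irq = int(label)
        plan[irq] = irq % cpu_count + first_cpu
    return plan


# Hypothetical excerpt of /proc/interrupts for a four-queue NIC
sample = """\
           CPU0       CPU1
 24:      10234          0   PCI-MSI 524288-edge      eth0-TxRx-0
 25:          0       9876   PCI-MSI 524289-edge      eth0-TxRx-1
 26:        512          0   PCI-MSI 524290-edge      eth0-TxRx-2
 27:          0        498   PCI-MSI 524291-edge      eth0-TxRx-3
 28:          3          0   PCI-MSI 1048576-edge     nvme0q0
"""

for irq, cpu in sorted(plan_irq_affinity(sample, "eth0").items()):
    print(f"IRQ {irq} -> CPU {cpu}")
```

With this sample, IRQs 24 through 27 map to CPUs 4 through 7, and the nvme0q0 line is ignored because it does not mention eth0.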

Network Stack Optimization

Network performance is often the bottleneck in ultra-low latency systems. Here’s how to optimize every layer.

TCP/IP Stack Tuning


# Network buffer optimization
echo 'net.core.rmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' >> /etc/sysctl.conf

# Reduce TCP overhead
echo 'net.ipv4.tcp_timestamps = 0' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_sack = 0' >> /etc/sysctl.conf
echo 'net.core.netdev_max_backlog = 30000' >> /etc/sysctl.conf

Network Interface Configuration


# Maximize ring buffer sizes
ethtool -G eth0 rx 4096 tx 4096

# Disable interrupt coalescing
ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

# Enable multiqueue
ethtool -L eth0 combined 8

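# Steer receive processing for queue rx-0 to CPU 1 via RPS
# (rps_cpus takes a hex CPU bitmask, so 2 selects CPU 1; this is not IRQ affinity)
echo 2 > /sys/class/net/eth0/queues/rx-0/rps_cpus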
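
Files such as rps_cpus and /proc/irq/N/smp_affinity take a hexadecimal bitmask with one bit per CPU, which is easy to get wrong by hand. The sketch below converts a list of CPU ids into that mask. The function name cpu_mask is hypothetical, and the output is a single hex word, so it does not add the comma-separated 32-bit grouping the kernel uses when printing masks on large systems.

```python
def cpu_mask(cpus):
    """Return the hex bitmask (without 0x prefix) that has one bit set per CPU id."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid CPU id: {cpu}")
        mask |= 1 << cpu  # bit N selects CPU N
    return format(mask, "x")


print(cpu_mask([1]))           # 2
print(cpu_mask([4, 5, 6, 7]))  # f0
print(cpu_mask([0, 2]))        # 5
```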

NUMA Policies and Memory Optimization

Non-Uniform Memory Access (NUMA) awareness is critical for consistent performance across multi-socket systems.

NUMA Configuration


# Check NUMA topology
numactl --hardware

# Run application on specific NUMA node
numactl --cpunodebind=0 --membind=0 ./trading_app

# Reserve 1024 huge pages of 2 MB each on NUMA node 0
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
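
The huge page count is worth sizing against the application's working set. As a rough sketch of the arithmetic (function names are hypothetical; 2048 kB is the page size from the sysfs path above):

```python
def reservation_bytes(nr_pages, page_size_kb=2048):
    """Memory pinned by reserving `nr_pages` huge pages of `page_size_kb` KiB."""
    return nr_pages * page_size_kb * 1024


def pages_needed(bytes_needed, page_size_kb=2048):
    """Smallest number of huge pages that covers `bytes_needed` (rounds up)."""
    page_bytes = page_size_kb * 1024
    return -(-bytes_needed // page_bytes)  # ceiling division


print(reservation_bytes(1024) // 2**20, "MiB")  # 2048 MiB
print(pages_needed(3 * 2**30))                  # 1536
```

So the 1024 pages reserved above pin 2 GiB on node 0, and a 3 GiB working set would need 1536 pages.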

Memory Allocator Optimization


# Configure transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Memory locking and preallocation
ulimit -l unlimited
echo 'vm.max_map_count = 262144' >> /etc/sysctl.conf

Kernel Bypass and DPDK

For ultimate performance, bypass the kernel networking stack entirely.

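DPDK (Data Plane Development Kit) lets applications drive the NIC directly from user space using poll-mode drivers, which removes system calls, interrupts, and kernel buffer copies from the packet path. Per-packet processing cost in the stack can fall from microseconds to hundreds of nanoseconds or less, although wire-to-application latency still depends on the NIC and the PCIe path.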

DPDK Setup


# Install DPDK
wget https://fast.dpdk.org/rel/dpdk-21.11.tar.xz
tar xf dpdk-21.11.tar.xz
cd dpdk-21.11
meson setup build
ninja -C build

# Bind NIC to DPDK driver (run from the DPDK source root; vfio-pci needs the IOMMU enabled)
sudo modprobe vfio-pci
sudo ./usertools/dpdk-devbind.py --bind=vfio-pci 0000:02:00.0

# Configure huge pages for DPDK
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

Conclusion

Building ultra-low latency systems requires expertise across hardware, kernel, and application layers. The techniques outlined here form the foundation for achieving deterministic performance in the most demanding environments. Remember: measure everything, question assumptions, and never accept “good enough” when nanoseconds matter.

The key to success is systematic optimization, rigorous testing, and continuous monitoring. Master these techniques, and you’ll be equipped to build systems that push the boundaries of what’s possible in real-time computing.

admin | July 17, 2025 | Kernel Stuff, Linux

admin | July 16, 2025

Building Production-Ready Release Pipelines in Azure: A Step-by-Step Guide Using ARM Templates

Creating enterprise-grade release pipelines in Azure requires a comprehensive understanding of Azure DevOps services, proper configuration, and adherence to production best practices. This detailed guide will walk you through building a robust CI/CD pipeline that deploys applications to Azure App Services with slot-based deployments for zero-downtime releases.

Architecture Overview

Our production pipeline will deploy a .NET web application to Azure App Service using deployment slots for blue/green deployments. The pipeline includes multiple environments (development, staging, production) with automated testing, security scanning, and manual approval gates.

Pipeline Flow:

Azure Repos → Build Pipeline (Azure Pipelines) → Dev Deployment → Automated Tests → Staging Deployment → Security Scan → Manual Approval → Production Deployment (Slot Swap) → Post-Deployment Monitoring

Prerequisites

Before starting, ensure you have:

Azure subscription with sufficient permissions
Azure DevOps organization and project
.NET application in Azure Repos (or GitHub)
Understanding of Azure Resource Manager (ARM) templates
Azure CLI installed locally

Understanding Azure Deployment Slots

Before diving into infrastructure setup, it’s crucial to understand Azure deployment slots – a key feature that enables zero-downtime deployments and advanced deployment strategies.

What Are Deployment Slots?

Azure App Service deployment slots are live instances of your web application with their own hostnames. Think of them as separate environments that share the same App Service plan but can run different versions of your application.

Production slot: Your main application (e.g., myapp.azurewebsites.net)
Staging slot: A separate instance (e.g., myapp-staging.azurewebsites.net)
Additional slots: Canary, testing, or feature-specific environments

Why Use Deployment Slots?

Zero-Downtime Deployment Process:
1. Deploy new version to staging slot
2. Test the staging slot thoroughly
3. Swap staging and production slots instantly
4. If issues arise, swap back immediately (rollback)
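
The reason rollback is immediate is that a swap only exchanges what the two slots serve, so the previous version is still sitting in the staging slot afterwards. A toy Python model of the four steps above makes this visible. It is not an Azure API, and the class name SlottedApp is hypothetical.

```python
class SlottedApp:
    """Toy model of an app with a production slot and a staging slot."""

    def __init__(self, production_version):
        self.slots = {"production": production_version, "staging": None}

    def deploy_to_staging(self, version):
        # Step 1: the new build only ever lands in the staging slot
        self.slots["staging"] = version

    def swap(self):
        # Steps 3 and 4: a swap exchanges what the two slots serve
        self.slots["production"], self.slots["staging"] = (
            self.slots["staging"],
            self.slots["production"],
        )


app = SlottedApp("v1")
app.deploy_to_staging("v2")
app.swap()
print(app.slots)  # {'production': 'v2', 'staging': 'v1'}
app.swap()        # rollback is just a second swap
print(app.slots)  # {'production': 'v1', 'staging': 'v2'}
```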

Key Benefits:

Zero-downtime deployments: Instant traffic switching
Blue/green deployments: Run two versions simultaneously
A/B testing: Route percentage of traffic to different versions
Warm-up validation: Test in production environment before going live
Quick rollbacks: Instant revert if problems occur

Important: Deployment slots are only available in Standard, Premium, and Isolated App Service plan tiers. They’re not available in Free or Basic tiers.
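
A small pre-flight check can catch an environment that requests slots on a tier that cannot host them. This Python sketch assumes the tier can be inferred from the first letter of common SKU names (F1, D1, B1, S1, P1V2, I1). That mapping and the function name supports_slots are assumptions for illustration, not an Azure API.

```python
# Assumed mapping from SKU name prefix to tier
TIER_BY_PREFIX = {
    "F": "Free",
    "D": "Shared",
    "B": "Basic",
    "S": "Standard",
    "P": "Premium",
    "I": "Isolated",
}
# Tiers that offer deployment slots, as stated above
SLOT_TIERS = {"Standard", "Premium", "Isolated"}


def supports_slots(sku_name):
    """Return True if the SKU's tier offers deployment slots."""
    tier = TIER_BY_PREFIX.get(sku_name[:1].upper())
    if tier is None:
        raise ValueError(f"unrecognised SKU name: {sku_name!r}")
    return tier in SLOT_TIERS


for sku in ("F1", "B1", "S1", "P1V2"):
    print(sku, supports_slots(sku))
```

With the SKUs used later in this guide, F1 (dev) reports False, while S1 (staging) and P1V2 (production) report True.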

Step 1: Infrastructure Setup – Choose Your Approach

Azure offers two primary Infrastructure as Code (IaC) approaches for managing resources including deployment slots:

ARM Templates/Bicep: Azure’s native IaC solution
Terraform: Multi-cloud infrastructure management tool

Option A: ARM Templates/Bicep (Recommended for Azure-only environments)

Create a Bicep file (infrastructure/main.bicep) for your infrastructure; Bicep compiles to an ARM template at deployment time:

param location string = resourceGroup().location
param environmentName string
param appServicePlanSku string = 'S1'

resource appServicePlan 'Microsoft.Web/serverfarms@2022-03-01' = {
  name: 'asp-myapp-${environmentName}'
  location: location
  sku: {
    // The tier follows from the SKU name (F1, S1, P1V2), so it is not hardcoded here
    name: appServicePlanSku
  }
  properties: {
    reserved: false
  }
}

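resource webApp 'Microsoft.Web/sites@2022-03-01' = {
  name: 'app-myapp-${environmentName}'
  location: location
  // System-assigned identity; the Key Vault access policy reads its principalId
  identity: {
    type: 'SystemAssigned'
  }
  properties: {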
    serverFarmId: appServicePlan.id
    httpsOnly: true
    siteConfig: {
      netFrameworkVersion: 'v6.0'
      defaultDocuments: [
        'Default.htm'
        'Default.html'
        'index.html'
      ]
      httpLoggingEnabled: true
      logsDirectorySizeLimit: 35
      detailedErrorLoggingEnabled: true
      appSettings: [
        {
          name: 'ASPNETCORE_ENVIRONMENT'
          value: environmentName
        }
        {
          name: 'ApplicationInsights__ConnectionString'
          value: applicationInsights.properties.ConnectionString
        }
      ]
    }
  }
}

// Create staging slot for production environment
resource stagingSlot 'Microsoft.Web/sites/slots@2022-03-01' = if (environmentName == 'prod') {
  parent: webApp
  name: 'staging'
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    httpsOnly: true
    siteConfig: {
      netFrameworkVersion: 'v6.0'
      appSettings: [
        {
          name: 'ASPNETCORE_ENVIRONMENT'
          value: 'Staging'
        }
        {
          name: 'ApplicationInsights__ConnectionString'
          value: applicationInsights.properties.ConnectionString
        }
      ]
    }
  }
}

resource applicationInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'ai-myapp-${environmentName}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    Request_Source: 'rest'
    RetentionInDays: 90
    WorkspaceResourceId: logAnalyticsWorkspace.id
  }
}

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-myapp-${environmentName}'
  location: location
  properties: {
    sku: {
      name: 'PerGB2018'
    }
    retentionInDays: 30
  }
}

resource keyVault 'Microsoft.KeyVault/vaults@2022-07-01' = {
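  // Vault names are limited to 24 characters, so only part of the 13-character hash is used
  name: 'kv-myapp-${environmentName}-${take(uniqueString(resourceGroup().id), 6)}'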
  location: location
  properties: {
    sku: {
      family: 'A'
      name: 'standard'
    }
    tenantId: subscription().tenantId
    accessPolicies: [
      {
        tenantId: subscription().tenantId
        objectId: webApp.identity.principalId
        permissions: {
          secrets: [
            'get'
            'list'
          ]
        }
      }
    ]
    enableRbacAuthorization: false
    enableSoftDelete: true
    softDeleteRetentionInDays: 7
  }
}

output webAppName string = webApp.name
output webAppUrl string = 'https://${webApp.properties.defaultHostName}'
output keyVaultName string = keyVault.name
output applicationInsightsKey string = applicationInsights.properties.InstrumentationKey
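
Key Vault names must be 3 to 24 characters long, start with a letter, end with a letter or digit, and contain only letters, digits, and non-consecutive hyphens. Because uniqueString() returns 13 characters, a generated name is worth checking before deployment. The Python sketch below does that check. The function name is_valid_vault_name and the placeholder suffix are hypothetical.

```python
import re

# First character a letter, last a letter or digit, 3-24 characters in total
_VAULT_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9-]{1,22}[A-Za-z0-9]$")


def is_valid_vault_name(name):
    """Check a candidate Key Vault name against the naming rules."""
    return bool(_VAULT_NAME.match(name)) and "--" not in name


suffix = "abcdefghijklm"  # stand-in for a 13-character uniqueString() result
for env in ("dev", "staging", "prod"):
    full = f"kv-myapp-{env}-{suffix}"
    short = f"kv-myapp-{env}-{suffix[:6]}"
    print(env, len(full), is_valid_vault_name(full), len(short), is_valid_vault_name(short))
```

For all three environments the full 13-character suffix produces a name longer than 24 characters, while a 6-character suffix keeps even the staging name at 23.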

Deploy Infrastructure

Create infrastructure deployment pipeline (infrastructure/azure-pipelines.yml):

trigger: none

variables:
  azureSubscription: 'MyAzureSubscription'
  resourceGroupPrefix: 'rg-myapp'
  location: 'East US'

stages:
- stage: DeployInfrastructure
  displayName: 'Deploy Infrastructure'
  jobs:
  - job: DeployDev
    displayName: 'Deploy Development Infrastructure'
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: AzureResourceManagerTemplateDeployment@3
      displayName: 'Deploy Development Resources'
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azureSubscription)'
        subscriptionId: '$(subscriptionId)'
        action: 'Create Or Update Resource Group'
        resourceGroupName: '$(resourceGroupPrefix)-dev'
        location: '$(location)'
        templateLocation: 'Linked artifact'
        csmFile: 'infrastructure/main.bicep'
        overrideParameters: |
          -environmentName "dev"
          -appServicePlanSku "F1"
        deploymentMode: 'Incremental'

  - job: DeployStaging
    displayName: 'Deploy Staging Infrastructure'
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: AzureResourceManagerTemplateDeployment@3
      displayName: 'Deploy Staging Resources'
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azureSubscription)'
        subscriptionId: '$(subscriptionId)'
        action: 'Create Or Update Resource Group'
        resourceGroupName: '$(resourceGroupPrefix)-staging'
        location: '$(location)'
        templateLocation: 'Linked artifact'
        csmFile: 'infrastructure/main.bicep'
        overrideParameters: |
          -environmentName "staging"
          -appServicePlanSku "S1"
        deploymentMode: 'Incremental'

  - job: DeployProduction
    displayName: 'Deploy Production Infrastructure'
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: AzureResourceManagerTemplateDeployment@3
      displayName: 'Deploy Production Resources'
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azureSubscription)'
        subscriptionId: '$(subscriptionId)'
        action: 'Create Or Update Resource Group'
        resourceGroupName: '$(resourceGroupPrefix)-prod'
        location: '$(location)'
        templateLocation: 'Linked artifact'
        csmFile: 'infrastructure/main.bicep'
        overrideParameters: |
          -environmentName "prod"
          -appServicePlanSku "P1V2"
        deploymentMode: 'Incremental'

Option B: Terraform (Recommended for multi-cloud or Terraform-experienced teams)

Alternatively, you can use Terraform to manage the same infrastructure. Here’s the equivalent Terraform configuration:

main.tf:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Resource Group
resource "azurerm_resource_group" "main" {
  name     = "rg-myapp-${var.environment_name}"
  location = var.location
}

# App Service Plan
resource "azurerm_service_plan" "main" {
  name                = "asp-myapp-${var.environment_name}"
  resource_group_name = azurerm_resource_group.main.name
  location           = azurerm_resource_group.main.location
  os_type            = "Windows"
  sku_name           = var.app_service_plan_sku
}

# Main Web App (Production Slot)
resource "azurerm_windows_web_app" "main" {
  name                = "app-myapp-${var.environment_name}"
  resource_group_name = azurerm_resource_group.main.name
  location           = azurerm_service_plan.main.location
  service_plan_id    = azurerm_service_plan.main.id

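  site_config {
    # Always On is not available on the Free (F1) tier used for dev
    always_on = var.app_service_plan_sku != "F1"

    application_stack {
      current_stack  = "dotnet"
      dotnet_version = "v6.0"
    }
  }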

  app_settings = {
    "ASPNETCORE_ENVIRONMENT" = title(var.environment_name)
    "ApplicationInsights__ConnectionString" = azurerm_application_insights.main.connection_string
  }

  identity {
    type = "SystemAssigned"
  }

  https_only = true
}

# Staging Deployment Slot (only for production environment)
resource "azurerm_windows_web_app_slot" "staging" {
  count           = var.environment_name == "prod" ? 1 : 0
  name            = "staging"
  app_service_id  = azurerm_windows_web_app.main.id

  site_config {
    always_on = true
    
    application_stack {
      current_stack  = "dotnet"
      dotnet_version = "v6.0"
    }
  }

  app_settings = {
    "ASPNETCORE_ENVIRONMENT" = "Staging"
    "ApplicationInsights__ConnectionString" = azurerm_application_insights.main.connection_string
  }

  identity {
    type = "SystemAssigned"
  }

  https_only = true
}

# Application Insights
resource "azurerm_application_insights" "main" {
  name                = "ai-myapp-${var.environment_name}"
  location           = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  application_type   = "web"
  retention_in_days  = 90
  workspace_id       = azurerm_log_analytics_workspace.main.id
}

# Log Analytics Workspace
resource "azurerm_log_analytics_workspace" "main" {
  name                = "log-myapp-${var.environment_name}"
  location           = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  sku                = "PerGB2018"
  retention_in_days  = 30
}

# Key Vault for secrets
resource "azurerm_key_vault" "main" {
  name                = "kv-myapp-${var.environment_name}-${random_string.suffix.result}"
  location           = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  tenant_id          = data.azurerm_client_config.current.tenant_id
  sku_name           = "standard"

  # Grant access to the web app's managed identity
  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = azurerm_windows_web_app.main.identity[0].principal_id

    secret_permissions = [
      "Get",
      "List",
    ]
  }

  # Grant access to staging slot if it exists
  dynamic "access_policy" {
    for_each = var.environment_name == "prod" ? [1] : []
    content {
      tenant_id = data.azurerm_client_config.current.tenant_id
      object_id = azurerm_windows_web_app_slot.staging[0].identity[0].principal_id

      secret_permissions = [
        "Get",
        "List",
      ]
    }
  }
}

resource "random_string" "suffix" {
  # Kept short so "kv-myapp-staging-<suffix>" stays within the 24-character vault name limit
  length  = 6
  special = false
  upper   = false
}

data "azurerm_client_config" "current" {}

variables.tf:

variable "environment_name" {
  description = "Environment name"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment_name)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "location" {
  description = "Azure region"
  type        = string
  default     = "East US"
}

variable "app_service_plan_sku" {
  description = "App Service Plan SKU"
  type        = string
  default     = "S1"
}

terraform.tfvars (for different environments):

# terraform.tfvars.prod
environment_name = "prod"
location = "East US"
app_service_plan_sku = "P1V2"  # Production tier supports deployment slots

# terraform.tfvars.staging  
environment_name = "staging"
location = "East US"
app_service_plan_sku = "S1"  # No slots needed for staging environment

# terraform.tfvars.dev
environment_name = "dev"
location = "East US"
app_service_plan_sku = "F1"  # Free tier, no slots available

Deploy with Terraform:

# Initialize Terraform
terraform init

# Plan deployment
terraform plan -var-file="terraform.tfvars.prod"

# Apply infrastructure
terraform apply -var-file="terraform.tfvars.prod" -auto-approve

ARM vs Terraform: Which Should You Choose?

Choose ARM Templates/Bicep if:

You’re working in a pure Azure environment
Your team is Azure-focused
You want native Azure tooling integration
You need immediate access to new Azure features

Choose Terraform if:

You have multi-cloud infrastructure
Your team has Terraform expertise
You want vendor-neutral infrastructure code
You need to manage non-Azure resources (DNS, monitoring tools, etc.)


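Application Settings

Create environment-specific configuration files. Note that App Service resolves @Microsoft.KeyVault(...) references only when they are set as app settings or connection strings on the App Service itself; inside appsettings.json the string is read literally. Treat the connection string values below as the values to configure on each App Service.

appsettings.Development.json: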

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://kv-myapp-dev.vault.azure.net/secrets/DatabaseConnectionString/)"
  },
  "ApplicationInsights": {
    "ConnectionString": ""
  }
}

appsettings.Staging.json:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://kv-myapp-staging.vault.azure.net/secrets/DatabaseConnectionString/)"
  },
  "ApplicationInsights": {
    "ConnectionString": ""
  }
}

appsettings.Production.json:

{
  "Logging": {
    "LogLevel": {
      "Default": "Warning",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "ConnectionStrings": {
    "DefaultConnection": "@Microsoft.KeyVault(SecretUri=https://kv-myapp-prod.vault.azure.net/secrets/DatabaseConnectionString/)"
  },
  "ApplicationInsights": {
    "ConnectionString": ""
  }
}

Health Check Configuration

Add health checks to your application (Program.cs):

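using HealthChecks.UI.Client;                        // UIResponseWriter (AspNetCore.HealthChecks.UI.Client package)
using Microsoft.AspNetCore.Diagnostics.HealthChecks; // HealthCheckOptions
using Microsoft.Extensions.Diagnostics.HealthChecks; // HealthCheckResult

var builder = WebApplication.CreateBuilder(args);

// Add services
builder.Services.AddControllers();
builder.Services.AddApplicationInsightsTelemetry();
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // AddSqlServer comes from the AspNetCore.HealthChecks.SqlServer package
    .AddSqlServer(
        builder.Configuration.GetConnectionString("DefaultConnection"),
        name: "database",
        // The "ready" tag is what the /health/ready endpoint filters on
        tags: new[] { "db", "sql", "sqlserver", "ready" });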

var app = builder.Build();

// Configure pipeline
if (!app.Environment.IsDevelopment())
{
    app.UseExceptionHandler("/Error");
    app.UseHsts();
}

app.UseHttpsRedirection();
app.UseStaticFiles();
app.UseRouting();
app.UseAuthorization();

app.MapControllers();
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.Run();
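
These endpoints are most useful when something polls them after a deployment. The following Python sketch is one way a post-deployment gate could wait for the readiness endpoint to return 200. The function names http_status and wait_until_healthy are hypothetical, and the retry counts are arbitrary defaults. ASP.NET Core health checks answer 503 when a check is unhealthy, which urllib surfaces as an HTTPError.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 when a check reports Unhealthy
    except (urllib.error.URLError, OSError):
        return None


def wait_until_healthy(url, attempts=10, delay=3.0, fetch=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)  # back off before the next probe
    return False
```

A release script might call wait_until_healthy("https://app-myapp-staging.azurewebsites.net/health/ready") and fail the run when it returns False. The fetch and sleep parameters exist so the loop can be exercised without a network.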

Step 3: Build Pipeline Configuration

Create the main build pipeline (azure-pipelines.yml):

trigger:
  branches:
    include:
    - main
    - develop
  paths:
    exclude:
    - infrastructure/*
    - docs/*
    - README.md

variables:
  buildConfiguration: 'Release'
  dotNetFramework: 'net6.0'
  dotNetVersion: '6.0.x'
  buildPlatform: 'Any CPU'

pool:
  vmImage: 'windows-latest'

stages:
- stage: Build
  displayName: 'Build and Test'
  jobs:
  - job: BuildJob
    displayName: 'Build Job'
    steps:
    - task: UseDotNet@2
      displayName: 'Use .NET Core SDK $(dotNetVersion)'
      inputs:
        packageType: 'sdk'
        version: '$(dotNetVersion)'

    - task: DotNetCoreCLI@2
      displayName: 'Restore NuGet packages'
      inputs:
        command: 'restore'
        projects: '**/*.csproj'
        feedsToUse: 'select'

    - task: DotNetCoreCLI@2
      displayName: 'Build application'
      inputs:
        command: 'build'
        projects: '**/*.csproj'
        arguments: '--configuration $(buildConfiguration) --no-restore'

    - task: DotNetCoreCLI@2
      displayName: 'Run unit tests'
      inputs:
        command: 'test'
        projects: '**/*Tests.csproj'
        arguments: '--configuration $(buildConfiguration) --no-build --collect:"XPlat Code Coverage" --logger trx --results-directory $(Common.TestResultsDirectory)'
        publishTestResults: true

    - task: PublishCodeCoverageResults@1
      displayName: 'Publish code coverage'
      inputs:
        codeCoverageTool: 'Cobertura'
        summaryFileLocation: '$(Common.TestResultsDirectory)/**/*.cobertura.xml'

    - task: DotNetCoreCLI@2
      displayName: 'Publish application'
      inputs:
        command: 'publish'
        projects: '**/*.csproj'
        arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)/app --no-build'
        publishWebProjects: true
        zipAfterPublish: true

    - task: PublishBuildArtifacts@1
      displayName: 'Publish build artifacts'
      inputs:
        pathToPublish: '$(Build.ArtifactStagingDirectory)'
        artifactName: 'drop'
        publishLocation: 'Container'

- stage: DeployDev
  displayName: 'Deploy to Development'
  dependsOn: Build
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/develop'))
  variables:
    environment: 'dev'
    resourceGroup: 'rg-myapp-dev'
    webAppName: 'app-myapp-dev'
  jobs:
  - deployment: DeployDev
    displayName: 'Deploy to Development'
    environment: 'Development'
    strategy:
      runOnce:
        deploy:
          steps:
          - template: templates/deploy-steps.yml
            parameters:
              environment: '$(environment)'
              resourceGroup: '$(resourceGroup)'
              webAppName: '$(webAppName)'
              useSlots: false

- stage: DeployStaging
  displayName: 'Deploy to Staging'
  dependsOn: Build
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  variables:
    environment: 'staging'
    resourceGroup: 'rg-myapp-staging'
    webAppName: 'app-myapp-staging'
  jobs:
  - deployment: DeployStaging
    displayName: 'Deploy to Staging'
    environment: 'Staging'
    strategy:
      runOnce:
        deploy:
          steps:
          - template: templates/deploy-steps.yml
            parameters:
              environment: '$(environment)'
              resourceGroup: '$(resourceGroup)'
              webAppName: '$(webAppName)'
              useSlots: false

  - job: StagingTests
    displayName: 'Run Staging Tests'
    dependsOn: DeployStaging
    pool:
      vmImage: 'windows-latest'
    steps:
    - task: DotNetCoreCLI@2
      displayName: 'Run integration tests'
      inputs:
        command: 'test'
        projects: '**/*IntegrationTests.csproj'
        arguments: '--configuration $(buildConfiguration) --logger trx --results-directory $(Common.TestResultsDirectory)'
        publishTestResults: true
      env:
        TEST_BASE_URL: 'https://app-myapp-staging.azurewebsites.net'

- stage: SecurityScan
  displayName: 'Security Scanning'
  dependsOn: DeployStaging
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - job: SecurityScan
    displayName: 'Security Scan'
    pool:
      vmImage: 'windows-latest'
    steps:
    - task: whitesource.ws-bolt.bolt.wss.WhiteSource Bolt@20
      displayName: 'WhiteSource Bolt'
      inputs:
        cwd: '$(System.DefaultWorkingDirectory)'

    - task: SonarCloudPrepare@1
      displayName: 'Prepare SonarCloud analysis'
      inputs:
        SonarCloud: 'SonarCloud'
        organization: 'your-organization'
        scannerMode: 'MSBuild'
        projectKey: 'myapp'
        projectName: 'MyApp'
        projectVersion: '$(Build.BuildNumber)'

    - task: DotNetCoreCLI@2
      displayName: 'Build for SonarCloud'
      inputs:
        command: 'build'
        projects: '**/*.csproj'
        arguments: '--configuration $(buildConfiguration)'

    - task: SonarCloudAnalyze@1
      displayName: 'Run SonarCloud analysis'

    - task: SonarCloudPublish@1
      displayName: 'Publish SonarCloud results'
      inputs:
        pollingTimeoutSec: '300'

- stage: ProductionApproval
  displayName: 'Production Approval'
  dependsOn: 
  - DeployStaging
  - SecurityScan
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - job: waitForValidation
    displayName: 'Wait for external validation'
    pool: server
    timeoutInMinutes: 4320 # 3 days
    steps:
    - task: ManualValidation@0
      displayName: 'Manual validation'
      inputs:
        notifyUsers: |
          admin@company.com
          devops@company.com
        instructions: 'Please validate the staging deployment and approve for production'
        onTimeout: 'reject'

- stage: DeployProduction
  displayName: 'Deploy to Production'
  dependsOn: ProductionApproval
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  variables:
    environment: 'prod'
    resourceGroup: 'rg-myapp-prod'
    webAppName: 'app-myapp-prod'
  jobs:
  - deployment: DeployProduction
    displayName: 'Deploy to Production'
    environment: 'Production'
    strategy:
      runOnce:
        deploy:
          steps:
          - template: templates/deploy-steps.yml
            parameters:
              environment: '$(environment)'
              resourceGroup: '$(resourceGroup)'
              webAppName: '$(webAppName)'
              useSlots: true

Step 4: Deployment Templates

Create reusable deployment templates (templates/deploy-steps.yml):

parameters:
- name: environment
  type: string
- name: resourceGroup
  type: string
- name: webAppName
  type: string
- name: useSlots
  type: boolean
  default: false

steps:
- download: current
  artifact: drop
  displayName: 'Download build artifacts'

- task: AzureKeyVault@2
  displayName: 'Get secrets from Key Vault'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    KeyVaultName: 'kv-myapp-${{ parameters.environment }}'
    SecretsFilter: '*'
    RunAsPreJob: false

- ${{ if eq(parameters.useSlots, true) }}:
  - task: AzureRmWebAppDeployment@4
    displayName: 'Deploy to staging slot'
    inputs:
      ConnectionType: 'AzureRM'
      azureSubscription: 'MyAzureSubscription'
      appType: 'webApp'
      WebAppName: '${{ parameters.webAppName }}'
      deployToSlotOrASE: true
      ResourceGroupName: '${{ parameters.resourceGroup }}'
      SlotName: 'staging'
      packageForLinux: '$(Pipeline.Workspace)/drop/app/*.zip'
      AppSettings: |
        -ASPNETCORE_ENVIRONMENT "${{ parameters.environment }}"
        -ApplicationInsights__ConnectionString "$(ApplicationInsights--ConnectionString)"
        -ConnectionStrings__DefaultConnection "$(DatabaseConnectionString)"

  - task: AzureAppServiceManage@0
    displayName: 'Start staging slot'
    inputs:
      azureSubscription: 'MyAzureSubscription'
      Action: 'Start Azure App Service'
      WebAppName: '${{ parameters.webAppName }}'
      SpecifySlotOrASE: true
      ResourceGroupName: '${{ parameters.resourceGroup }}'
      Slot: 'staging'

  - task: PowerShell@2
    displayName: 'Validate staging slot'
    inputs:
      targetType: 'inline'
      script: |
        $url = "https://${{ parameters.webAppName }}-staging.azurewebsites.net/health"
        Write-Host "Testing health endpoint: $url"
        
        $maxAttempts = 10
        $attempt = 0
        $success = $false
        
        while ($attempt -lt $maxAttempts -and -not $success) {
            try {
                $response = Invoke-RestMethod -Uri $url -Method Get -TimeoutSec 30
                if ($response) {
                    Write-Host "Health check passed!"
                    $success = $true
                } else {
                    Write-Host "Health check failed. Attempt $($attempt + 1) of $maxAttempts"
                }
            } catch {
                Write-Host "Error calling health endpoint: $($_.Exception.Message)"
            }
            
            if (-not $success) {
                Start-Sleep -Seconds 30
                $attempt++
            }
        }
        
        if (-not $success) {
            Write-Error "Health check failed after $maxAttempts attempts"
            exit 1
        }

  - task: AzureAppServiceManage@0
    displayName: 'Swap staging to production'
    inputs:
      azureSubscription: 'MyAzureSubscription'
      Action: 'Swap Slots'
      WebAppName: '${{ parameters.webAppName }}'
      ResourceGroupName: '${{ parameters.resourceGroup }}'
      SourceSlot: 'staging'
      SwapWithProduction: true

- ${{ if eq(parameters.useSlots, false) }}:
  - task: AzureRmWebAppDeployment@4
    displayName: 'Deploy to App Service'
    inputs:
      ConnectionType: 'AzureRM'
      azureSubscription: 'MyAzureSubscription'
      appType: 'webApp'
      WebAppName: '${{ parameters.webAppName }}'
      ResourceGroupName: '${{ parameters.resourceGroup }}'
      packageForLinux: '$(Pipeline.Workspace)/drop/app/*.zip'
      AppSettings: |
        -ASPNETCORE_ENVIRONMENT "${{ parameters.environment }}"
        -ApplicationInsights__ConnectionString "$(ApplicationInsights--ConnectionString)"
        -ConnectionStrings__DefaultConnection "$(DatabaseConnectionString)"

- task: PowerShell@2
  displayName: 'Post-deployment validation'
  inputs:
    targetType: 'inline'
    script: |
      $url = "https://${{ parameters.webAppName }}.azurewebsites.net/health"
      Write-Host "Testing production health endpoint: $url"
      
      $maxAttempts = 5
      $attempt = 0
      $success = $false
      
      while ($attempt -lt $maxAttempts -and -not $success) {
          try {
              $response = Invoke-RestMethod -Uri $url -Method Get -TimeoutSec 30
              if ($response) {
                  Write-Host "Production health check passed!"
                  $success = $true
              }
          } catch {
              Write-Host "Error calling production health endpoint: $($_.Exception.Message)"
          }
          
          if (-not $success) {
              Start-Sleep -Seconds 15
              $attempt++
          }
      }
      
      if (-not $success) {
          Write-Error "Production health check failed after $maxAttempts attempts"
          exit 1
      }

- task: AzureCLI@2
  displayName: 'Configure monitoring alerts'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    scriptType: 'ps'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Create action group for alerts
      az monitor action-group create `
        --name "myapp-alerts" `
        --resource-group "${{ parameters.resourceGroup }}" `
        --short-name "MyAppAlert" `
        --email-receivers name="DevOps Team" email="devops@company.com"
      
      # Create availability alert
      az monitor metrics alert create `
        --name "myapp-availability-alert" `
        --resource-group "${{ parameters.resourceGroup }}" `
        --scopes "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/${{ parameters.resourceGroup }}/providers/Microsoft.Web/sites/${{ parameters.webAppName }}" `
        --condition "avg Availability < 99" `
        --description "Alert when availability drops below 99%" `
        --evaluation-frequency 1m `
        --window-size 5m `
        --severity 2 `
        --action-groups "/subscriptions/$(az account show --query id -o tsv)/resourceGroups/${{ parameters.resourceGroup }}/providers/microsoft.insights/actionGroups/myapp-alerts"
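
Both health-check steps in the template follow the same retry-with-delay pattern. As a rough shell sketch of that pattern (retry_until is a hypothetical helper name, not part of Azure DevOps or of the template above, and the URL in the usage comment simply assumes the same /health endpoint):

```shell
#!/usr/bin/env bash
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "Succeeded on attempt $attempt of $max_attempts"
      return 0
    fi
    echo "Attempt $attempt of $max_attempts failed" >&2
    # Sleep only when another attempt is still to come
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (not executed here):
#   retry_until 10 30 curl --fail --silent --max-time 30 \
#     "https://app-myapp-prod-staging.azurewebsites.net/health"
```

Passing the probe as a command keeps the loop reusable for the staging-slot check and the production check, which differ only in URL, attempt count, and delay.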

Step 5: Variable Groups and Environments

Create Variable Groups

In Azure DevOps, create variable groups for each environment:

Development Variables:

Environment: Development
DatabaseConnectionString: (linked to Key Vault)
ApplicationInsights.ConnectionString: (from deployment output)

Staging Variables:

Environment: Staging
DatabaseConnectionString: (linked to Key Vault)
ApplicationInsights.ConnectionString: (from deployment output)

Production Variables:

Environment: Production
DatabaseConnectionString: (linked to Key Vault)
ApplicationInsights.ConnectionString: (from deployment output)

Configure Environments

Create environments in Azure DevOps with appropriate approvals and checks:

Development: Auto-approval
Staging: Auto-approval with branch protection (main only)
Production: Manual approval required with 2-person approval policy

Step 6: Advanced Production Features

Blue/Green Deployment with Traffic Splitting

Add traffic splitting configuration:

- task: AzureCLI@2
  displayName: 'Configure traffic routing (10% to staging)'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    scriptType: 'ps'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Route 10% of production traffic to the staging slot.
      # AzureAppServiceManage@0 has no traffic-percentage input, so use the CLI.
      az webapp traffic-routing set `
        --name "${{ parameters.webAppName }}" `
        --resource-group "${{ parameters.resourceGroup }}" `
        --distribution staging=10

- task: PowerShell@2
  displayName: 'Monitor metrics during canary deployment'
  inputs:
    targetType: 'inline'
    script: |
      # Monitor for 10 minutes
      $endTime = (Get-Date).AddMinutes(10)
      
      while ((Get-Date) -lt $endTime) {
          # Check error rate, response time, etc.
          # Placeholder value: replace with a real Application Insights query
          # that returns the failed-request ratio (0.0 to 1.0).
          $errorRate = 0.0
          
          if ($errorRate -gt 0.05) {  # 5% error threshold
              Write-Error "High error rate detected: $errorRate"
              exit 1
          }
          
          Start-Sleep -Seconds 60
      }

- task: AzureAppServiceManage@0
  displayName: 'Complete swap to production'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    Action: 'Swap Slots'
    WebAppName: '${{ parameters.webAppName }}'
    ResourceGroupName: '${{ parameters.resourceGroup }}'
    SourceSlot: 'staging'
    SwapWithProduction: true

Automated Rollback

Implement automated rollback capabilities:

- task: AzureCLI@2
  displayName: 'Monitor post-deployment metrics'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    scriptType: 'ps'
    scriptLocation: 'inlineScript'
    inlineScript: |
      $monitoringDuration = 300  # 5 minutes
      $checkInterval = 30        # 30 seconds
      $endTime = (Get-Date).AddSeconds($monitoringDuration)
      
      while ((Get-Date) -lt $endTime) {
          try {
              # Check health endpoint
              $healthResponse = Invoke-RestMethod -Uri "https://${{ parameters.webAppName }}.azurewebsites.net/health" -TimeoutSec 10
              
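              # Check Application Insights metrics.
              # Placeholder values: replace each with a real Application Insights query.
              $errorRate = 0.0      # failed-request ratio, 0.0 to 1.0
              $responseTime = 0     # average response time in ms
              
              if ($errorRate -gt 0.05 -or $responseTime -gt 2000) {
                  # Write-Warning rather than Write-Error: pipeline tasks default to
                  # $ErrorActionPreference = 'Stop', so Write-Error would throw into
                  # the catch block below and the rollback would never run.
                  Write-Warning "Performance degradation detected. Initiating rollback..."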
                  
                  # Trigger rollback
                  az webapp deployment slot swap --name "${{ parameters.webAppName }}" --resource-group "${{ parameters.resourceGroup }}" --slot staging --target-slot production
                  
                  exit 1
              }
              
              Write-Host "Metrics within acceptable range. Error rate: $errorRate, Response time: $responseTime ms"
              
          } catch {
              Write-Warning "Error checking metrics: $($_.Exception.Message)"
          }
          
          Start-Sleep -Seconds $checkInterval
      }
      
      Write-Host "Post-deployment monitoring completed successfully"
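
The rollback decision in the script above reduces to a two-threshold test. A minimal shell sketch of just that test follows; should_rollback is a hypothetical name, the thresholds are the 5% error rate and 2000 ms response time used above, and awk is used because the shell's own test operators only compare integers:

```shell
# should_rollback ERROR_RATE RESPONSE_MS
# Exit status 0 (true) when either metric breaches its threshold:
# error rate above 0.05 (5%) or average response time above 2000 ms.
should_rollback() {
  awk -v err="$1" -v rt="$2" \
    'BEGIN { exit !((err + 0 > 0.05) || (rt + 0 > 2000)) }'
}

# Example (not executed here):
#   if should_rollback "$error_rate" "$response_ms"; then
#     echo "Performance degradation detected, swapping slots back"
#   fi
```

Keeping the comparison in one small function makes the thresholds easy to unit-test separately from the monitoring loop and the slot swap.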

Database Migration Pipeline

Create a separate pipeline for database migrations:

# database-migration-pipeline.yml
trigger: none

parameters:
- name: environment
  displayName: Environment
  type: string
  default: staging
  values:
  - staging
  - production

variables:
  environment: ${{ parameters.environment }}

stages:
- stage: DatabaseMigration
  displayName: 'Database Migration - $(environment)'
  jobs:
  - job: Migration
    displayName: 'Run Database Migration'
    pool:
      vmImage: 'windows-latest'
    steps:
    - task: UseDotNet@2
      inputs:
        packageType: 'sdk'
        version: '6.0.x'

    - task: AzureKeyVault@2
      inputs:
        azureSubscription: 'MyAzureSubscription'
        KeyVaultName: 'kv-myapp-$(environment)'
        SecretsFilter: 'DatabaseConnectionString'

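    - script: dotnet tool install --global dotnet-ef --version 6.*
      displayName: 'Install EF Core CLI (dotnet-ef is not bundled with the SDK)'

    - task: DotNetCoreCLI@2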
      displayName: 'Run EF Migrations'
      inputs:
        command: 'custom'
        custom: 'ef'
        arguments: 'database update --connection "$(DatabaseConnectionString)" --project MyApp.Data --startup-project MyApp.Web'
      env:
        ConnectionStrings__DefaultConnection: '$(DatabaseConnectionString)'

    - task: PowerShell@2
      displayName: 'Verify Migration'
      inputs:
        targetType: 'inline'
        script: |
          # Run verification queries to ensure migration succeeded
          # This could include checking table structure, data integrity, etc.

Step 7: Monitoring and Observability

Application Insights Integration

Configure detailed monitoring:

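// In Program.cs
using System.Reflection;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;
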
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

builder.Services.AddSingleton<ITelemetryInitializer, CustomTelemetryInitializer>();

// Custom telemetry initializer
public class CustomTelemetryInitializer : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        if (telemetry is RequestTelemetry requestTelemetry)
        {
            requestTelemetry.Properties["Environment"] = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT");
            requestTelemetry.Properties["Version"] = Assembly.GetExecutingAssembly().GetName().Version?.ToString();
        }
    }
}

Dashboard Creation

Create Azure Dashboard for monitoring:

{
  "properties": {
    "lenses": [
      {
        "order": 0,
        "parts": [
          {
            "position": { "x": 0, "y": 0, "rowSpan": 4, "colSpan": 6 },
            "metadata": {
              "inputs": [
                {
                  "name": "ComponentId",
                  "value": "/subscriptions/{subscription-id}/resourceGroups/rg-myapp-prod/providers/microsoft.insights/components/ai-myapp-prod"
                }
              ],
              "type": "Extension/AppInsightsExtension/PartType/AvailabilityNavButtonPart"
            }
          },
          {
            "position": { "x": 6, "y": 0, "rowSpan": 4, "colSpan": 6 },
            "metadata": {
              "inputs": [
                {
                  "name": "ComponentId", 
                  "value": "/subscriptions/{subscription-id}/resourceGroups/rg-myapp-prod/providers/microsoft.insights/components/ai-myapp-prod"
                }
              ],
              "type": "Extension/AppInsightsExtension/PartType/PerformanceNavButtonPart"
            }
          }
        ]
      }
    ]
  }
}

Step 8: Security and Compliance

Secure Configuration Management

- task: AzureKeyVault@2
  displayName: 'Get secrets from Key Vault'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    KeyVaultName: 'kv-myapp-$(environment)'
    SecretsFilter: |
      DatabaseConnectionString
      ApiKey
      JwtSecret
    RunAsPreJob: true

- task: FileTransform@1
  displayName: 'Transform configuration files'
  inputs:
    folderPath: '$(Pipeline.Workspace)/drop/app'
    fileType: 'json'
    targetFiles: '**/appsettings.json'

Compliance Scanning

Add compliance checks to your pipeline:

- task: ms-codeanalysis.vss-microsoft-security-code-analysis-devops.build-task-credscan.CredScan@2
  displayName: 'Run Credential Scanner'
  inputs:
    toolMajorVersion: 'V2'
    scanFolder: '$(Build.SourcesDirectory)'
    debugMode: false

- task: ms-codeanalysis.vss-microsoft-security-code-analysis-devops.build-task-binskim.BinSkim@3
  displayName: 'Run BinSkim'
  inputs:
    InputType: 'Basic'
    Function: 'analyze'
    AnalyzeTarget: '$(Build.ArtifactStagingDirectory)/**/*.dll;$(Build.ArtifactStagingDirectory)/**/*.exe'

- task: ms-codeanalysis.vss-microsoft-security-code-analysis-devops.build-task-postanalysis.PostAnalysis@1
  displayName: 'Post Analysis'
  inputs:
    AllTools: false
    BinSkim: true
    CredScan: true
    ToolLogsNotFoundAction: 'Standard'

Step 9: Performance Testing

Add performance testing stage:

- stage: PerformanceTesting
  displayName: 'Performance Testing'
  dependsOn: DeployStaging
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
  - job: LoadTest
    displayName: 'Run Load Tests'
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: AzureLoadTest@1
      displayName: 'Azure Load Testing'
      inputs:
        azureSubscription: 'MyAzureSubscription'
        loadTestConfigFile: 'loadtest/config.yaml'
        loadTestResource: 'loadtest-myapp'
        resourceGroup: 'rg-myapp-shared'
        env: |
          [
            {
              "name": "webapp-url",
              "value": "https://app-myapp-staging.azurewebsites.net"
            }
          ]

    - task: PublishTestResults@2
      displayName: 'Publish load test results'
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: '$(System.DefaultWorkingDirectory)/**/*loadtest-results.xml'
        failTaskOnFailedTests: true

Step 10: Disaster Recovery and Backup

Automated Backup Configuration

- task: AzureCLI@2
  displayName: 'Configure backup policy'
  inputs:
    azureSubscription: 'MyAzureSubscription'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # Create storage account for backups
      az storage account create \
        --name "stmyappbackup$(environment)" \
        --resource-group "$(resourceGroup)" \
        --location "East US" \
        --sku "Standard_LRS"
      
      # Configure app service backup.
      # --container-url must be a SAS URL for a blob container in that account,
      # not a connection string. BACKUP_CONTAINER_SAS_URL is a hypothetical
      # secret pipeline variable assumed to hold that URL.
      az webapp config backup update \
        --resource-group "$(resourceGroup)" \
        --webapp-name "$(webAppName)" \
        --container-url "$(BACKUP_CONTAINER_SAS_URL)" \
        --frequency 1d \
        --retain-one true \
        --retention 30

Conclusion

This comprehensive Azure DevOps pipeline provides enterprise-grade capabilities including:

Infrastructure as Code with Bicep templates
Multi-environment deployments with appropriate gates
Zero-downtime deployments using slot swaps
Automated testing at multiple stages
Security scanning and compliance checks
Performance testing integration
Monitoring and alerting setup
Automated rollback capabilities
Disaster recovery configurations

Staged gates, slot swaps, and health checks reduce the risk of each release, while Key Vault integration and the scanning stage keep secrets and dependencies under control. Revisit thresholds, timeouts, and approval rules as operational feedback accumulates so the pipeline stays effective in production.

Key benefits of this approach include reduced deployment risk, faster time-to-market, improved application quality, and enhanced operational visibility across the entire deployment lifecycle.

admin | July 16, 2025 | Azure

admin | July 15, 2025

Deploying SLURM with Slinky: Bridging HPC and Kubernetes for Container Workloads

High-Performance Computing (HPC) environments are evolving rapidly, and the need to integrate traditional HPC job schedulers with modern containerized infrastructure has never been greater. Enter Slinky – SchedMD’s official project that seamlessly integrates SLURM with Kubernetes, enabling you to run containerized workloads through SLURM’s powerful scheduling capabilities.

In this comprehensive guide, we’ll walk through deploying SLURM using Slinky with Docker container support, bringing together the best of both HPC and cloud-native worlds.

What is Slinky?

Slinky is a toolbox of components developed by SchedMD (the creators of SLURM) to integrate SLURM with Kubernetes. Unlike traditional approaches that force users to change how they interact with SLURM, Slinky preserves the familiar SLURM user experience while adding powerful container orchestration capabilities.

Key Components:

Slurm Operator – Manages SLURM clusters as Kubernetes resources
Container Support – Native OCI container execution through SLURM
Auto-scaling – Dynamic resource allocation based on workload demand
Slurm Bridge – Converged workload scheduling and prioritization

Why Slinky Matters: Slinky enables simultaneous management of HPC workloads using SLURM and containerized applications via Kubernetes on the same infrastructure, making it ideal for organizations running AI/ML training, scientific simulations, and cloud-native applications.

Prerequisites and Environment Setup

Before we begin, ensure you have a working Kubernetes cluster with the following requirements:

Kubernetes 1.24+ cluster with admin access
Helm 3.x installed
kubectl configured and connected to your cluster
Sufficient cluster resources (minimum 4 CPU cores, 8GB RAM)
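
The version requirements above can be checked before installing anything. The sketch below compares only the major.minor parts of a version string; version_at_least is a hypothetical helper, and how you obtain the cluster version is left to your environment:

```shell
# version_at_least ACTUAL REQUIRED
# Compares the major.minor parts of two versions such as "v1.29.3" and "1.24".
# Succeeds when ACTUAL is the same as or newer than REQUIRED.
version_at_least() {
  awk -v a="${1#v}" -v r="${2#v}" 'BEGIN {
    split(a, av, "."); split(r, rv, ".")
    # Different major versions decide the result on their own
    if (av[1] + 0 != rv[1] + 0) exit !(av[1] + 0 > rv[1] + 0)
    # Same major version: compare minor versions numerically
    exit !(av[2] + 0 >= rv[2] + 0)
  }'
}

# Example (not executed here):
#   version_at_least "$(helm version --short)" 3.0 || echo "Helm 3.x is required"
```

The comparison is numeric rather than lexical, so 1.9 is correctly treated as older than 1.24.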

Step 1: Install Required Dependencies

Slinky requires several prerequisite components. Let’s install them using Helm:

# Add required Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager for TLS certificate management
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true

# Install Prometheus stack for monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true

Wait for all pods to be running before proceeding:

# Verify installations
kubectl get pods -n cert-manager
kubectl get pods -n prometheus

Step 2: Deploy the Slinky SLURM Operator

Now we’ll install the core Slinky operator that manages SLURM clusters within Kubernetes:

# Download the default configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml

# Install the Slurm Operator
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.2.1 \
  --namespace=slinky --create-namespace

Verify the operator is running:

kubectl get pods -n slinky
# Expected output: slurm-operator pod in Running status

Step 3: Configure Container Support

Before deploying the SLURM cluster, let’s configure it for container support. Download and modify the SLURM configuration:

# Download SLURM cluster configuration
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml

Edit values-slurm.yaml to enable container support:

# Add container configuration to values-slurm.yaml
controller:
  config:
    slurm.conf: |
      # Basic cluster configuration
      ClusterName=slinky-cluster
      ControlMachine=slurm-controller-0
      
      # Enable container support
      ProctrackType=proctrack/cgroup
      TaskPlugin=task/cgroup,task/affinity
      PluginDir=/usr/lib64/slurm
      
      # Authentication
      AuthType=auth/munge
      
      # Node configuration
      NodeName=slurm-compute-debug-[0-9] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
      PartitionName=debug Nodes=slurm-compute-debug-[0-9] Default=YES MaxTime=INFINITE State=UP
      
      # Accounting
      AccountingStorageType=accounting_storage/slurmdbd
      AccountingStorageHost=slurm-accounting-0

compute:
  config:
    oci.conf: |
      # OCI container runtime configuration
      RunTimeQuery="runc --version"
      RunTimeCreate="runc create %n.%u %b"
      RunTimeStart="runc start %n.%u"
      RunTimeKill="runc kill --all %n.%u SIGTERM"
      RunTimeDelete="runc delete --force %n.%u"
      
      # Security and patterns
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"
      RunTimeEnvExclude="HOME,PATH,LD_LIBRARY_PATH"
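
The NodeName line in slurm.conf above has to be self-consistent: CPUs should equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore (here 1 x 1 x 2 x 2 = 4). A small sketch that checks a NodeName line for this follows; check_node_cpus is a hypothetical helper and it assumes all five keys are present on the line:

```shell
# check_node_cpus 'NodeName=... CPUs=... Boards=... SocketsPerBoard=... ...'
# Succeeds when CPUs matches the product of the topology fields.
check_node_cpus() {
  printf '%s\n' "$1" | awk '{
    # Collect every key=value pair on the line
    for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
    expected = v["Boards"] * v["SocketsPerBoard"] * v["CoresPerSocket"] * v["ThreadsPerCore"]
    if (v["CPUs"] + 0 == expected) { print "ok: CPUs=" v["CPUs"]; exit 0 }
    print "mismatch: CPUs=" v["CPUs"] " but topology gives " expected
    exit 1
  }'
}

# Example (not executed here):
#   check_node_cpus "$(grep '^NodeName=' /etc/slurm/slurm.conf | head -n 1)"
```

Running a check like this before helm install can catch topology typos that would otherwise surface only after the compute pods register with the controller.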

Step 4: Deploy the SLURM Cluster

Now deploy the SLURM cluster with container support enabled:

# Deploy SLURM cluster
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm --create-namespace

Monitor the deployment progress:

# Watch pods come online
kubectl get pods -n slurm -w

# Expected pods:
# slurm-accounting-0      1/1     Running
# slurm-compute-debug-0   1/1     Running  
# slurm-controller-0      2/2     Running
# slurm-exporter-xxx      1/1     Running
# slurm-login-xxx         1/1     Running
# slurm-mariadb-0         1/1     Running
# slurm-restapi-xxx       1/1     Running

Step 5: Access and Test the SLURM Cluster

Once all pods are running, connect to the SLURM login node:

# Get login node IP address
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"

# SSH to login node (default port 2222)
ssh -p 2222 root@${SLURM_LOGIN_IP}

If you don’t have LoadBalancer support, use port-forwarding:

# Port forward to login pod
kubectl port-forward -n slurm service/slurm-login 2222:2222

# Connect via localhost
ssh -p 2222 root@localhost

Step 6: Running Container Jobs

Now for the exciting part – running containerized workloads through SLURM!

Basic Container Job

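Create a simple container job script. Note that stock Slurm's --container option takes the path to an unpacked OCI bundle, so the docker:// image references used in the examples below assume your deployment adds image-pulling support on top of that: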

# Create a container job script
cat > container_test.sh << EOF
#!/bin/bash
#SBATCH --job-name=container-hello
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --container=docker://alpine:latest

echo "Hello from containerized SLURM job!"
echo "Running on node: \$(hostname)"
echo "Job ID: \$SLURM_JOB_ID"
echo "Container OS: \$(cat /etc/os-release | grep PRETTY_NAME)"
EOF

# Submit the job
sbatch container_test.sh

# Check job status
squeue
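
When submissions are scripted, it helps to capture the job ID instead of reading it off the terminal. The sketch below parses sbatch output; job_id_from_sbatch is a hypothetical helper, and it handles both the default "Submitted batch job N" message and the output of sbatch --parsable:

```shell
# job_id_from_sbatch: extracts the numeric job ID from sbatch output on stdin.
# Handles "Submitted batch job 42" and the --parsable forms "42" or "42;cluster".
# Prints nothing if no job ID is found.
job_id_from_sbatch() {
  awk '
    /^Submitted batch job [0-9]+/ { print $4; exit }
    /^[0-9]+(;.*)?$/ { split($0, parts, ";"); print parts[1]; exit }
  '
}

# Example (not executed here):
#   job_id=$(sbatch container_test.sh | job_id_from_sbatch)
#   squeue -j "$job_id"
#   cat "slurm-${job_id}.out"   # default output file name for batch jobs
```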

Interactive Container Sessions

Run containers interactively using srun:

# Interactive Ubuntu container
srun --container=docker://ubuntu:20.04 /bin/bash

# Quick command in Alpine container
srun --container=docker://alpine:latest /bin/sh -c "echo 'Container execution successful'; uname -a"

# Python data science container
srun --container=docker://python:3.9 python -c "import sys; print(f'Python {sys.version} running in container')"

GPU Container Jobs

If your cluster has GPU nodes, you can run GPU-accelerated containers:

# GPU container job
cat > gpu_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:1
#SBATCH --container=docker://nvidia/cuda:11.0-runtime-ubuntu20.04

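nvidia-smi
# nvcc ships only in the CUDA "devel" images, so run it only when present
if command -v nvcc >/dev/null 2>&1; then nvcc --version; fi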
EOF

sbatch gpu_container.sh

MPI Container Jobs

Run parallel MPI applications in containers:

# MPI container job
cat > mpi_container.sh << EOF
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --ntasks=4
#SBATCH --container=docker://mpirun/openmpi:latest

mpirun -np \$SLURM_NTASKS hostname
EOF

sbatch mpi_container.sh

Step 7: Monitoring and Auto-scaling

Monitor Cluster Health

Check SLURM cluster status from the login node:

# Check node status
sinfo

# Check running jobs
squeue

# Check cluster configuration
scontrol show config | grep -i container
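
For scripted health checks, sinfo output can be reduced to a single number per node state. This sketch assumes the input comes from sinfo -h -o "%T %D" (state, then node count, no header); nodes_in_state is a hypothetical helper:

```shell
# nodes_in_state STATE: reads `sinfo -h -o "%T %D"` output on stdin
# (one "state node-count" pair per line) and prints the total node count
# for STATE. The match is exact, so a flagged state such as "idle*"
# is counted separately from "idle".
nodes_in_state() {
  awk -v want="$1" '$1 == want { total += $2 } END { print total + 0 }'
}

# Example (not executed here):
#   idle=$(sinfo -h -o "%T %D" | nodes_in_state idle)
#   [ "$idle" -gt 0 ] || echo "no idle nodes available"
```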

Kubernetes Monitoring

Monitor from the Kubernetes side:

# Check pod resource usage
kubectl top pods -n slurm

# View SLURM operator logs
kubectl logs -n slinky deployment/slurm-operator

# Check custom resources
kubectl get clusters.slinky.slurm.net -n slurm
kubectl get nodesets.slinky.slurm.net -n slurm

Configure Auto-scaling

Enable auto-scaling by updating your values file:

# Add to values-slurm.yaml
compute:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

# Update the deployment
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.2.1 \
  --namespace=slurm
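
To get a feel for how these settings interact, assume the autoscaler behaves like a standard Kubernetes HorizontalPodAutoscaler, which scales to ceil(currentReplicas x currentUtilization / targetUtilization) and clamps the result to the min/max range. That behaviour is an assumption about this chart, and the sketch ignores the HPA's tolerance band and stabilization windows; desired_replicas is a hypothetical helper:

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_CPU_PERCENT TARGET_CPU_PERCENT MIN MAX
# Prints the replica count an HPA-style controller would aim for.
desired_replicas() {
  local current=$1 usage=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * usage / target
  local desired=$(( (current * usage + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# Example with the values above (target 70%, min 1, max 10):
#   desired_replicas 2 90 70 1 10
```

With two compute pods at 90% CPU the formula asks for a third pod, while eight pods at 140% would be capped at maxReplicas.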

Advanced Configuration Tips

Custom Container Runtimes

Configure alternative container runtimes like Podman:

# Alternative oci.conf for Podman
compute:
  config:
    oci.conf: |
      # Podman runtime configuration
      RunTimeQuery="podman --version"
      RunTimeRun="podman run --rm --cgroups=disabled --name=%n.%u %m %c"
      
      # Security settings
      OCIPattern="^[a-zA-Z0-9][a-zA-Z0-9_.-]*$"
      CreateEnvFile="/tmp/slurm-oci-create-env-%j.%u.%t.tmp"

Persistent Storage for Containers

Configure persistent volumes for containerized jobs:

# Add persistent volume support
compute:
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: "100Gi"
    mountPath: "/shared"

Troubleshooting Common Issues

Container Runtime Not Found

If you encounter container runtime errors:

# Check runtime availability on compute nodes
kubectl exec -n slurm slurm-compute-debug-0 -- which runc
kubectl exec -n slurm slurm-compute-debug-0 -- runc --version

# Verify oci.conf is properly mounted
kubectl exec -n slurm slurm-compute-debug-0 -- cat /etc/slurm/oci.conf

Job Submission Failures

Debug job submission issues:

# Check SLURM logs
kubectl logs -n slurm slurm-controller-0 -c slurmctld

# Verify container image availability
srun --container=docker://alpine:latest /bin/echo "Container test"

# Check job details
scontrol show job

Conclusion

Slinky represents a significant step forward in bridging the gap between traditional HPC and modern cloud-native infrastructure. By deploying SLURM with Slinky, you get:

Unified Infrastructure - Run both SLURM and Kubernetes workloads on the same cluster
Container Support - Native OCI container execution through familiar SLURM commands
Auto-scaling - Dynamic resource allocation based on workload demand
Cloud Native - Standard Kubernetes deployment and management patterns
Preserved Workflow - Keep existing SLURM scripts and user experience

This powerful combination enables organizations to modernize their HPC infrastructure while maintaining the robust scheduling and resource management capabilities that SLURM is known for. Whether you're running AI/ML training workloads, scientific simulations, or data processing pipelines, Slinky provides the flexibility to containerize your applications without sacrificing the control and efficiency of SLURM.

Next Steps: Consider exploring Slinky's advanced features like custom schedulers, resource quotas, and integration with cloud provider auto-scaling groups to further optimize your HPC container workloads.

Ready to get started? The Slinky project is open-source and available on GitHub. Visit the SlinkyProject GitHub organization for the latest documentation and releases.

admin | July 15, 2025 | HPC

Nick Tailor's Technical Blog

A detail-minded individual, combining strong technical understanding and communication skills with experiences in Systems Administration, Engineering, Automation, AI Automation and Solutions; a proven methodical problem solver.
