RAG Pipeline Demo: Understanding Retrieval Augmented Generation
This project is a deep, production-aligned demonstration of a Retrieval Augmented Generation (RAG) system applied to realistic insurance documents. Rather than hiding complexity, this demo makes every stage observable: document ingestion, chunking, embeddings, vector search, retrieval behavior, and how the LLM ultimately produces grounded answers. This post walks through the system exactly as an insurance AI engineer would debug, evaluate … Read More …
nvitop – The Ultimate Interactive NVIDIA GPU Monitoring Tool
If you’re working with NVIDIA GPUs, whether for deep learning, HPC, or systems administration, you’ve likely used the default nvidia-smi tool to check GPU status. But what if you want something more dynamic, interactive, and user-friendly? Enter nvitop, an incredible interactive NVIDIA GPU process viewer that makes monitoring your GPUs intuitive and informative. (nvitop in action: real-time GPU monitoring.) Read More …
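Getting started is simple; a minimal sketch, assuming a working Python environment and an NVIDIA driver already installed (the full post covers the interactive features):

```shell
# Install nvitop from PyPI (it talks to the driver via NVML)
pip install nvitop

# Launch the interactive monitor (quit with q)
nvitop
```

On a machine without an NVIDIA driver, nvitop will exit with an NVML error rather than showing the dashboard.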
SLURM Accounting Setup: My Personal Notes
SLURM accounting tracks every job that runs on your cluster — who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing. This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server. Read More …
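The full post covers the end-to-end setup; the core wiring lives in two config files. A minimal sketch — hostnames and credentials here are placeholder assumptions, not values from the post:

```ini
# slurmdbd.conf -- on the dedicated database server (hypothetical hostnames)
DbdHost=slurmdbd01
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme
StorageLoc=slurm_acct_db

# slurm.conf -- on the controller, pointing accounting at slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd01
JobAcctGatherType=jobacct_gather/linux
```

Once slurmdbd is running against MySQL/MariaDB, `sacct` and `sreport` start returning per-job usage data.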
Nick Tailor Notes: Essential SLURM Diagnostic Commands, Their Outputs, and What They Mean
When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run. This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re seeing. Read More …
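As a flavour of "interpreting the output": the STATE column of `sinfo` is usually the first thing to check. A small sketch parsing a captured sample — the output below is illustrative, not from a real cluster:

```shell
# Captured `sinfo` output, saved into a variable for parsing (illustrative sample)
sinfo_out='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 1-00:00:00      2  drain node[03-04]
gpu          up 1-00:00:00      6   idle node[05-10]'

# Column 5 is the node STATE; drain/down nodes need operator attention
echo "$sinfo_out" | awk '$5 ~ /drain|down/ {print $6}'
# Prints: node[03-04]
```

On a live cluster you would pipe `sinfo` directly into the same awk filter.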
SLURM Production Partitions: A Practical Guide to Job Scheduling
When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times. This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type. Read More …
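A typical layout of this kind can be sketched directly in slurm.conf; node names and time limits below are illustrative assumptions, not the post's actual configuration:

```ini
# slurm.conf -- example partition layout (node names and limits are assumptions)
PartitionName=debug   Nodes=cn[01-02]  Default=YES MaxTime=00:30:00   State=UP
PartitionName=compute Nodes=cn[03-60]  MaxTime=2-00:00:00            State=UP
PartitionName=gpu     Nodes=gpu[01-08] MaxTime=1-00:00:00            State=UP
PartitionName=highmem Nodes=hm[01-04]  MaxTime=3-00:00:00            State=UP
```

A short default-partition time limit keeps the debug queue responsive while long-running work is pushed onto the dedicated partitions.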
Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand
When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour. This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and … Read More …
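One layer of such a harness — checking NUMA symmetry — can be sketched by parsing `numactl --hardware`. The sample output below is canned for illustration; a real harness would run the command itself:

```shell
# Canned `numactl --hardware` output (illustrative sample)
numa_out='available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 257847 MB
node 1 cpus: 4 5 6 7
node 1 size: 257847 MB'

# Print "node-id size-in-MB"; uneven sizes hint at a missing or failed DIMM
echo "$numa_out" | awk '/size:/ {print $2, $4}'
# Prints: 0 257847
#         1 257847
```

The same pattern (capture a tool's output, reduce it to a comparable signal) generalises to `nvidia-smi topo -m` and `ibstat` checks.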
How to Deploy a Kubernetes Application with a Clean Namespace Structure
When you deploy an application to Kubernetes in production, you shouldn’t throw everything into the default namespace or a single giant YAML file. A proper setup uses:

- A dedicated namespace for the app
- A ServiceAccount and RBAC for security
- ConfigMap and Secret for configuration
- Deployment, Service, and Ingress for … Read More …
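The namespace and ServiceAccount pieces of such a setup might start like this; the app name "shop" is a placeholder, not from the post:

```yaml
# Hypothetical app "shop" -- namespace and ServiceAccount skeleton
apiVersion: v1
kind: Namespace
metadata:
  name: shop
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: shop-app
  namespace: shop
```

Applying this first (`kubectl apply -f`) gives the Deployment, ConfigMap, and RBAC objects a clean namespace to land in.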
Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)
Enterprise SAN switches from Cisco (MDS) and Brocade (Broadcom) power mission-critical storage networks. Whether you manage VMware, EMC VPLEX, or multi-array clusters, understanding the core and diagnostic commands is essential for maintaining performance and uptime. This article lists the most common operational, configuration, and diagnostic commands, explained clearly and paired with real-world examples. 1. System Information & Status: Cisco MDS (NX-OS) … Read More …
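A few of the side-by-side pairs this kind of comparison typically covers; these are standard NX-OS/FOS commands, listed here as a quick sketch rather than the article's exact table:

```text
Purpose                  Cisco MDS (NX-OS)        Brocade (FOS)
System/firmware info     show version             version
Port status overview     show interface brief     switchshow
Logged-in devices        show flogi database      nsshow
Active zoning config     show zoneset active      cfgactvshow
```

The vocabulary differs, but both platforms expose the same fabric concepts: logins (FLOGI vs. name server), ports, and zones.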
Slurm Job: Cluster Sampler & Diagnostics (One-Click)
This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz.

Usage: save as profile_env.slurm and submit:

```shell
sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm
```

```shell
#!/usr/bin/env bash
#
# profile_env.slurm — cluster-wide performance sampler & diagnostics
#
#SBATCH -J prof-playbook
#SBATCH -o prof-%x-%j.out
#SBATCH …
```

Read More …







