A detail-minded individual combining strong technical understanding and communication skills with senior-level experience in:

Systems Administration, Engineering, Low Latency, AI Automation & Solutions

A proven, methodical problem solver

I don't know the meaning of the word "can't."

nvitop – The Ultimate Interactive NVIDIA GPU Monitoring Tool

If you’re working with NVIDIA GPUs, whether for deep learning, HPC, or systems administration, you’ve likely used the default nvidia-smi tool to check GPU status. But what if you want something more dynamic, interactive, and user-friendly? Enter nvitop, an incredible interactive NVIDIA GPU process viewer that makes monitoring your GPUs intuitive and informative. [Image: nvitop in action — real-time GPU monitoring] Read More …
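For a quick taste before you read on, here is a minimal sketch of getting it running (assumes a Python environment with pip and a working NVIDIA driver; if your nvitop version lacks --once, plain nvitop still works):

# Install nvitop from PyPI
pip3 install --upgrade nvitop
# Launch the interactive monitor (press q to quit)
nvitop
# Or print a one-shot snapshot and exit
nvitop --once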

SLURM Accounting Setup: My Personal Notes

SLURM accounting tracks every job that runs on your cluster — who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing. This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server… Read More …
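As a flavour of what the full post covers, a minimal sketch of registering a cluster, account, and user with sacctmgr (the cluster, account, and user names here are hypothetical placeholders):

# Register the cluster with the accounting database
sacctmgr add cluster mycluster
# Create a billing account and attach a user to it
sacctmgr add account research Description="Research group" Organization="org"
sacctmgr add user alice Account=research
# Verify the associations took effect
sacctmgr show associations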

Nick Tailor Notes… Essential SLURM Diagnostic Commands: Outputs and What They Mean

When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run. This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re… Read More …
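A few of the commands the post walks through, shown as a quick sketch (node and job IDs are placeholders):

# Cluster-wide node and partition state
sinfo
# Nodes that are down or drained, with the admin's reason
sinfo -R
# Your queued and running jobs
squeue -u $USER
# Full detail on one node or one job
scontrol show node node001
scontrol show job 12345
# Accounting record for a finished job
sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS,ExitCode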

SLURM Production Partitions: A Practical Guide to Job Scheduling

When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times. This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type. What… Read More …
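For illustration, a hedged slurm.conf sketch of the kind of layout discussed (partition names, node ranges, and time limits are examples, not a recommendation):

# slurm.conf — example partition definitions
PartitionName=short Nodes=cpu[001-064] MaxTime=04:00:00 Default=YES State=UP
PartitionName=long  Nodes=cpu[001-064] MaxTime=7-00:00:00 State=UP
PartitionName=gpu   Nodes=gpu[001-016] MaxTime=24:00:00 State=UP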

Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand

When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour. This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and… Read More …
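To give a flavour of the layers the harness inspects, here are the kinds of one-line probes it builds on (assuming the usual vendor tools are installed):

# CPU and NUMA topology
lscpu
numactl --hardware
# GPU-to-GPU and GPU-to-NIC topology (NVLink/PCIe)
nvidia-smi topo -m
# InfiniBand port state and link rates
ibstat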

How to Deploy a Kubernetes Application with a Clean Namespace Structure

When you deploy an application to Kubernetes in production, you shouldn’t throw everything into the default namespace or a single giant YAML file. A proper setup uses a dedicated namespace for the app, a ServiceAccount and RBAC for security, ConfigMap and Secret for configuration, and Deployment, Service, and Ingress for… Read More …
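As a minimal sketch of the namespace-first approach (the app name "myapp" and the ./manifests/ directory are placeholders):

# Create the dedicated namespace
kubectl create namespace myapp
# Apply the app's manifests into it (ServiceAccount, ConfigMap,
# Secret, Deployment, Service, and Ingress live under ./manifests/)
kubectl -n myapp apply -f ./manifests/
# Confirm everything landed in the right namespace
kubectl -n myapp get all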

Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)

Enterprise SAN switches from Cisco (MDS) and Brocade (Broadcom) power mission-critical storage networks. Whether you manage VMware, EMC VPLEX, or multi-array clusters, understanding the core and diagnostic commands is essential for maintaining performance and uptime. This article lists the most common operational, configuration, and diagnostic commands, explained clearly and paired with real-world examples. 1. System Information & Status: Cisco MDS (NX-OS)… Read More …
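A small sample of the command pairs the article walks through (a sketch only; exact syntax can vary by firmware release, so check your platform's docs):

# Cisco MDS (NX-OS): platform and port status
show version
show interface brief
# Brocade (FOS): the equivalent checks
version
switchshow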

Slurm Job: Cluster Sampler & Diagnostics (One-Click)

This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz. Usage: save as profile_env.slurm and submit:

sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm

#!/usr/bin/env bash
#
# profile_env.slurm — cluster-wide performance sampler & diagnostics
#
#SBATCH -J prof-playbook
#SBATCH -o prof-%x-%j.out
#SBATCH… Read More …
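Once the job finishes, the bundle can be inspected like any tarball (the archive name below is a hypothetical example; the real name comes from the job itself):

# List the collected telemetry without extracting
tar -tzf prof-playbook-12345.tgz
# Unpack into a working directory for analysis
mkdir -p prof-12345 && tar -xzf prof-playbook-12345.tgz -C prof-12345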

Profiling Playbook: Detect GPU/CPU, Memory Bandwidth, and Network Bottlenecks

A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network. 0) Prep: Make the Test Reproducible. Choose a workload: (a) your real training/inference job, plus (b) a couple of microbenchmarks. Pin placement/affinity: match production (same container, CUDA/cuDNN, drivers,… Read More …
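A couple of probes in the spirit of those microbenchmarks, as a sketch (host names are placeholders; flags follow the standard tools):

# Stream per-GPU power, utilisation, clocks, memory, and PCIe throughput
nvidia-smi dmon -s pucmt
# Network: run a server on one node, then drive traffic from another
iperf3 -s                # on node-a
iperf3 -c node-a -P 4    # on node-b, four parallel streams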
