Nick Tailor's Technical Blog

A detail-minded individual, combining strong technical understanding and communication skills with experience at the senior level:

Systems administration, Engineering, Low Latency, AI automation & Solutions

A proven methodical problem solver

I don't know the meaning of the word can't.

How to Deploy a Kubernetes Application with a Clean Namespace Structure

When you deploy an application to Kubernetes in production, you shouldn’t throw everything into the default namespace or a single giant YAML file. A proper setup uses: a dedicated namespace for the app; a ServiceAccount and RBAC for security; a ConfigMap and Secret for configuration; Deployment, Service, and Ingress for Read More …

Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)

Enterprise SAN switches from Cisco (MDS) and Brocade (Broadcom) power mission-critical storage networks. Whether you manage VMware, EMC VPLEX, or multi-array clusters, understanding the core and diagnostic commands is essential for maintaining performance and uptime. This article lists the most common operational, configuration, and diagnostic commands, explained clearly and paired with real-world examples. 1. System Information & Status: Cisco MDS (NX-OS) Read More …

Slurm Job: Cluster Sampler & Diagnostics (One-Click)

This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz. Usage: save as profile_env.slurm and submit:

sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm

#!/usr/bin/env bash
#
# profile_env.slurm - cluster-wide performance sampler & diagnostics
#
#SBATCH -J prof-playbook
#SBATCH -o prof-%x-%j.out
#SBATCH Read More …

A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network

Profiling Playbook: Detect GPU/CPU, Memory Bandwidth, and Network Bottlenecks. 0) Prep: Make the Test Reproducible. Choose a workload: (a) your real training/inference job, plus (b) a couple of microbenchmarks. Pin placement/affinity: match production (same container, CUDA/cuDNN, drivers, Read More …

Microsoft 365 Security in Azure/Entra – Step‑by‑Step Deployment Playbook

A practical, production-ready guide to ship a secure Microsoft 365 tenant using Entra ID (Azure AD), Conditional Access, Intune, Defender, and Purview, with rollback safety and validation checklists. Outcome: In a few hours, you’ll have MFA + Conditional Access, device trust with Intune, phishing/malware defense with Defender, and data Read More …

Complete Latency Troubleshooting Command Reference

How to Read This Guide: Each command shows the actual output you’ll see on your system. The green/red examples below each command show real outputs: green means your system is optimized for low latency, red means there are problems that will cause latency spikes. Compare your actual output to these examples to quickly identify issues. SECRET SAUCE: I did Read More …

Building Production-Ready Release Pipelines in AWS: A Step-by-Step Guide

Building a robust, production-ready release pipeline in AWS requires careful planning, proper configuration, and adherence to best practices. This comprehensive guide will walk you through creating an enterprise-grade release pipeline using AWS native services, focusing on real-world production scenarios. Architecture Overview: Our production pipeline will deploy a web application to EC2 instances behind an Application Load Balancer, implementing blue/green deployment Read More …

Mastering Ultra-Low Latency Systems: A Deep Dive into Bare-Metal Performance

In the world of high-frequency trading, real-time systems, and mission-critical applications, every nanosecond matters. This comprehensive guide explores the art and science of building ultra-low latency systems that push hardware to its absolute limits. Understanding the Foundations: Ultra-low latency systems demand a holistic approach to performance optimization. We’re talking about achieving deterministic execution with sub-microsecond response times, zero packet loss, Read More …

Building Production-Ready Release Pipelines in Azure: A Step-by-Step Guide using Arm Templates

Creating enterprise-grade release pipelines in Azure requires a comprehensive understanding of Azure DevOps services, proper configuration, and adherence to production best practices. This detailed guide will walk you through building a robust CI/CD pipeline that deploys applications to Azure App Services with slot-based deployments for zero-downtime releases. Architecture Overview: Our production pipeline will deploy a .NET web application to Azure Read More …

A detail-minded individual, combining strong technical understanding and communication skills with experiences in Systems Administration, Engineering, Automation, AI Automation and Solutions; a proven methodical problem solver.

How to Deploy Kubernetes on AWS the Scalable Way

How to add DNS entries from Linux to Windows DNS

How to deploy Open-AKC (Authorized Key Chain)

How to check what processes are using your swap

How to Deploy Lustre with ZFS Backend (RDMA, ACLs, Nodemaps, Clients)

A practical, repeatable workflow for NVIDIA-GPU Linux clusters (Slurm/K8s or bare-metal) to pinpoint whether your bottleneck is GPU, CPU, memory bandwidth, or network

Docker Cheat Sheet

How to RDP to VNC and authenticate using AD (Redhat 6)

How to compile PHP and run it as a CGI binary

How to Configure Redhat 7 & 8 Network Interfaces using Ansible

Building Production-Ready Release Pipelines in Azure: A Step-by-Step Guide using Arm Templates

SMTP auth relay with postfix

How to do a full restore if you wiped all your LVM’s

How to deploy windows firewall rules with Ansible

How to patch using RHN Satellite 5.0

How to Deploy VM’s in Hyper-V with Ansible

How to configure Ansible to manage Windows Hosts on Ubuntu 16.04

How to add a new SCSI LUN while server is Live

Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)

More Cheat Sheet for DevOps Engineers

HOW TO CHECK CPU, MEMORY, & DISK THRESHOLDS on an ARRAY of HOSTS

How to rebuild a drive that’s fallen out of a software raid

How to deploy an EC2 instance in AWS with Terraform

Key Components for Setting Up an HPC Cluster

How to pass an API key with Ansible

How to deploy wazuh-agent with Ansible

How to deploy Wazuh

How to properly upgrade wazuh with a major update (standalone setup)

How to RDP to VNC and authenticate using AD (OpenSuSe)

How to deploy OpenNebula Frontends via Ansible
