RAG Pipeline Demo: Understanding Retrieval Augmented Generation
This project is a deep, production-aligned demonstration of a Retrieval Augmented Generation (RAG) system applied to realistic insurance documents. Rather than hiding complexity, this demo makes every stage observable: document ingestion, chunking, embeddings, vector search, retrieval behavior, and how the LLM ultimately produces grounded answers. This post walks through the system exactly as an insurance AI engineer would debug, evaluate … Read More …
nvitop – The Ultimate Interactive NVIDIA GPU Monitoring Tool
If you’re working with NVIDIA GPUs, whether for deep learning, HPC, or systems administration, you’ve likely used the default nvidia-smi tool to check GPU status. But what if you want something more dynamic, interactive, and user-friendly? Enter nvitop, an incredible interactive NVIDIA GPU process viewer that makes monitoring your GPUs intuitive and informative. (nvitop in action: real-time GPU monitoring.) Read More …
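Getting started is simple; a minimal sketch, assuming a working Python environment and an NVIDIA driver already installed (the full post covers the interactive features):

```shell
# Install nvitop from PyPI (it talks to the driver via NVML)
pip install nvitop

# Launch the interactive monitor (quit with q)
nvitop
```

On a machine without an NVIDIA driver, nvitop will exit with an NVML error rather than showing the dashboard.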
SLURM Accounting Setup: My Personal Notes
SLURM accounting tracks every job that runs on your cluster — who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing. This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server. Read More …
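The full post covers the end-to-end setup; the core wiring lives in two config files. A minimal sketch — hostnames and credentials here are placeholder assumptions, not values from the post:

```ini
# slurmdbd.conf -- on the dedicated database server (hypothetical hostnames)
DbdHost=slurmdbd01
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme
StorageLoc=slurm_acct_db

# slurm.conf -- on the controller, pointing accounting at slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd01
JobAcctGatherType=jobacct_gather/linux
```

Once slurmdbd is running against MySQL/MariaDB, `sacct` and `sreport` start returning per-job usage data.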
Nick Tailor Notes: Essential SLURM Diagnostic Commands, Their Outputs, and What They Mean
When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run. This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re seeing. Read More …
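As a flavour of "interpreting the output": the STATE column of `sinfo` is usually the first thing to check. A small sketch parsing a captured sample — the output below is illustrative, not from a real cluster:

```shell
# Captured `sinfo` output, saved into a variable for parsing (illustrative sample)
sinfo_out='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 1-00:00:00      2  drain node[03-04]
gpu          up 1-00:00:00      6   idle node[05-10]'

# Column 5 is the node STATE; drain/down nodes need operator attention
echo "$sinfo_out" | awk '$5 ~ /drain|down/ {print $6}'
# Prints: node[03-04]
```

On a live cluster you would pipe `sinfo` directly into the same awk filter.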
SLURM Production Partitions: A Practical Guide to Job Scheduling
When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures the right jobs land on the right hardware, fair scheduling across user groups, and predictable turnaround times. This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type. Read More …
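A typical layout of this kind can be sketched directly in slurm.conf; node names and time limits below are illustrative assumptions, not the post's actual configuration:

```ini
# slurm.conf -- example partition layout (node names and limits are assumptions)
PartitionName=debug   Nodes=cn[01-02]  Default=YES MaxTime=00:30:00   State=UP
PartitionName=compute Nodes=cn[03-60]  MaxTime=2-00:00:00            State=UP
PartitionName=gpu     Nodes=gpu[01-08] MaxTime=1-00:00:00            State=UP
PartitionName=highmem Nodes=hm[01-04]  MaxTime=3-00:00:00            State=UP
```

A short default-partition time limit keeps the debug queue responsive while long-running work is pushed onto the dedicated partitions.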
Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand
When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour. This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and … Read More …
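One layer of such a harness — checking NUMA symmetry — can be sketched by parsing `numactl --hardware`. The sample output below is canned for illustration; a real harness would run the command itself:

```shell
# Canned `numactl --hardware` output (illustrative sample)
numa_out='available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 257847 MB
node 1 cpus: 4 5 6 7
node 1 size: 257847 MB'

# Print "node-id size-in-MB"; uneven sizes hint at a missing or failed DIMM
echo "$numa_out" | awk '/size:/ {print $2, $4}'
# Prints: 0 257847
#         1 257847
```

The same pattern (capture a tool's output, reduce it to a comparable signal) generalises to `nvidia-smi topo -m` and `ibstat` checks.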
How to Deploy a Kubernetes Application with a Clean Namespace Structure
When you deploy an application to Kubernetes in production, you shouldn’t throw everything into the default namespace or a single giant YAML file. A proper setup uses:

- A dedicated namespace for the app
- A ServiceAccount and RBAC for security
- ConfigMap and Secret for configuration
- Deployment, Service, and Ingress for … Read More …
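The namespace and ServiceAccount pieces of such a setup might start like this; the app name "shop" is a placeholder, not from the post:

```yaml
# Hypothetical app "shop" -- namespace and ServiceAccount skeleton
apiVersion: v1
kind: Namespace
metadata:
  name: shop
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: shop-app
  namespace: shop
```

Applying this first (`kubectl apply -f`) gives the Deployment, ConfigMap, and RBAC objects a clean namespace to land in.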
Cisco vs Brocade SAN Switch Commands Explained (with Diagnostics and Examples)
Enterprise SAN switches from Cisco (MDS) and Brocade (Broadcom) power mission-critical storage networks. Whether you manage VMware, EMC VPLEX, or multi-array clusters, understanding the core and diagnostic commands is essential for maintaining performance and uptime. This article lists the most common operational, configuration, and diagnostic commands, explained clearly and paired with real-world examples. 1. System Information & Status: Cisco MDS (NX-OS) … Read More …
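A few of the side-by-side pairs this kind of comparison typically covers; these are standard NX-OS/FOS commands, listed here as a quick sketch rather than the article's exact table:

```text
Purpose                  Cisco MDS (NX-OS)        Brocade (FOS)
System/firmware info     show version             version
Port status overview     show interface brief     switchshow
Logged-in devices        show flogi database      nsshow
Active zoning config     show zoneset active      cfgactvshow
```

The vocabulary differs, but both platforms expose the same fabric concepts: logins (FLOGI vs. name server), ports, and zones.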
Slurm Job: Cluster Sampler & Diagnostics (One-Click)
This job collects GPU/CPU, memory, NUMA, PCIe/NVLink, NIC/IB, and optional Nsight/NCCL/iperf3 telemetry across all allocated nodes while your workload runs, then bundles everything into a single .tgz.

Usage: save as profile_env.slurm and submit:

```shell
sbatch --export=ALL,WORKLOAD="torchrun --nproc_per_node=8 train.py --cfg config.yaml",ENABLE_NSYS=1,RUN_NCCL_TESTS=1,DURATION=1800 profile_env.slurm
```

```shell
#!/usr/bin/env bash
#
# profile_env.slurm — cluster-wide performance sampler & diagnostics
#
#SBATCH -J prof-playbook
#SBATCH -o prof-%x-%j.out
#SBATCH …
```

Read More …







