A detail-minded individual, combining strong technical understanding and communication skills with experience at the senior level:

Systems administration, Engineering, Low Latency, AI automation & Solutions

A proven methodical problem solver

I don't know the meaning of the word "can't".

Approaches to Server Security: Stop Thinking Like It’s 2010

Server Security / March 2026. The patterns showing up in server logs over recent months suggest that the attack surface has shifted in some fairly predictable ways. A few straightforward measures appear to address the bulk of it. The Pattern in the Logs: Digital Ocean. Anyone running a public-facing server and watching their /var/log/auth.log or fail2ban output will likely notice Read More …
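As a rough illustration of the kind of log-watching the excerpt mentions, the sketch below tallies failed SSH password attempts per source IP. It assumes the common OpenSSH "Failed password" message format; the sample lines, hostnames, and the `count_failed_logins` helper are hypothetical and not taken from the post.

```python
import re
from collections import Counter

# Assumed format: the "Failed password" lines OpenSSH typically writes to auth.log.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def count_failed_logins(lines):
    """Return a Counter of failed SSH password attempts keyed by source IP."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Hypothetical sample lines using documentation-reserved IP ranges.
SAMPLE_LOG = [
    "Mar  3 10:15:01 web1 sshd[1201]: Failed password for root from 203.0.113.7 port 51122 ssh2",
    "Mar  3 10:15:04 web1 sshd[1201]: Failed password for invalid user admin from 203.0.113.7 port 51130 ssh2",
    "Mar  3 10:16:40 web1 sshd[1244]: Accepted publickey for deploy from 198.51.100.20 port 40022 ssh2",
    "Mar  3 10:17:12 web1 sshd[1250]: Failed password for invalid user test from 192.0.2.55 port 39001 ssh2",
]

if __name__ == "__main__":
    # Print the noisiest sources first.
    for ip, attempts in count_failed_logins(SAMPLE_LOG).most_common():
        print(f"{ip}\t{attempts}")
```

To run it against a real log, pass an open file handle instead of `SAMPLE_LOG`; reading /var/log/auth.log usually requires elevated privileges.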

Security Hole Cpanel – Wp-tool-kit: Deeper Look…🤦‍♂️

I run security audits regularly. I’ve seen misconfigurations, oversights, and the occasional lazy shortcut. What I found in cPanel’s WordPress Toolkit is unbelievable… This doesn’t appear to be a bug. This is a deliberate architectural decision that gives unauditable code unrestricted root access to your server. By default. Without your consent. 😮🤦‍♂️ Millions of production servers are running this right Read More …

Security hole: WP Toolkit Deploys Wide Open Sudoers by Default – Here’s How to Fix It

If you’re running cPanel, you’re almost certainly running WP Toolkit. It’s installed by default on cPanel servers and is the standard tool for managing WordPress installations. Here’s the problem: WP Toolkit deploys with a sudoers configuration that gives it passwordless root access to your entire server. This isn’t something you enabled. It’s there out of the box. That means every Read More …
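A minimal sketch of how one might check sudoers text for passwordless rules of the kind described above. The sample fragment, the user names in it, and the `find_passwordless_rules` helper are hypothetical; the excerpt does not show the actual WP Toolkit sudoers file or its path, so nothing here reproduces it.

```python
def find_passwordless_rules(sudoers_text):
    """Return (line_number, rule, unrestricted) for each rule using NOPASSWD.

    `unrestricted` is True when the command list after NOPASSWD: is exactly ALL.
    Simplification: backslash line continuations and aliases are not resolved.
    """
    findings = []
    for number, raw in enumerate(sudoers_text.splitlines(), start=1):
        rule = raw.strip()
        # Skip blank lines and comment lines.
        if not rule or rule.startswith("#"):
            continue
        if "NOPASSWD:" in rule:
            commands = rule.split("NOPASSWD:", 1)[1].strip()
            findings.append((number, rule, commands == "ALL"))
    return findings


# Hypothetical sudoers fragment for illustration only.
SAMPLE_SUDOERS = """\
# example fragment, not the real WP Toolkit configuration
Defaults env_reset
exampleuser ALL=(ALL) NOPASSWD: ALL
backup ALL=(root) NOPASSWD: /usr/bin/rsync
admin ALL=(ALL) ALL
"""

if __name__ == "__main__":
    for number, rule, unrestricted in find_passwordless_rules(SAMPLE_SUDOERS):
        label = "UNRESTRICTED" if unrestricted else "restricted"
        print(f"line {number}: [{label}] {rule}")
```

A text scan like this is only a first pass; `sudo -l -U <user>` reports the rules sudo actually resolves for a given account.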

RAG Pipeline/Demo: Understanding – Retrieval Augmented Generation

This project is a deep, production-aligned demonstration of a Retrieval Augmented Generation (RAG) system applied to realistic insurance documents. Rather than hiding complexity, this demo makes every stage observable: document ingestion, chunking, embeddings, vector search, retrieval behavior, and how the LLM ultimately produces grounded answers. This post walks through the system exactly as an insurance AI engineer would debug, evaluate, Read More …

nvitop – The Ultimate Interactive NVIDIA GPU Monitoring Tool

If you’re working with NVIDIA GPUs, whether for deep learning, HPC, or systems administration, you’ve likely used the default nvidia-smi tool to check GPU status. But what if you want something more dynamic, interactive, and user-friendly? Enter nvitop, an interactive NVIDIA GPU process viewer that makes monitoring your GPUs intuitive and informative. nvitop in action: real-time GPU monitoring Read More …

SLURM Accounting Setup; my personal notes

SLURM accounting tracks every job that runs on your cluster: who submitted it, what resources it used, how long it ran, and which account to bill. This data powers fairshare scheduling, resource limits, usage reports, and chargeback billing. This post walks through setting up SLURM accounting from scratch in a production environment, with the database on a dedicated server Read More …

Nick Tailor Notes…Essential SLURM Diagnostic Commands: Outputs and What They Mean

When managing HPC clusters, knowing how to quickly diagnose job issues, node problems, and cluster health is essential. SLURM provides a comprehensive set of commands for this purpose, but understanding the output is just as important as knowing which command to run. This post covers the most common SLURM diagnostic commands, their expected outputs, and how to interpret what you’re Read More …

SLURM Production Partitions: A Practical Guide to Job Scheduling

When managing HPC clusters in production, how you structure your SLURM partitions directly impacts cluster efficiency, user experience, and resource utilisation. A well-designed partition layout ensures that the right jobs land on the right hardware, that scheduling is fair across user groups, and that turnaround times are predictable. This post covers typical production partition configurations and provides ready-to-use job script templates for each workload type. What Read More …

Building a Reusable HPC Diagnostic Harness for NUMA, CPU, GPU, MPI & InfiniBand

When operating HPC and AI infrastructure at scale, performance issues are rarely caused by a single factor. They are usually the result of subtle misalignment between CPU placement, NUMA locality, memory allocation, accelerator topology, or network fabric behaviour. This post walks through how to build a reusable diagnostic harness that allows you to methodically inspect these layers, collect evidence, and Read More …
