Day: March 28, 2025

OpenShift Architecture & Migration Design: Building Secure, Scalable Enterprise Platforms

Designing and migrating to OpenShift is not about installing a cluster. It is about controlling failure domains, aligning the vSphere and Kubernetes schedulers, and avoiding hidden infrastructure bottlenecks that only surface under load or during outages.

This post walks through concrete implementation patterns using Terraform and Ansible, explains why they exist, and highlights what will break if you get them wrong.


Migration Strategy: Phased Approach

Every failed migration I have seen skipped or compressed one of these phases. The pressure to “just move it” creates technical debt that surfaces as production incidents.

Phase 1: Discovery and Assessment

Before touching infrastructure, you need a complete inventory of what exists and how it behaves.


# VMware dependency discovery script
# Export VM metadata, network connections, storage mappings
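# Requires VMware PowerCLI and an active Connect-VIServer session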

$vms = Get-VM | Select-Object Name, PowerState, NumCpu, MemoryGB,
  @{N='Datastore';E={(Get-Datastore -VM $_).Name}},
  @{N='Network';E={(Get-NetworkAdapter -VM $_).NetworkName}},
  @{N='VMHost';E={$_.VMHost.Name}}

$vms | Export-Csv -Path "vm-inventory.csv" -NoTypeInformation

# Capture network flows for dependency mapping
# Run for a minimum of two weeks; extend to a full cycle if monthly batch jobs matter

What you are looking for:

  • Hard-coded IPs in application configs
  • NFS mounts and shared storage dependencies
  • Inter-VM communication patterns (what talks to what)
  • Authentication integrations (LDAP, AD, service accounts)
  • Scheduled jobs and their timing dependencies
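
A minimal sketch of the flow capture and hard-coded IP checks above, run on each candidate Linux VM (paths are assumptions; repeat the snapshot throughout the observation window):

# Snapshot established connections for dependency mapping
ss -tnp state established >> "/tmp/flows-$(hostname).txt"

# Flag hard-coded IPs in application configuration
grep -rEn '([0-9]{1,3}\.){3}[0-9]{1,3}' /etc /opt/app/conf 2>/dev/null | grep -v '127\.0\.0\.1'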

Assessment deliverables:

  • Application dependency map
  • Containerization readiness score per workload
  • Risk register with mitigation strategies
  • Estimated effort per application (T-shirt sizing)

Phase 2: Target Architecture Design

Design the OpenShift environment before migration begins. This includes cluster topology, namespace strategy, and resource quotas.


# Namespace strategy example
# Environments separated by namespace, not cluster

apiVersion: v1
kind: Namespace
metadata:
  name: app-prod
  labels:
    environment: production
    cost-center: "12345"
    data-classification: confidential
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-prod-quota
  namespace: app-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    persistentvolumeclaims: "20"

Phase 3: Pilot Migration

Select two to three non-critical applications that exercise different patterns:

  • One stateless web application
  • One application with persistent storage
  • One application with external integrations

The pilot validates your tooling, processes, and assumptions before you scale.

Phase 4: Wave Migration

Group applications into waves based on dependencies and risk. Each wave should be independently deployable and rollback-capable.


# Wave planning structure
wave_1:
  applications:
    - name: static-website
      risk: low
      dependencies: none
      estimated_downtime: 0
  success_criteria:
    - all pods healthy for 24 hours
    - response times within 10% of baseline
    - zero error rate increase

wave_2:
  applications:
    - name: api-gateway
      risk: medium
      dependencies:
        - static-website
      estimated_downtime: 5 minutes
  gate:
    - wave_1 success criteria met
    - stakeholder sign-off

Phase 5: Cutover and Decommission

Final traffic switch and legacy teardown. This is where DNS TTL planning matters.

Common cutover failures:

  • DNS TTLs not reduced in advance (reduce to 60 seconds, 48 hours before cutover)
  • Client-side caching ignoring TTL
  • Hardcoded IPs in partner systems
  • Certificate mismatches after DNS change
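
Verifying the TTL is cheap and catches the most common miss above. A quick check against the public record (hostname is illustrative):

# Second field of the answer is the remaining TTL in seconds
dig +noall +answer app.example.com
# Anything above 60 this close to cutover means the zone change has not
# propagated, or was never made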

VMware to OpenShift: Migration Patterns

Not every VM becomes a container. The migration pattern depends on application architecture, not convenience.

Pattern 1: Lift and Containerize

For applications that are already 12-factor compliant or close to it. Package existing binaries into containers with minimal modification.


# Dockerfile for legacy Java application
FROM registry.access.redhat.com/ubi8/openjdk-11-runtime

COPY target/app.jar /deployments/app.jar

ENV JAVA_OPTS="-Xms512m -Xmx2048m"

EXPOSE 8080
CMD ["java", "-jar", "/deployments/app.jar"]

When to use: Application reads config from environment variables, logs to stdout, and has no local state.

Pattern 2: Replatform with Refactoring

Application requires changes to run in containers but core logic remains. Typical changes include externalizing configuration and adding health endpoints.


# Spring Boot health endpoint addition
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  endpoint:
    health:
      probes:
        enabled: true
      show-details: always

When to use: Application has some container-unfriendly patterns (file-based config, local logging) but is otherwise sound.

Pattern 3: Retain on VM

Some workloads should not be containerized:

  • Legacy applications with kernel dependencies
  • Workloads requiring specific hardware (GPU passthrough, SR-IOV)
  • Applications with licensing tied to VM or physical host
  • Databases with extreme I/O requirements (evaluate case by case)

OpenShift Virtualization (KubeVirt) can run VMs alongside containers when needed.


apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: legacy-app-vm
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
        memory:
          guest: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: legacy-app-pvc

Pattern 4: Rebuild or Replace

Application is fundamentally incompatible and would require complete rewrite. Evaluate whether a commercial off-the-shelf replacement makes more sense.

Decision matrix:

Factor                   Containerize       Keep on VM         Replace
Strategic value          High               Low/Legacy         Medium
Maintenance cost         Acceptable         High but stable    Unsustainable
12-factor compliance     Partial or full    None               N/A
Vendor support           Available          Legacy only        N/A

Infrastructure Provisioning with Terraform

Why Terraform First

OpenShift installation assumes the underlying infrastructure is deterministic. If VM placement, CPU topology, or networking varies between environments, the cluster will behave differently under identical workloads. Terraform is used to lock infrastructure intent before OpenShift ever runs.


Example: vSphere Control Plane and Worker Provisioning


provider "vsphere" {
  user           = var.vsphere_user
  password       = var.vsphere_password
  vsphere_server = var.vsphere_server
  allow_unverified_ssl = true
}

data "vsphere_datacenter" "dc" {
  name = var.datacenter
}

data "vsphere_compute_cluster" "cluster" {
  name          = var.cluster
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_datastore" "datastore" {
  name          = var.datastore
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "network" {
  name          = var.network
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "control_plane" {
  count            = 3
  name             = "ocp-master-${count.index}"
  resource_pool_id = data.vsphere_compute_cluster.cluster.resource_pool_id
  datastore_id     = data.vsphere_datastore.datastore.id
  folder           = var.vm_folder

  num_cpus = 8
  memory   = 32768
  guest_id = "rhel8_64Guest"

  # Critical: Reservations prevent resource contention
  cpu_reservation    = 8000
  memory_reservation = 32768

  # Anti-affinity is enforced by the rule below, which references these VMs;
  # a depends_on in this direction would create a dependency cycle

  network_interface {
    network_id = data.vsphere_network.network.id
  }

  disk {
    label            = "root"
    size             = 120
    thin_provisioned = false  # Thick provisioning for control plane
  }

  disk {
    label            = "etcd"
    size             = 100
    thin_provisioned = false
    unit_number      = 1
  }
}

# Anti-affinity ensures control plane nodes run on different hosts
resource "vsphere_compute_cluster_vm_anti_affinity_rule" "control_plane_anti_affinity" {
  name               = "ocp-control-plane-anti-affinity"
  compute_cluster_id = data.vsphere_compute_cluster.cluster.id
  virtual_machine_ids = [for vm in vsphere_virtual_machine.control_plane : vm.id]
}

CPU and memory reservations are not optional. Without them, vSphere ballooning and scheduling delays will surface as random etcd latency and API instability.

What usually breaks:

  • etcd timeouts under load (etcd requires consistent sub-10ms disk latency)
  • API server flapping during node pressure
  • Unexplained cluster degradation after vMotion events
  • Split-brain scenarios when anti-affinity is not enforced
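
A common pre-install sanity check for the sub-10ms etcd requirement is a small fio fsync test against the disk that will back etcd (a sketch; assumes the etcd disk is mounted at /var/lib/etcd and fio 3.5+ is available):

fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-disk-check
# Check the fsync/fdatasync latency percentiles in the output:
# the 99th percentile should stay under 10ms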

Worker Node Pools by Workload Type


resource "vsphere_virtual_machine" "workers_general" {
  count = 6
  name  = "ocp-worker-general-${count.index}"

  num_cpus = 8
  memory   = 32768

  # General workers can use thin provisioning
  disk {
    label            = "root"
    size             = 120
    thin_provisioned = true
  }
}

resource "vsphere_virtual_machine" "workers_stateful" {
  count = 3
  name  = "ocp-worker-stateful-${count.index}"

  num_cpus = 16
  memory   = 65536

  # Stateful workers need guaranteed resources
  cpu_reservation    = 16000
  memory_reservation = 65536

  disk {
    label            = "root"
    size             = 120
    thin_provisioned = false
  }
}

resource "vsphere_virtual_machine" "workers_infra" {
  count = 3
  name  = "ocp-worker-infra-${count.index}"

  num_cpus = 8
  memory   = 32768

  # Infrastructure nodes for routers, monitoring, logging
  disk {
    label            = "root"
    size             = 200
    thin_provisioned = false
  }
}

Different workloads require different failure and performance characteristics. Trying to “let Kubernetes figure it out” leads to noisy neighbors and unpredictable latency.
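
Separation only works if the scheduler can see it. A sketch of labeling and tainting the pools after install so stateful and infrastructure workloads land where intended (node names are illustrative; workloads then need matching nodeSelectors and tolerations):

oc label node ocp-worker-stateful-0 node-role.kubernetes.io/stateful=""
oc adm taint nodes ocp-worker-stateful-0 workload=stateful:NoSchedule

oc label node ocp-worker-infra-0 node-role.kubernetes.io/infra=""
oc adm taint nodes ocp-worker-infra-0 node-role.kubernetes.io/infra=reserved:NoSchedule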


Post-Provision Configuration with Ansible

Why Ansible Is Still Required

Terraform stops at infrastructure. OpenShift nodes require OS-level hardening, kernel tuning, and configuration consistency before installation. Ignoring this step leads to subtle instability that manifests weeks later.


Example: Node OS Hardening


---
- name: Prepare OpenShift nodes
  hosts: openshift_nodes
  become: true
  tasks:

    - name: Disable swap
      command: swapoff -a
      changed_when: false

    - name: Remove swap from fstab
      replace:
        path: /etc/fstab
        regexp: '^([^#].*swap.*)$'
        replace: '# \1'

    - name: Load required kernel modules
      # br_netfilter must be loaded before the net.bridge.* sysctls below can apply
      modprobe:
        name: "{{ item }}"
        state: present
      loop:
        - br_netfilter
        - overlay
        - ip_vs
        - ip_vs_rr
        - ip_vs_wrr
        - ip_vs_sh

    - name: Set kernel parameters for OpenShift
      sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        sysctl_file: /etc/sysctl.d/99-openshift.conf
      loop:
        - { key: net.ipv4.ip_forward, value: 1 }
        - { key: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { key: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { key: vm.max_map_count, value: 262144 }
        - { key: fs.inotify.max_user_watches, value: 1048576 }
        - { key: fs.inotify.max_user_instances, value: 8192 }
        - { key: net.core.somaxconn, value: 32768 }
        - { key: net.ipv4.tcp_max_syn_backlog, value: 32768 }

    - name: Ensure kernel modules load on boot
      copy:
        dest: /etc/modules-load.d/openshift.conf
        content: |
          br_netfilter
          overlay
          ip_vs
          ip_vs_rr
          ip_vs_wrr
          ip_vs_sh

These values are not arbitrary. OpenShift components and container runtimes will fail silently or degrade under load if kernel defaults are used.
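
A quick spot check on a node after the playbook run confirms the settings actually took:

swapon --show          # no output means swap is off
sysctl vm.max_map_count net.bridge.bridge-nf-call-iptables
lsmod | grep -E 'br_netfilter|overlay'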


Container Runtime Configuration


- name: Configure CRI-O
  copy:
    dest: /etc/crio/crio.conf.d/99-custom.conf
    content: |
      [crio.runtime]
      default_ulimits = [ "nofile=1048576:1048576" ]
      pids_limit = 4096

      [crio.image]
      pause_image = "registry.redhat.io/openshift4/ose-pod:latest"

- name: Configure container storage
  copy:
    dest: /etc/containers/storage.conf
    content: |
      [storage]
      driver = "overlay"
      runroot = "/run/containers/storage"
      graphroot = "/var/lib/containers/storage"

      [storage.options.overlay]
      mountopt = "nodev,metacopy=on"

Default ulimits are insufficient for high-density clusters. You will hit file descriptor exhaustion before CPU or memory limits.
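
To confirm the higher limit actually reaches workloads, check from inside a running pod (namespace and deployment name are illustrative):

oc exec -n app-prod deploy/app -- sh -c 'ulimit -n'   # expect 1048576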


Security Architecture

RBAC Design Principles

Role-based access control should follow least privilege. Avoid cluster-admin grants; use namespace-scoped roles.


# Developer role - namespace scoped
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: app-dev
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "services", "configmaps", "secrets", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"]
  # RBAC is additive only; cluster-scoped resources such as nodes and
  # persistentvolumes remain off-limits because no rule grants them
---
# Operations role - read-only cluster wide, write in specific namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operations-readonly
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces", "persistentvolumes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]

Network Policies

Default deny with explicit allows. Every namespace should have a baseline policy.


# Default deny all ingress
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow ingress from same namespace only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: app-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
---
# Allow ingress from OpenShift router
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-router
  namespace: app-prod
spec:
  podSelector:
    matchLabels:
      app: web-frontend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress

Image Security and Supply Chain


# Image policy to restrict registries
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  name: cluster
spec:
  registrySources:
    allowedRegistries:
      - registry.redhat.io
      - registry.access.redhat.com
      - quay.io
      - ghcr.io
      - registry.internal.example.com
    # allowedRegistries and blockedRegistries are mutually exclusive; anything
    # not listed above, including docker.io, is already blocked for compliance
---
# Require signed images in production
apiVersion: policy.sigstore.dev/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: require-signatures
spec:
  images:
    - glob: "registry.internal.example.com/prod/**"
  authorities:
    - keyless:
        url: https://fulcio.sigstore.dev
        identities:
          - issuer: https://accounts.google.com
            subject: release-team@example.com

Pod Security Standards


# Enforce restricted security context
apiVersion: v1
kind: Namespace
metadata:
  name: app-prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Security Context Constraints for OpenShift
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: app-restricted
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
  - ALL
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim

Networking: Where Most Migrations Fail

Ingress and Load Balancer Alignment

External load balancers must align with OpenShift router expectations. Health checks should target readiness endpoints, not TCP ports.


# HAProxy configuration for OpenShift routers
frontend openshift_router_https
    bind *:443
    mode tcp
    option tcplog
    default_backend openshift_router_https_backend

backend openshift_router_https_backend
    mode tcp
    balance source
    option httpchk GET /healthz/ready HTTP/1.1\r\nHost:\ router-health
    http-check expect status 200
    server router-0 192.168.1.10:443 check port 1936 inter 5s fall 3 rise 2
    server router-1 192.168.1.11:443 check port 1936 inter 5s fall 3 rise 2
    server router-2 192.168.1.12:443 check port 1936 inter 5s fall 3 rise 2

Common failure: Load balancer marks routers healthy while the application is unavailable. TCP health checks pass even when the router pod is terminating.
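
Verifying what the load balancer should be asking is straightforward (router pod IP is illustrative):

curl -s -o /dev/null -w '%{http_code}\n' http://192.168.1.10:1936/healthz/ready
# Returns 200 only while the router can actually take traffic; a bare TCP
# connect to 443 still succeeds while the pod is terminating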


MTU and Overlay Networking

MTU mismatches between underlay, NSX, and OpenShift overlays cause:

  • Intermittent pod-to-pod packet loss
  • gRPC failures (large payloads fragment incorrectly)
  • Random CI/CD pipeline timeouts
  • TLS handshake failures

# Verify MTU across the path
# Physical network: 9000 (jumbo frames)
# NSX overlay: 8900 (100 byte overhead)
# OpenShift OVN: 8800 (additional 100 byte overhead)

# Test from inside a pod
kubectl exec -it debug-pod -- ping -M do -s 8772 target-service

# If this fails, reduce MTU until it works
# Then configure cluster network appropriately

# OpenShift cluster network configuration
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
  serviceNetwork:
    - 172.30.0.0/16
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      mtu: 8800
      genevePort: 6081

This is almost never diagnosed correctly on first pass. Symptoms look like application bugs.


DNS Configuration for Migration


# DNS forwarding for migration, configured through the DNS operator
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
    - name: legacy
      zones:
        - legacy.example.com
      forwardPlugin:
        upstreams:
          - 10.0.0.53
          - 10.0.0.54

During migration, pods may need to resolve legacy DNS names. Configure forwarding rules before cutting over applications.
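
A quick in-cluster check that forwarding works before any application depends on it (image and hostname are illustrative):

oc run dns-check -it --rm --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi -- getent hosts db01.legacy.example.com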


Storage: Persistent Volumes and CSI Reality

StorageClass Design


# Pure Storage FlashArray - Fast tier
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: pure-csi
parameters:
  backend: flasharray
  csi.storage.k8s.io/fstype: xfs
  createoptions: -q
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Pure Storage FlashBlade - Shared/NFS tier
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-shared
provisioner: pure-csi
parameters:
  backend: flashblade
  exportRules: "*(rw,no_root_squash)"
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
# Standard tier for non-critical workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  datastore: vsanDatastore
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

WaitForFirstConsumer is critical for block storage. Without it, volumes are bound before pod placement, breaking topology-aware scheduling.

What breaks if ignored:

  • Pods stuck in Pending state
  • Volumes attached to unreachable nodes
  • Zone-aware deployments fail silently
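
Both the binding mode and the symptom are visible directly (PVC name and namespace are illustrative):

oc get storageclass    # VOLUMEBINDINGMODE column shows WaitForFirstConsumer vs Immediate
oc describe pvc db-data -n app-prod
# A healthy WaitForFirstConsumer claim reports "waiting for first consumer to be
# created before binding" until a pod that uses it is scheduled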

Stateful Application Migration


# Database migration pattern using PVC cloning
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-migrated
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: pure-fast
  resources:
    requests:
      storage: 500Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: db-data-legacy

Observability and Migration Validation

Baseline Metrics Before Migration

You cannot validate a migration without knowing what normal looks like. Capture baselines for at least two weeks before migration.


# Key metrics to baseline
# Application metrics
- request_duration_seconds (p50, p95, p99)
- request_total (rate)
- error_total (rate)
- active_connections

# Infrastructure metrics
- cpu_usage_percent
- memory_usage_bytes
- disk_io_seconds
- network_bytes_transmitted
- network_bytes_received

# Business metrics
- transactions_per_second
- successful_checkouts
- user_sessions_active

Prometheus Rules for Migration


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: migration-validation
  namespace: openshift-monitoring
spec:
  groups:
    - name: migration.rules
      rules:
        # Alert if latency increases more than 20% post-migration
        - alert: MigrationLatencyRegression
          expr: |
            (
              histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{migrated="true"}[5m])) by (le, service))
              /
              histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{migrated="false"}[5m])) by (le, service))
            ) > 1.2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Latency regression detected post-migration"
            description: "Service {{ $labels.service }} p95 latency increased by more than 20%"

        # Alert on error rate increase
        - alert: MigrationErrorRateIncrease
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", migrated="true"}[5m])) by (service)
              /
              sum(rate(http_requests_total{migrated="true"}[5m])) by (service)
            ) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate exceeded 1% post-migration"

Grafana Dashboard for Migration


# Dashboard JSON snippet for migration comparison
{
  "panels": [
    {
      "title": "Request Latency Comparison",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{env='legacy'}[5m])) by (le))",
          "legendFormat": "Legacy p95"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{env='openshift'}[5m])) by (le))",
          "legendFormat": "OpenShift p95"
        }
      ]
    },
    {
      "title": "Error Rate Comparison",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~'5..', env='openshift'}[5m])) / sum(rate(http_requests_total{env='openshift'}[5m]))",
          "legendFormat": "OpenShift Error Rate"
        }
      ]
    }
  ]
}

Log Aggregation for Troubleshooting


# Loki configuration for migration logs
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small
  storage:
    schemas:
      - version: v12
        effectiveDate: "2024-01-01"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: pure-fast
  tenants:
    mode: openshift-logging

CI/CD and GitOps: What Actually Works

Immutable Image Promotion

Do not rebuild images per environment. Build once, scan once, promote through environments.


# Tekton pipeline for build-once promotion
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-promote
spec:
  params:
    - name: git-revision
      type: string
  tasks:
    - name: build
      taskRef:
        name: buildah
      params:
        - name: IMAGE
          value: "registry.internal/app:$(params.git-revision)"

    - name: scan
      taskRef:
        name: trivy-scan
      runAfter:
        - build

    - name: sign
      taskRef:
        name: cosign-sign
      runAfter:
        - scan

    - name: promote-to-dev
      taskRef:
        name: skopeo-copy
      runAfter:
        - sign
      params:
        - name: srcImage
          value: "registry.internal/app:$(params.git-revision)"
        - name: destImage
          value: "registry.internal/app:dev"

If you rebuild per environment:

  • Debugging becomes impossible (which build has the bug?)
  • Security attestation is meaningless
  • Promotion is not promotion, it is a new deployment
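
Outside the pipeline, promotion to the next environment is the same copy-by-digest idea done by hand. A sketch (tags are illustrative; requires skopeo and jq):

DIGEST=$(skopeo inspect docker://registry.internal/app:dev | jq -r '.Digest')
skopeo copy "docker://registry.internal/app@${DIGEST}" docker://registry.internal/app:stage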

ArgoCD Application Example


apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-prod
  namespace: openshift-gitops
spec:
  project: production
  destination:
    namespace: app-prod
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/org/app-config
    targetRevision: main
    path: overlays/prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=false
      - PrunePropagationPolicy=foreground
      - PruneLast=true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Allow HPA to control replicas

Self-heal is not optional in regulated or audited environments. Manual drift is operational debt that compounds.

Environment Promotion with Kustomize


# Base kustomization
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml

# Production overlay
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - patch: |
      - op: replace
        path: /spec/replicas
        value: 5
    target:
      kind: Deployment
      name: app
images:
  - name: app
    newName: registry.internal/app
    newTag: v1.2.3  # Pinned version, updated by CI

Rollback Strategy

Application-Level Rollback


# ArgoCD rollback to previous version
argocd app history app-prod
argocd app rollback app-prod <revision>

# Or using kubectl
kubectl rollout undo deployment/app -n app-prod
kubectl rollout status deployment/app -n app-prod

Traffic-Based Rollback


# OpenShift route for blue-green deployment
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: app
  namespace: app-prod
spec:
  to:
    kind: Service
    name: app-green
    weight: 100
  alternateBackends:
    - kind: Service
      name: app-blue
      weight: 0
---
# To rollback, shift traffic back to blue
# oc patch route app -p '{"spec":{"to":{"weight":0},"alternateBackends":[{"kind":"Service","name":"app-blue","weight":100}]}}'

Full Migration Rollback

For critical systems, maintain the ability to roll back the entire migration for a defined period.


# Rollback checklist
rollback_criteria:
  - error_rate > 5% for 15 minutes
  - p99_latency > 2x baseline for 30 minutes
  - data_integrity_check_failed
  - critical_integration_broken

rollback_procedure:
  1. Announce rollback decision
  2. Stop writes to new system (if applicable)
  3. Verify data sync to legacy is current
  4. Switch DNS/load balancer to legacy
  5. Verify legacy system health
  6. Communicate rollback complete
  7. Schedule post-mortem

rollback_window: 14 days  # Maintain legacy systems for 2 weeks post-migration

Data Rollback Considerations


# Continuous data sync for rollback capability
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-sync-to-legacy
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sync
              image: registry.internal/data-sync:latest
              env:
                - name: SOURCE_DB
                  value: "postgresql://new-db:5432/app"
                - name: TARGET_DB
                  value: "postgresql://legacy-db:5432/app"
                - name: SYNC_MODE
                  value: "incremental"
          restartPolicy: OnFailure

Key principle: Never decommission legacy systems until the rollback window has passed and stakeholders have signed off.


Migration Execution: What People Underestimate

State and Cutover

Databases and stateful services require parallel runs and controlled traffic switching. DNS TTLs must be reduced days in advance, not minutes.

Most outages during migration are caused by:

  • Hidden hard-coded IPs in application configs, scripts, and cron jobs
  • Legacy authentication dependencies (service accounts with IP-based trust)
  • Assumed local storage paths that do not exist in containers
  • Timezone differences between legacy VMs and containers (UTC default)
  • Environment variables that were set manually and never documented
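
For the undocumented environment variable problem in particular, diffing the legacy process environment against what the Deployment actually sets catches most surprises before cutover (host, process, and names are illustrative):

ssh legacy-vm 'tr "\0" "\n" < /proc/$(pgrep -of app.jar)/environ | sort' > legacy-env.txt
oc set env deployment/app -n app-prod --list | sort > ocp-env.txt
diff legacy-env.txt ocp-env.txt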

Communication Plan


# Migration communication template
stakeholders:
  - business_owners
  - development_teams
  - operations
  - security
  - support

communications:
  - timing: T-14 days
    message: "Migration scheduled, review runbook"
    audience: all

  - timing: T-2 days
    message: "DNS TTL reduced, final validation"
    audience: operations, development

  - timing: T-0 (cutover)
    message: "Migration in progress, reduced SLA"
    audience: all

  - timing: T+1 hour
    message: "Initial validation complete"
    audience: all

  - timing: T+24 hours
    message: "Migration successful, monitoring continues"
    audience: all

Operational Testing (Non-Negotiable)

Before production:

  • Kill a control plane node and verify automatic recovery
  • Force etcd leader re-election during load
  • Simulate storage controller failure
  • Drain workers during peak load
  • Test certificate rotation
  • Verify backup and restore procedures
  • Run security scan and penetration test

# Chaos testing example with Litmus
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: control-plane-chaos
  namespace: litmus
spec:
  engineState: active
  appinfo:
    appns: openshift-etcd
    applabel: app=etcd
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "true"

If the platform team is afraid to do this, the cluster is not ready.


In short…

OpenShift migration is not a technology project. It is an operational transformation that happens to involve technology.

The patterns in this post exist because I have seen the alternatives fail. Every shortcut, whether it is skipping reservations, ignoring kernel tuning, or compressing testing phases, creates debt that surfaces as production incidents.

Key principles:

  • Infrastructure must be deterministic before OpenShift installation
  • Security is architecture, not an afterthought
  • Migration strategy matters more than migration speed
  • Observability validates success; without baselines, you are guessing
  • Rollback capability is not optional for production systems
  • Test failure modes before they test you

The goal is not simply to move workloads. It is to move them without carrying your existing problems along, and without creating new ones.