Kubernetes Node Maintenance for Security Patching
Performed Kubernetes node maintenance on GKE to apply security patches, achieving zero downtime and enhanced protection for a critical web application.

Key Results: 100% uptime during maintenance
Situation
A Kubernetes cluster running on Google Kubernetes Engine (GKE) hosts a critical web application with high availability requirements. The cluster, consisting of multiple worker and control-plane nodes, is managed by a team of DevOps engineers. A recent security advisory revealed vulnerabilities in the container runtime and container images, necessitating immediate patching to prevent potential exploits. The cluster uses Pod Disruption Budgets (PDBs) for high availability, namespaces for resource organization, and network policies for security. Monitoring is handled by Prometheus, with logs centralized using an EFK (Elasticsearch, Fluentd, Kibana) stack. The application is exposed via an Ingress controller with TLS for secure communication. The challenge was to perform node maintenance without disrupting the application's availability.
Task
The DevOps engineer was tasked with performing maintenance on a specific Kubernetes node to apply security patches to the container runtime and update container images. The maintenance had to:
- Ensure minimal disruption, respecting the Pod Disruption Budget.
- Maintain application availability for users.
- Safely return the node to the cluster post-maintenance.
- Verify the application’s performance and security.
- Document the process and provide a sequence diagram for the maintenance workflow.
Action
The DevOps engineer followed a structured approach, leveraging Kubernetes tools and best practices. Below are the detailed actions taken:
1. Cluster Setup with Terraform
To manage the GKE cluster infrastructure consistently, we used Terraform, ensuring repeatable and secure node provisioning. The Terraform configuration, written in HCL, defined the GKE cluster with security features:
provider "google" {
project = "websecure-inc"
region = "us-central1"
}
resource "google_container_cluster" "healthcare_cluster" {
name = "websecure-gke"
location = "us-central1-a"
network = "default"
initial_node_count = 3
node_config {
machine_type = "e2-standard-4"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform",
]
metadata = {
disable-legacy-endpoints = "true"
}
}
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.0.0/28"
}
master_auth {
client_certificate_config {
issue_client_certificate = false
}
}
}This configuration ensured private nodes and secure cluster settings, aligning with the application’s high availability and security requirements.
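With the configuration in place, the standard Terraform workflow validates and applies it; a minimal sketch (the plan file name is illustrative):

terraform init                    # download the Google provider plugins
terraform plan -out=gke.tfplan    # preview changes against current state
terraform apply gke.tfplan        # provision the GKE cluster as planned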
2. Preparation and Planning
We identified the target node (e.g., node-1) using:
kubectl get nodes
This listed node names, statuses, roles, and versions, ensuring accurate selection; current resource usage was checked with kubectl top nodes. We verified the Pod Disruption Budget (PDB) that keeps at least two pods with the app: web-app label available:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

We reviewed pod resource requests and limits to confirm other nodes could handle evicted pods:
apiVersion: v1
kind: Pod
metadata:
  name: web-app-pod
  namespace: production
spec:
  containers:
  - name: web-app-container
    image: web-app:latest
    resources:
      requests:
        memory: "256Mi"
        cpu: "500m"
      limits:
        memory: "512Mi"
        cpu: "1000m"

We notified stakeholders via Slack to avoid conflicting operations.
3. Cordon the Node
We marked the node as unschedulable using:
kubectl cordon node-1
This prevented new pods from being scheduled on the node, reducing disruption risk.
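After cordoning, the node reports SchedulingDisabled in its status (output illustrative; ages and versions will differ):

kubectl get nodes
# NAME     STATUS                     ROLES    AGE   VERSION
# node-1   Ready,SchedulingDisabled   <none>   90d   v1.29.4
# node-2   Ready                      <none>   90d   v1.29.4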
4. Drain the Node
We safely evicted pods with:
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
The --ignore-daemonsets flag preserved DaemonSet-managed pods such as Fluentd for logging, and --delete-emptydir-data allowed eviction of pods using emptyDir volumes, discarding their ephemeral data. We monitored pod eviction using:
kubectl get pods -n production -o wide
This ensured pods were rescheduled correctly, respecting the PDB.
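While the drain runs, the PDB status shows how many voluntary disruptions are currently allowed; if it drops to zero, eviction pauses until replacement pods become ready. A quick check, assuming the PDB defined above:

kubectl get pdb web-app-pdb -n production                        # shows ALLOWED DISRUPTIONS
kubectl get pods -n production -l app=web-app -o wide --watch    # follow rescheduling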
5. Perform Maintenance
We applied security patches by SSHing into the node (using gcloud compute ssh node-1) and running:
sudo apt-get update && sudo apt-get upgrade -y
This patched the container runtime (e.g., containerd) on Ubuntu-based node images; Container-Optimized OS nodes receive such patches through GKE node upgrades instead. We also updated the container image in the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-deployment
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app-container
        image: web-app:1.0.1 # Updated image version
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"
            cpu: "1000m"

We restarted the container runtime and kubelet services on the node:
sudo systemctl restart containerd kubelet
This ensured the patched runtime and node agent were running.
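Rolling out the updated image and confirming completion uses standard rollout commands (a minimal sketch; the manifest file name is illustrative):

kubectl apply -f web-app-deployment.yaml
kubectl rollout status deployment/web-app-deployment -n production
# If the new image misbehaves, revert to the previous revision
kubectl rollout undo deployment/web-app-deployment -n production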
6. Return Node to Service
We marked the node schedulable again with:
kubectl uncordon node-1
We verified its status using:
kubectl get nodes
This confirmed the node was in the Ready state.
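The wide node listing also reports the container runtime version, a quick way to confirm the patch took effect (output illustrative):

kubectl get nodes node-1 -o wide
# The CONTAINER-RUNTIME column should show the patched version, e.g. containerd://1.7.x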
7. Post-Maintenance Validation
We checked application health via the /healthz endpoint, monitored CPU, memory, and latency in Prometheus and Grafana, and reviewed logs in Kibana for errors. We used kube-bench to verify the node’s security configuration post-patching.
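A lightweight validation pass can script the health probe and the CIS benchmark run (a sketch; the application hostname is an assumption, and kube-bench's published Job manifest is used):

# Probe the application's health endpoint through the TLS Ingress (hostname assumed)
curl -fsS https://web-app.example.com/healthz
# Run kube-bench as a Kubernetes Job and read its report
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench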
8. Documentation and Reporting
We documented the process in Confluence, including commands, timestamps, and observations. A PlantUML sequence diagram was created to visualize the workflow.
Result
The maintenance was completed with zero downtime, maintaining 100% application availability. The node was patched, and container images were updated to secure versions. The Pod Disruption Budget ensured at least two pods remained available. Post-maintenance validation confirmed application health, performance, and security. Prometheus and EFK provided real-time insights, enabling quick issue detection. The documented process and sequence diagram enhanced team preparedness for future maintenance. The application continued to serve users securely and reliably, protected against vulnerabilities.
Architectural Diagram