Incident Report: s0 Node Failure

Date: January 25, 2026
Duration: ~18 hours
Severity: Critical
Status: Resolved

Summary

Node s0 (192.168.3.100) became unresponsive, causing a complete site outage. s0 hosted critical services including the PostgreSQL primary database, gateway, and GFS master. The outage required manual intervention to promote the database replica and reschedule services to other nodes.

Timeline

All times in EST (UTC-5)

Day 1 - January 23, 2026

Time (EST) | Event
21:07 | Last recorded SSH activity on s1 (K3s control plane)
23:03 | s0 kubelet's last heartbeat to the K3s API
23:08 | s0 marked as NotReady; pods begin terminating
23:08 | Gateway pod on s0 starts terminating; new pod stuck Pending
23:08 | PostgreSQL primary on s0 becomes unavailable
23:08 | gfs-master on s0 becomes unavailable
23:08 | Site goes down - gateway cannot route traffic

Day 2 - January 24, 2026

Time (EST) | Event
12:12 | Issue discovered while debugging an unrelated auth problem
12:12 | kubectl get nodes shows s0 as NotReady
12:15 | s1 (K3s control plane) also found unresponsive
12:16 | s1 restarted manually
12:16 | Cluster access restored via s1
12:17 | Investigation into s0 begins
16:40 | GitHub Actions workflow triggered to deploy fixes
16:43 | Deploy step fails - cannot reach K3s API (s0 down again)
17:19 | Manual workflow re-trigger attempted
17:20 | s0 confirmed still down
17:22 | Gateway found in CrashLoopBackOff (read-only DB)
17:22 | Postgres primary pod Pending (nodeSelector: s0)
17:23 | Decision made to fail over the database to rp1
17:25 | core-services=true label added to rp2
17:28 | PostgreSQL replica on rp1 promoted via SELECT pg_promote()
17:29 | postgres service patched to point to postgres-replica
17:29 | Gateway pod deleted and rescheduled
17:30 | role=backend label added to rp3
17:30 | gfs-master rescheduled to rp3
17:31 | Gateway comes up successfully
17:32 | Site restored
17:33 | Frontend and backend deployments updated
17:35 | Manifests updated to reflect new architecture
17:40 | Read replicas created on rp2, rp3, rp4
23:26-23:48 | Multiple manual physical reboots attempted to recover s0

Resolution Time

  • Time to detection: ~13 hours (overnight, no alerting)
  • Time to resolution: ~25 minutes (from decision to failover)
  • Total outage duration: ~18 hours

Impact

  • Complete site outage for approximately 18 hours (~13 of them undetected overnight)
  • Users unable to access cloud.eddisonso.com
  • No database data loss (the PostgreSQL replica was in sync with the primary); GFS chunk metadata was lost (see What Went Wrong below)

Root Cause

Update (January 26, 2026): Root cause identified after recurring network failures.

The network interface on s0 repeatedly failed to come up because no DHCPv6 client was installed. The cloud-init-generated network configuration (/etc/network/interfaces.d/50-cloud-init) included:

    iface enp2s0 inet6 dhcp

When ifup attempted to bring up the interface:

  1. IPv4 DHCP succeeded
  2. IPv6 DHCP failed with: No DHCPv6 client software found!
  3. The entire interface was marked as "failed to bring up"
  4. Network connectivity was lost, causing the node to appear dead to the cluster

This explains why the node was "unresponsive" but could be recovered with a reboot - the OS was running, but the network interface was down.

Fix Applied

  1. Removed the inet6 dhcp line from the network configuration (the corrected stanza is sketched after this list)
  2. Disabled cloud-init network management to prevent regeneration:
    echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
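
For reference, a minimal sketch of the corrected stanza in /etc/network/interfaces.d/50-cloud-init, assuming the usual single-interface layout that cloud-init generates (the full original file is not reproduced in this report):

    # DHCPv4 only; the "inet6 dhcp" stanza was removed because no DHCPv6
    # client is installed on s0 and its failure caused ifup to mark the
    # whole interface as failed.
    auto enp2s0
    iface enp2s0 inet dhcp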

What Went Wrong

1. Single Point of Failure

Critical services were concentrated on s0 with no automatic failover:

  • PostgreSQL primary (only writable database)
  • Gateway (only ingress point)
  • GFS master (chunk metadata)

2. GFS Chunk Metadata Lost

The GFS master stored all chunk metadata (file-to-chunk mappings) in memory/local storage on s0. When s0 went down:

  • All file metadata was lost
  • Chunk data still exists on chunkservers (s1, s2, s3) but is now orphaned
  • Files cannot be reconstructed without metadata
  • This is a critical data loss scenario

3. No Alerting

There was no monitoring or alerting configured to detect:

  • Node failures
  • Pod scheduling failures
  • Service unavailability

The outage went unnoticed for ~13 hours overnight.
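
Until proper monitoring exists, even a cron-driven check from a machine outside the cluster would have caught this within minutes. A minimal sketch (the script path, schedule, and notification address are placeholders, not existing tooling):

    #!/usr/bin/env bash
    # check-site.sh - run from a host outside the cluster, e.g. via cron:
    #   */5 * * * * monitor /usr/local/bin/check-site.sh
    set -euo pipefail
    if ! curl -fsS --max-time 10 https://cloud.eddisonso.com/ > /dev/null; then
        # Swap in a real notification channel (mail, Slack webhook, pager).
        echo "cloud.eddisonso.com unreachable at $(date -u)" \
            | mail -s "cloud.eddisonso.com DOWN" oncall@example.com
    fi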

4. Rigid Node Selectors

Services used strict nodeSelector constraints that prevented automatic rescheduling (the manual relabeling workaround is sketched after this list):

  • db-role: primary - only s0
  • core-services: true - only s0
  • role: backend - only s0
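
During the incident the workaround was to satisfy those selectors on surviving nodes by hand, as in the 17:25 and 17:30 timeline entries. Roughly:

    # Add the labels the pinned pods require to healthy nodes
    kubectl label node rp2 core-services=true
    kubectl label node rp3 role=backend
    # Confirm the affected pods reschedule onto the newly labeled nodes
    kubectl get pods -o wide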

5. GitHub Actions Deployment Failures Hidden

The CI/CD pipeline's deploy step used || true which silently ignored connection failures to the Kubernetes API, masking deployment issues.
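
The workflow itself is not reproduced here, but the shape of the fix is to let the deploy step fail the job rather than swallow errors. A sketch of the deploy step's shell, assuming a kubectl apply-style deploy (manifest path and deployment name are illustrative):

    # Before: kubectl apply -f k8s/ || true   (API connection failures ignored)
    # After: any error fails the step and the workflow run
    set -euo pipefail
    kubectl apply -f k8s/ --request-timeout=30s
    kubectl rollout status deployment/gateway --timeout=120s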

What Went Well

1. Database Replica Available

The PostgreSQL streaming replica on rp1 was fully synchronized with the primary. This enabled:

  • Zero data loss
  • Quick promotion to primary (SELECT pg_promote())

2. Manual Failover Was Straightforward

Once the decision was made to fail over (the commands are sketched after this list):

  • Promoting the replica took seconds
  • Updating the service selector was a single kubectl patch
  • Gateway recovered immediately after DB became writable
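
For future reference, the failover reduced to three commands. A sketch with assumed pod, service, and label names (the real manifests may differ):

    # 1. Promote the streaming replica on rp1 (PostgreSQL 12+)
    kubectl exec postgres-replica-0 -- psql -U postgres -c "SELECT pg_promote();"

    # 2. Point the postgres service at the replica pods
    kubectl patch service postgres -p '{"spec":{"selector":{"app":"postgres-replica"}}}'

    # 3. Recreate the gateway pod so it reconnects to the now-writable database
    kubectl delete pod -l app=gateway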

3. Documentation Existed

The CLAUDE.md file documented the node layout and architecture, which helped in understanding the cluster topology during the incident.

Action Items

Immediate (Completed)

  • Promote rp1 PostgreSQL replica to primary
  • Add core-services=true label to rp2
  • Add role=backend label to rp3
  • Update manifests to reflect new architecture
  • Reschedule gateway and gfs-master

Short-term

  • Create read replicas on rp2, rp3, rp4 for redundancy
  • Remove || true from CI/CD deploy steps or add proper error handling

Long-term

  • Implement automatic database failover (Patroni or similar)
  • Distribute critical services across multiple nodes
  • Add pod anti-affinity rules to spread replicas (see the sketch after this list)
  • Implement health checks for external monitoring
  • Consider running multiple gateway replicas
  • Implement GFS master WAL replication for metadata redundancy
  • Consider storing GFS metadata in distributed key-value store (etcd/FoundationDB)
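
For the anti-affinity and multi-replica gateway items, one possible shape, shown here as an illustrative patch rather than the actual manifests (label keys and replica count are assumptions):

    # Run two gateway replicas and keep them on different nodes
    kubectl scale deployment gateway --replicas=2
    kubectl patch deployment gateway --type merge -p '
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: gateway
                topologyKey: kubernetes.io/hostname
    '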

Architecture Changes

Before (s0-dependent)

s0 (down)
├── postgres (primary) - SPOF
├── gateway - SPOF
└── gfs-master - SPOF

rp1
└── postgres-replica (read-only)

After (distributed)

rp1
└── postgres-replica (promoted to primary)

rp2
├── gateway
└── postgres-replica-2 (planned)

rp3
├── gfs-master
└── postgres-replica-3 (planned)

rp4
└── postgres-replica-4 (planned)

January 26 Recurrence - Architecture Validation

On January 26, 2026, s0 experienced another network failure before the root cause was identified and fixed. This time, the website (cloud.eddisonso.com) remained accessible thanks to the architecture changes made after the initial incident:

Service | Impact
Frontend (cloud.eddisonso.com) | No impact - gateway running on rp2
Auth API | No impact
Storage API | No impact
Compute API | Unavailable - requires GFS master on s0

The compute service was the only major service affected because it depends on the GFS master node, which was still pinned to s0. This validates that distributing services across nodes significantly improves resilience.

Remaining Work

  • Move GFS master to a different node or implement redundancy
  • Add replica scheduling for compute-related services

Lessons Learned

  1. Never rely on a single node for critical services - Even with replicas, the inability to fail over automatically created extended downtime.

  2. Alerting is not optional - An 18-hour outage went undetected for ~13 hours. Basic uptime monitoring would have caught it within minutes.

  3. Test failover procedures - The manual failover was successful but had never been tested. Regular DR drills should be scheduled.

  4. CI/CD should fail loudly - Silent failures in deployment pipelines mask real issues and create false confidence.

  5. Check network configuration carefully - IPv6 configuration on an IPv4-only network caused silent failures that were difficult to diagnose.