Incident Report: s0 Node Failure
Date: January 23-24, 2026 Duration: ~18 hours Severity: Critical Status: Resolved
Summary
Node s0 (192.168.3.100) became unresponsive, causing a complete site outage. s0 hosted critical services including the PostgreSQL primary database, gateway, and GFS master. The outage required manual intervention to promote the database replica and reschedule services to other nodes.
Timeline
All times in EST (UTC-5)
Day 1 - January 23, 2026
| Time (EST) | Event |
|---|---|
| 21:07 | Last recorded SSH activity on s1 (K3s control plane) |
| 23:03 | s0 kubelet's last heartbeat to K3s API |
| 23:08 | s0 marked as NotReady; pods begin terminating |
| 23:08 | Gateway pod on s0 starts terminating; new pod stuck Pending |
| 23:08 | PostgreSQL primary on s0 becomes unavailable |
| 23:08 | gfs-master on s0 becomes unavailable |
| 23:08 | Site goes down - gateway cannot route traffic |
Day 2 - January 24, 2026
| Time (EST) | Event |
|---|---|
| 12:12 | Issue discovered while debugging unrelated auth problem |
| 12:12 | kubectl get nodes shows s0 as NotReady |
| 12:15 | s1 (K3s control plane) also found unresponsive |
| 12:16 | s1 restarted manually |
| 12:16 | Cluster access restored via s1 |
| 12:17 | Investigation into s0 begins |
| 16:40 | GitHub Actions workflow triggered to deploy fixes |
| 16:43 | Deploy step fails - cannot reach K3s API (s0 down again) |
| 17:19 | Manual workflow re-trigger attempted |
| 17:20 | s0 confirmed still down |
| 17:22 | Gateway found in CrashLoopBackOff (read-only DB) |
| 17:22 | Postgres primary pod Pending (nodeSelector: s0) |
| 17:23 | Decision made to fail over the database to rp1 |
| 17:25 | core-services=true label added to rp2 |
| 17:28 | PostgreSQL replica on rp1 promoted via SELECT pg_promote() |
| 17:29 | postgres service patched to point to postgres-replica |
| 17:29 | Gateway pod deleted and rescheduled |
| 17:30 | role=backend label added to rp3 |
| 17:30 | gfs-master rescheduled to rp3 |
| 17:31 | Gateway comes up successfully |
| 17:32 | Site restored |
| 17:33 | Frontend and backend deployments updated |
| 17:35 | Manifests updated to reflect new architecture |
| 17:40 | Read replicas created on rp2, rp3, rp4 |
| 23:26-23:48 | Multiple manual physical reboots attempted to recover s0 |
Resolution Time
- Time to detection: ~13 hours (overnight, no alerting)
- Time to resolution: ~25 minutes (from decision to failover)
- Total outage duration: ~18 hours
Impact
- Complete site outage for approximately 18 hours (undetected overnight for ~13 of those hours)
- Users unable to access cloud.eddisonso.com
- No data loss (replica was in sync with primary)
Root Cause
Update (January 26, 2026): Root cause identified after recurring network failures.
The network interface on s0 repeatedly failed due to a missing DHCPv6 client. The cloud-init generated network configuration (/etc/network/interfaces.d/50-cloud-init) included:
iface enp2s0 inet6 dhcp
When ifup attempted to bring up the interface:
- IPv4 DHCP succeeded
- IPv6 DHCP failed with "No DHCPv6 client software found!"
- The entire interface was marked as "failed to bring up"
- Network connectivity was lost, causing the node to appear dead to the cluster
This explains why the node was "unresponsive" but could be recovered with a reboot - the OS was running, but the network interface was down.
Fix Applied
- Removed the `inet6 dhcp` line from the network configuration
- Disabled cloud-init network management to prevent regeneration:
  echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
What Went Wrong
1. Single Point of Failure
Critical services were concentrated on s0 with no automatic failover:
- PostgreSQL primary (only writable database)
- Gateway (only ingress point)
- GFS master (chunk metadata)
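One way to surface this kind of concentration before it bites is to check where the critical pods are actually scheduled; a quick sketch (the grep pattern assumes the pod names contain these strings):

```bash
# List node placement for the critical workloads; before the incident all three were on s0
kubectl get pods -A -o wide | grep -E 'postgres|gateway|gfs'
```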
2. GFS Chunk Metadata Lost
The GFS master stored all chunk metadata (file-to-chunk mappings) in memory/local storage on s0. When s0 went down:
- All file metadata was lost
- Chunk data still exists on chunkservers (s1, s2, s3) but is now orphaned
- Files cannot be reconstructed without metadata
- This is a critical data loss scenario
3. No Alerting
There was no monitoring or alerting configured to detect:
- Node failures
- Pod scheduling failures
- Service unavailability
The outage went unnoticed for ~13 hours overnight.
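Even a minimal external probe would have caught this within minutes; a sketch, assuming a machine outside the cluster with cron and some webhook endpoint to notify (the URL is a placeholder):

```bash
#!/usr/bin/env bash
# Minimal uptime check: cron this every few minutes from outside the cluster.
# WEBHOOK_URL is a placeholder for whatever chat/paging integration is in use.
set -euo pipefail
WEBHOOK_URL="https://example.invalid/alert-webhook"

if ! curl -fsS --max-time 10 -o /dev/null https://cloud.eddisonso.com; then
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d '{"text":"cloud.eddisonso.com is unreachable"}' "$WEBHOOK_URL"
fi
```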
4. Rigid Node Selectors
Services used strict nodeSelector constraints that prevented automatic rescheduling:
- `db-role: primary` - only s0
- `core-services: true` - only s0
- `role: backend` - only s0
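For illustration, a manifest pinned this way leaves the scheduler waiting indefinitely once the labeled node goes NotReady (names below are representative, not the exact manifests):

```bash
# Hedged sketch: a hard nodeSelector like this is why the Postgres primary sat Pending
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: postgres-primary        # representative name
spec:
  nodeSelector:
    db-role: primary            # only s0 carried this label, so the pod cannot reschedule
  containers:
    - name: postgres
      image: postgres:16
EOF
```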
5. GitHub Actions Deployment Failures Hidden
The CI/CD pipeline's deploy step used `|| true`, which silently ignored connection failures to the Kubernetes API, masking deployment issues.
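Roughly, the difference looks like this (a reconstruction; the actual workflow step is not reproduced here):

```bash
# Before (hedged reconstruction): a failed apply is swallowed and the job stays green
kubectl apply -f manifests/ || true

# After: fail the step loudly on any error, including an unreachable API server
set -euo pipefail
kubectl apply -f manifests/
```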
What Went Well
1. Database Replica Available
The PostgreSQL streaming replica on rp1 was fully synchronized with the primary. This enabled:
- Zero data loss
- Quick promotion to primary (`SELECT pg_promote()`)
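Before promotion, the replica's sync state can be confirmed directly; a minimal check, assuming the replica pod is reachable via kubectl exec (the pod name here is an assumption):

```bash
# Confirm the rp1 replica is still in recovery and has replayed everything it received
kubectl exec postgres-replica-0 -- psql -U postgres -c \
  "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
```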
2. Manual Failover Was Straightforward
Once the decision was made to fail over:
- Promoting the replica took seconds
- Updating the service selector was a single kubectl patch
- Gateway recovered immediately after DB became writable
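The entire failover reduces to a few commands; a hedged sketch (pod names, service selector, and labels are approximations of the real manifests):

```bash
# 1. Promote the streaming replica on rp1 to a writable primary
kubectl exec postgres-replica-0 -- psql -U postgres -c "SELECT pg_promote();"

# 2. Point the postgres Service at the replica pods (selector key/value are assumptions)
kubectl patch service postgres -p '{"spec":{"selector":{"app":"postgres-replica"}}}'

# 3. Recreate the gateway pod so it reconnects to the now-writable database
kubectl delete pod -l app=gateway
```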
3. Documentation Existed
The CLAUDE.md file documented the node layout and architecture, which helped in understanding the cluster topology during the incident.
Action Items
Immediate (Completed)
- Promote rp1 PostgreSQL replica to primary
- Add `core-services=true` label to rp2
- Add `role=backend` label to rp3
- Update manifests to reflect new architecture
- Reschedule gateway and gfs-master
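The relabeling itself was two one-line commands (label keys taken from the timeline above):

```bash
# Allow rp2 to host core services and rp3 to host backend workloads
kubectl label node rp2 core-services=true
kubectl label node rp3 role=backend
```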
Short-term
- Create read replicas on rp2, rp3, rp4 for redundancy
- Remove `|| true` from CI/CD deploy steps or add proper error handling
Long-term
- Implement automatic database failover (Patroni or similar)
- Distribute critical services across multiple nodes
- Add pod anti-affinity rules to spread replicas (see the sketch after this list)
- Implement health checks for external monitoring
- Consider running multiple gateway replicas
- Implement GFS master WAL replication for metadata redundancy
- Consider storing GFS metadata in distributed key-value store (etcd/FoundationDB)
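As a sketch of the anti-affinity and multi-replica items above (the Deployment name, labels, and image are assumptions, not the real manifests), spreading two gateway replicas across nodes might look like:

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway                  # assumed name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: gateway
              topologyKey: kubernetes.io/hostname   # never co-locate two replicas on one node
      containers:
        - name: gateway
          image: gateway:latest  # placeholder image
EOF
```

With two replicas pinned to different nodes, losing any single node leaves at least one ingress path alive.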
Architecture Changes
Before (s0-dependent)
s0 (down)
├── postgres (primary) - SPOF
├── gateway - SPOF
└── gfs-master - SPOF
rp1
└── postgres-replica (read-only)
After (distributed)
rp1
└── postgres-replica (promoted to primary)
rp2
├── gateway
└── postgres-replica-2 (planned)
rp3
├── gfs-master
└── postgres-replica-3 (planned)
rp4
└── postgres-replica-4 (planned)
January 26 Recurrence - Architecture Validation
On January 26, 2026, s0 experienced another network failure before the root cause was identified and fixed. This time, the website (cloud.eddisonso.com) remained accessible thanks to the architecture changes made after the initial incident:
| Service | Impact |
|---|---|
| Frontend (cloud.eddisonso.com) | No impact - gateway running on rp2 |
| Auth API | No impact |
| Storage API | No impact |
| Compute API | Unavailable - requires GFS master on s0 |
The compute service was the only major service affected because it depends on the GFS master node, which was still pinned to s0. This validates that distributing services across nodes significantly improves resilience.
Remaining Work
- Move GFS master to a different node or implement redundancy
- Add replica scheduling for compute-related services
Lessons Learned
- Never rely on a single node for critical services - Even with a synchronized replica available, the inability to fail over automatically turned a single node failure into extended downtime.
- Alerting is not optional - An outage lasting roughly 18 hours went unnoticed for 13 of them. Basic uptime monitoring would have detected it in minutes.
- Test failover procedures - The manual failover succeeded, but the procedure had never been rehearsed. Regular DR drills should be scheduled.
- CI/CD should fail loudly - Silent failures in deployment pipelines mask real issues and create false confidence.
- Check network configuration carefully - An IPv6 DHCP stanza on an IPv4-only network, with no DHCPv6 client installed, caused silent failures that were difficult to diagnose.