Incident Report: s0 Node Failure

Date: January 25, 2026
Duration: ~18 hours
Severity: Critical
Status: Resolved

Summary

Node s0 (192.168.3.100) became unresponsive, causing a complete site outage. s0 hosted critical services including the PostgreSQL primary database, gateway, and GFS master. The outage required manual intervention to promote the database replica and reschedule services to other nodes.

Timeline

All times in EST (UTC-5)

Day 1 - January 23, 2026

Time (EST) | Event
21:07 | Last recorded SSH activity on s1 (K3s control plane)
23:03 | s0 kubelet's last heartbeat to the K3s API
23:08 | s0 marked as NotReady; pods begin terminating
23:08 | Gateway pod on s0 starts terminating; new pod stuck Pending
23:08 | PostgreSQL primary on s0 becomes unavailable
23:08 | gfs-master on s0 becomes unavailable
23:08 | Site goes down - gateway cannot route traffic

Day 2 - January 24, 2026

Time (EST) | Event
12:12 | Issue discovered while debugging an unrelated auth problem
12:12 | kubectl get nodes shows s0 as NotReady
12:15 | s1 (K3s control plane) also found unresponsive
12:16 | s1 restarted manually
12:16 | Cluster access restored via s1
12:17 | Investigation into s0 begins
16:40 | GitHub Actions workflow triggered to deploy fixes
16:43 | Deploy step fails - cannot reach K3s API (s0 down again)
17:19 | Manual workflow re-trigger attempted
17:20 | s0 confirmed still down
17:22 | Gateway found in CrashLoopBackOff (read-only DB)
17:22 | Postgres primary pod Pending (nodeSelector: s0)
17:23 | Decision made to fail over the database to rp1
17:25 | core-services=true label added to rp2
17:28 | PostgreSQL replica on rp1 promoted via SELECT pg_promote()
17:29 | postgres service patched to point to postgres-replica
17:29 | Gateway pod deleted and rescheduled
17:30 | role=backend label added to rp3
17:30 | gfs-master rescheduled to rp3
17:31 | Gateway comes up successfully
17:32 | Site restored
17:33 | Frontend and backend deployments updated
17:35 | Manifests updated to reflect new architecture
17:40 | Read replicas created on rp2, rp3, rp4
23:26-23:48 | Multiple manual physical reboots attempted to recover s0

Resolution Time

  • Time to detection: ~13 hours (overnight, no alerting)
  • Time to resolution: ~25 minutes (from decision to failover)
  • Total outage duration: ~18 hours

Impact

  • Complete site outage for approximately 18 hours (~13 of them undetected overnight)
  • Users unable to access cloud.eddisonso.com
  • No database data loss (the PostgreSQL replica was in sync with the primary); GFS chunk metadata was lost (see What Went Wrong below)

Root Cause

Update (January 26, 2026): Root cause identified after recurring network failures.

The network interface on s0 repeatedly failed to come up because no DHCPv6 client was installed. The cloud-init-generated network configuration (/etc/network/interfaces.d/50-cloud-init) included:

    iface enp2s0 inet6 dhcp

When ifup attempted to bring up the interface:

  1. IPv4 DHCP succeeded
  2. IPv6 DHCP failed with: No DHCPv6 client software found!
  3. The entire interface was marked as "failed to bring up"
  4. Network connectivity was lost, causing the node to appear dead to the cluster

This explains why the node was "unresponsive" but could be recovered with a reboot - the OS was running, but the network interface was down.

Fix Applied

  1. Removed the inet6 dhcp line from the network configuration (the corrected stanza is sketched after this list)
  2. Disabled cloud-init network management to prevent regeneration:
    echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
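
For reference, a minimal sketch of the corrected stanza in /etc/network/interfaces.d/50-cloud-init, assuming the usual single-interface layout that cloud-init generates (the full original file is not reproduced in this report):

    # DHCPv4 only; the "inet6 dhcp" stanza was removed because no DHCPv6
    # client is installed on s0 and its failure caused ifup to mark the
    # whole interface as failed.
    auto enp2s0
    iface enp2s0 inet dhcp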

What Went Wrong

1. Single Point of Failure

Critical services were concentrated on s0 with no automatic failover:

  • PostgreSQL primary (only writable database)
  • Gateway (only ingress point)
  • GFS master (chunk metadata)

2. GFS Chunk Metadata Lost

The GFS master stored all chunk metadata (file-to-chunk mappings) in memory/local storage on s0. When s0 went down:

  • All file metadata was lost
  • Chunk data still exists on chunkservers (s1, s2, s3) but is now orphaned
  • Files cannot be reconstructed without metadata
  • This is a critical data loss scenario

3. No Alerting

There was no monitoring or alerting configured to detect:

  • Node failures
  • Pod scheduling failures
  • Service unavailability

The outage went unnoticed for ~13 hours overnight.
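
Until proper monitoring exists, even a cron-driven check from a machine outside the cluster would have caught this within minutes. A minimal sketch (the script path, schedule, and notification address are placeholders, not existing tooling):

    #!/usr/bin/env bash
    # check-site.sh - run from a host outside the cluster, e.g. via cron:
    #   */5 * * * * monitor /usr/local/bin/check-site.sh
    set -euo pipefail
    if ! curl -fsS --max-time 10 https://cloud.eddisonso.com/ > /dev/null; then
        # Swap in a real notification channel (mail, Slack webhook, pager).
        echo "cloud.eddisonso.com unreachable at $(date -u)" \
            | mail -s "cloud.eddisonso.com DOWN" oncall@example.com
    fi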

4. Rigid Node Selectors

Services used strict nodeSelector constraints that prevented automatic rescheduling (the manual relabeling workaround is sketched after this list):

  • db-role: primary - only s0
  • core-services: true - only s0
  • role: backend - only s0
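
During the incident the workaround was to satisfy those selectors on surviving nodes by hand, as in the 17:25 and 17:30 timeline entries. Roughly:

    # Add the labels the pinned pods require to healthy nodes
    kubectl label node rp2 core-services=true
    kubectl label node rp3 role=backend
    # Confirm the affected pods reschedule onto the newly labeled nodes
    kubectl get pods -o wide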

5. GitHub Actions Deployment Failures Hidden

The CI/CD pipeline's deploy step used || true which silently ignored connection failures to the Kubernetes API, masking deployment issues.
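
The workflow itself is not reproduced here, but the shape of the fix is to let the deploy step fail the job rather than swallow errors. A sketch of the deploy step's shell, assuming a kubectl apply-style deploy (manifest path and deployment name are illustrative):

    # Before: kubectl apply -f k8s/ || true   (API connection failures ignored)
    # After: any error fails the step and the workflow run
    set -euo pipefail
    kubectl apply -f k8s/ --request-timeout=30s
    kubectl rollout status deployment/gateway --timeout=120s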

What Went Well

1. Database Replica Available

The PostgreSQL streaming replica on rp1 was fully synchronized with the primary. This enabled:

  • Zero data loss
  • Quick promotion to primary (SELECT pg_promote())

2. Manual Failover Was Straightforward

Once the decision was made to fail over (the commands are sketched after this list):

  • Promoting the replica took seconds
  • Updating the service selector was a single kubectl patch
  • Gateway recovered immediately after DB became writable
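
For future reference, the failover reduced to three commands. A sketch with assumed pod, service, and label names (the real manifests may differ):

    # 1. Promote the streaming replica on rp1 (PostgreSQL 12+)
    kubectl exec postgres-replica-0 -- psql -U postgres -c "SELECT pg_promote();"

    # 2. Point the postgres service at the replica pods
    kubectl patch service postgres -p '{"spec":{"selector":{"app":"postgres-replica"}}}'

    # 3. Recreate the gateway pod so it reconnects to the now-writable database
    kubectl delete pod -l app=gateway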

3. Documentation Existed

The CLAUDE.md file documented the node layout and architecture, which helped in understanding the cluster topology during the incident.

Action Items

Immediate (Completed)

  • Promote rp1 PostgreSQL replica to primary
  • Add core-services=true label to rp2
  • Add role=backend label to rp3
  • Update manifests to reflect new architecture
  • Reschedule gateway and gfs-master

Short-term

  • Create read replicas on rp2, rp3, rp4 for redundancy
  • Remove || true from CI/CD deploy steps or add proper error handling

Long-term

  • Implement automatic database failover (Patroni or similar)
  • Distribute critical services across multiple nodes
  • Add pod anti-affinity rules to spread replicas (see the sketch after this list)
  • Implement health checks for external monitoring
  • Consider running multiple gateway replicas
  • Implement GFS master WAL replication for metadata redundancy
  • Consider storing GFS metadata in distributed key-value store (etcd/FoundationDB)
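
For the anti-affinity and multi-replica gateway items, one possible shape, shown here as an illustrative patch rather than the actual manifests (label keys and replica count are assumptions):

    # Run two gateway replicas and keep them on different nodes
    kubectl scale deployment gateway --replicas=2
    kubectl patch deployment gateway --type merge -p '
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: gateway
                topologyKey: kubernetes.io/hostname
    '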

Architecture Changes

Before (s0-dependent)

s0 (down)
├── postgres (primary) - SPOF
├── gateway - SPOF
└── gfs-master - SPOF

rp1
└── postgres-replica (read-only)

After (distributed)

rp1
└── postgres-replica (promoted to primary)

rp2
├── gateway
└── postgres-replica-2 (planned)

rp3
├── gfs-master
└── postgres-replica-3 (planned)

rp4
└── postgres-replica-4 (planned)

January 26 Recurrence - Architecture Validation

On January 26, 2026, s0 experienced another network failure before the root cause was identified and fixed. This time, the website (cloud.eddisonso.com) remained accessible thanks to the architecture changes made after the initial incident:

Service | Impact
Frontend (cloud.eddisonso.com) | No impact - gateway running on rp2
Auth API | No impact
Storage API | No impact
Compute API | Unavailable - requires GFS master on s0

The compute service was the only major service affected because it depends on the GFS master node, which was still pinned to s0. This validates that distributing services across nodes significantly improves resilience.

Remaining Work

  • Move GFS master to a different node or implement redundancy
  • Add replica scheduling for compute-related services

Lessons Learned

  1. Never rely on a single node for critical services - Even with replicas, the inability to fail over automatically created extended downtime.

  2. Alerting is not optional - An 18-hour outage went undetected for ~13 hours. Basic uptime monitoring would have caught it within minutes.

  3. Test failover procedures - The manual failover was successful but had never been tested. Regular DR drills should be scheduled.

  4. CI/CD should fail loudly - Silent failures in deployment pipelines mask real issues and create false confidence.

  5. Check network configuration carefully - IPv6 configuration on an IPv4-only network caused silent failures that were difficult to diagnose.