Kubernetes Infrastructure

Edd Cloud runs on a K3s Kubernetes cluster with mixed-architecture (amd64 and arm64) nodes. All workloads run in the core namespace.

Cluster Overview

Node   Architecture   Labels                                                  Role
s0     amd64          db-role=primary, core-services=true, gfs-master=true    Database primary, GFS master, gateway, HAProxy
rp1    arm64          db-role=replica                                         Database replica
rp2    arm64          backend=true, core-services=true                        Backend services
rp3    arm64          backend=true                                            Backend services
rp4    arm64          backend=true                                            Backend services
s1     amd64          role=chunkserver, control-plane, etcd                   GFS chunkserver, control plane
s2     amd64          role=chunkserver, control-plane, etcd                   GFS chunkserver, control plane
s3     amd64          role=chunkserver, control-plane, etcd                   GFS chunkserver, control plane

Node Scheduling

Services are scheduled based on node labels:

# Core services (s0, rp2)
nodeSelector:
  core-services: "true"

# Backend workloads (rp2, rp3, rp4)
nodeSelector:
  backend: "true"

# Database replica (rp1)
nodeSelector:
  db-role: replica

# GFS master (s0)
nodeSelector:
  gfs-master: "true"

# GFS chunkservers (s1, s2, s3)
nodeSelector:
  role: chunkserver

# Small services
nodeSelector:
  size: mini
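For context, a minimal sketch of where one of these selectors sits inside a full Deployment; the name and image tag here are placeholders, not manifests from the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-backend            # hypothetical name
  namespace: core
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-backend
  template:
    metadata:
      labels:
        app: example-backend
    spec:
      nodeSelector:
        backend: "true"            # schedules onto rp2, rp3, or rp4
      containers:
        - name: app
          image: eddisonso/example:20260215-062053   # hypothetical tag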

Core Services (s0, rp2)

  • gateway
  • gfs-master
  • postgres (primary)
  • notification-service

Backend Services (rp2, rp3, rp4)

  • auth-service
  • simple-file-share-backend
  • simple-file-share-frontend
  • edd-compute
  • cluster-monitor
  • log-service
  • edd-cloud-docs
  • alerting-service

Database (s0, rp1)

  • postgres (primary, s0)
  • postgres-replica (rp1)
  • haproxy (connection pooling, s0)

NATS

  • nats (size=mini node)

GFS Chunkservers (s1, s2, s3)

  • gfs-chunkserver (hostNetwork DaemonSet)
  • k3s control plane + etcd

Deployments

Application Deployments

NAME                         READY   NODES
gateway                      1/1     s0
gfs-master                   1/1     s0
postgres                     1/1     s0
auth-service                 1/1     rp{2-4}
simple-file-share-backend    2/2     rp{2-4}
simple-file-share-frontend   2/2     rp{2-4}
edd-compute                  2/2     rp{2-4}
cluster-monitor              2/2     rp{2-4}
log-service                  1/1     rp{2-4}
notification-service         1/1     rp2 (core-services)
edd-cloud-docs               2/2     rp{2-4}
alerting-service             1/1     rp{2-4}
nats                         1/1     (size=mini)
postgres-replica             1/1     rp1
haproxy                      1/1     s0

Database

PostgreSQL runs in a primary-replica configuration with streaming replication:

  • Primary: postgres on s0 (db-role=primary)
  • Replica: postgres-replica on rp1 (db-role=replica), streaming from s0

The postgres service routes to the primary. HAProxy provides connection pooling with automatic failover (primary active, replica backup).

# postgres (PRIMARY on s0)
replicas: 1
image: postgres:16-alpine
nodeSelector:
  db-role: primary

# postgres-replica (REPLICA on rp1)
replicas: 1
image: postgres:16-alpine
nodeSelector:
  db-role: replica

HAProxy provides connection pooling and failover on s0 (co-located with the primary for minimal latency):

# HAProxy on s0
replicas: 1
nodeSelector:
  db-role: primary
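A minimal sketch of what the primary-active / replica-backup routing could look like in HAProxy's config, assuming the in-cluster service names postgres and postgres-replica; the ConfigMap name and timeout values are assumptions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-config             # assumed name
  namespace: core
data:
  haproxy.cfg: |
    defaults
        mode tcp
        timeout connect 5s
        timeout client  1h
        timeout server  1h

    listen postgres
        bind *:5432
        # primary serves traffic; replica is only used if the primary check fails
        server primary postgres:5432 check
        server replica postgres-replica:5432 check backup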

GFS Chunkservers

GFS chunkservers run as a DaemonSet with hostNetwork: true on s1, s2, and s3:

# DaemonSet
NAME              DESIRED   CURRENT   NODES
gfs-chunkserver   3         3         s1, s2, s3
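A sketch of the DaemonSet shape this implies (hostNetwork plus the role=chunkserver selector); the image tag and container port are assumptions:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gfs-chunkserver
  namespace: core
spec:
  selector:
    matchLabels:
      app: gfs-chunkserver
  template:
    metadata:
      labels:
        app: gfs-chunkserver
    spec:
      hostNetwork: true            # chunkservers are reachable on the node IP
      nodeSelector:
        role: chunkserver
      containers:
        - name: gfs-chunkserver
          image: eddisonso/gfs-chunkserver:20260215-062053   # hypothetical tag
          ports:
            - containerPort: 9080   # GFS client port (see Network Policies)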

Persistent Storage

PostgreSQL

Each database node has its own 5Gi PVC:

# Primary (s0)
name: postgres-data
# Replica (rp1)
name: postgres-replica-data

spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources:
    requests:
      storage: 5Gi

GFS Master

volumes:
  - name: master-data
    hostPath:
      path: /data/gfs
      type: DirectoryOrCreate

GFS Chunkservers

Each chunkserver uses a hostPath volume on its node:

volumes:
  - name: chunk-data
    hostPath:
      path: /data/gfs/chunkserver
      type: DirectoryOrCreate

NATS JetStream

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nats-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources:
    requests:
      storage: 5Gi
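How the claim might be wired into the NATS pod, with JetStream pointed at it; the mount path, image tag, and server flags are assumptions:

containers:
  - name: nats
    image: nats:2.10-alpine                      # assumed image
    args: ["--jetstream", "--store_dir", "/data"]
    volumeMounts:
      - name: nats-data
        mountPath: /data
volumes:
  - name: nats-data
    persistentVolumeClaim:
      claimName: nats-data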

Services

Service                      Type                   Ports
gateway                      LoadBalancer           80, 443, 2222, 8000-8999
auth-service                 ClusterIP              80
simple-file-share-backend    ClusterIP              80
simple-file-share-frontend   ClusterIP              80
edd-compute                  ClusterIP              80
cluster-monitor              ClusterIP              80
log-service                  ClusterIP              50051 (gRPC), 80 (HTTP)
notification-service         ClusterIP              80
alerting-service             ClusterIP              80
edd-cloud-docs               ClusterIP              80
postgres                     ClusterIP              5432
postgres-replica             ClusterIP              5432
haproxy                      ClusterIP              5432
gfs-master                   ClusterIP + NodePort   9000, 30900
nats                         ClusterIP              4222 (client), 8222 (monitor)

Network Policies

Core Namespace Isolation

The core namespace has a NetworkPolicy restricting ingress:

  • Allow all traffic within the core namespace (pod-to-pod)
  • Allow traffic from node network (192.168.0.0/16)
  • Allow traffic from pod overlay network (10.42.0.0/16)
  • Allow external traffic on gateway ports: 2222 (SSH), 18080 (HTTP), 8443 (HTTPS)
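A sketch of how those four rules could be expressed as a single NetworkPolicy; the policy name is an assumption:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: core-isolation             # assumed name
  namespace: core
spec:
  podSelector: {}                  # applies to every pod in core
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}          # pod-to-pod within core
        - ipBlock:
            cidr: 192.168.0.0/16   # node network
        - ipBlock:
            cidr: 10.42.0.0/16     # pod overlay network
    - ports:                       # gateway ports, open to any source
        - port: 2222
        - port: 18080
        - port: 8443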

GFS Chunkserver Access

A Calico GlobalNetworkPolicy (allow-host-chunkserver) permits traffic on:

  • Port 9080 (GFS client)
  • Port 9081 (GFS replication)
  • Port 22 (SSH)
  • Port 6443 (K3s API)
  • Port 10250 (Kubelet)
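The policy name comes from the cluster; the selector, protocol, and rule shape in this sketch are assumptions:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-host-chunkserver
spec:
  selector: all()                  # assumed scope
  ingress:
    - action: Allow
      protocol: TCP
      destination:
        ports: [9080, 9081, 22, 6443, 10250]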

MetalLB

MetalLB provides LoadBalancer IP allocation in L2 mode:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: compute-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.3.150-192.168.3.200
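In MetalLB's CRD-based configuration, L2 mode also needs an L2Advertisement referencing the pool; a sketch (the resource name is an assumption):

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: compute-l2                 # assumed name
  namespace: metallb-system
spec:
  ipAddressPools:
    - compute-pool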

Secrets

Secret                        Purpose
postgres-credentials          PostgreSQL admin and replication passwords
compute-db-credentials        Compute service database access
auth-db-credentials           Auth service database access
sfs-db-credentials            File sharing service database access
notification-db-credentials   Notification service database access
edd-cloud-auth                JWT_SECRET, default credentials
edd-cloud-admin               Admin username (shared across services)
service-api-key               Inter-service authentication key
gfs-jwt-secret                GFS JWT signing secret
discord-webhook-url           Discord webhook for alerting
eddisonso-wildcard-tls        Wildcard TLS certificate
regcred                       Docker registry credentials
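How a service might consume one of these, pulling a password from postgres-credentials into an environment variable; the key name is an assumption:

env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password              # assumed key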

Maintenance

Image Cleanup CronJob

Unused container images are pruned daily across all nodes:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: image-cleanup
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid

The job runs crictl rmi --prune on each node via a privileged container with host access, and tolerates all taints (tolerations: [{operator: Exists}]) so it can schedule on every node, including the control plane, as sketched below.
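A sketch of the pod template those sentences imply; the image is hypothetical (any image bundling crictl works) and the socket path assumes a stock K3s containerd setup:

jobTemplate:
  spec:
    template:
      spec:
        restartPolicy: OnFailure
        tolerations:
          - operator: Exists       # match every taint, incl. control-plane
        containers:
          - name: cleanup
            image: rancher/k3s:v1.29.4-k3s1   # hypothetical image
            securityContext:
              privileged: true
            command:
              - crictl
              - --runtime-endpoint
              - unix:///run/k3s/containerd/containerd.sock
              - rmi
              - --prune
            volumeMounts:
              - name: containerd-sock
                mountPath: /run/k3s/containerd/containerd.sock
        volumes:
          - name: containerd-sock
            hostPath:
              path: /run/k3s/containerd/containerd.sock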

CI/CD

Deployments are automated via GitHub Actions:

  1. Push to main branch
  2. GitHub Actions detects changed services
  3. Docker images built and tagged with a UTC timestamp (YYYYMMDD-HHMMSS)
  4. Images pushed to Docker Hub
  5. Kubernetes deployment updated via kubectl set image
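The timestamp tag from step 3 can be produced by a step like this (the step and variable names are assumptions):

- name: Generate image tag
  run: echo "TAG=$(date -u +%Y%m%d-%H%M%S)" >> "$GITHUB_ENV"

The deploy step then pins that tag: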
# CI/CD generates timestamp tags - never use 'latest'
- name: Deploy to Kubernetes
  run: |
    kubectl --kubeconfig=kubeconfig set image deployment/myapp \
      myapp=eddisonso/myapp:20260215-062053

Images must always use timestamp tags (e.g., eddisonso/ecloud-auth:20260215-062053). The latest tag on Docker Hub may be stale.