Architecture

System Overview

Edd Cloud is a self-hosted cloud platform running on a mixed-architecture K3s cluster. It provides file storage, container compute, authentication, and cluster monitoring through a set of microservices coordinated by a custom gateway and event bus.

Node Layout

The cluster consists of 8 nodes with mixed architectures:

| Node | Architecture | OS | Role | Labels |
| --- | --- | --- | --- | --- |
| s0 | amd64 | Debian 13 (kernel 6.12) | Database primary, GFS master, gateway, HAProxy | db-role=primary, core-services=true, gfs-master=true |
| rp1 | arm64 | Debian 11 (kernel 6.1) | Database replica | db-role=replica |
| rp2 | arm64 | Debian 11 (kernel 6.1) | Backend services | backend=true |
| rp3 | arm64 | Debian 11 (kernel 6.1) | Backend services | backend=true |
| rp4 | arm64 | Debian 11 (kernel 6.1) | Backend services | backend=true |
| s1 | amd64 | Debian 13 (kernel 6.12) | Control plane, etcd, GFS chunkserver | hostNetwork |
| s2 | amd64 | Debian 13 (kernel 6.12) | Control plane, etcd, GFS chunkserver | hostNetwork |
| s3 | amd64 | Debian 13 (kernel 6.12) | Control plane, etcd, GFS chunkserver | hostNetwork |

Service Distribution

| Node(s) | Services |
| --- | --- |
| s0 | gateway, gfs-master, postgres (primary), haproxy |
| rp1 | postgres-replica |
| rp2, rp3, rp4 | auth-service, edd-compute, log-service, cluster-monitor, alerting-service, notification-service, simple-file-share, edd-cloud-docs, edd-registry, nats |
| s1, s2, s3 | k3s control plane, etcd, gfs-chunkservers (hostNetwork) |

Network Architecture

Routing and Load Balancing

The cluster uses a custom networking stack instead of the default K3s Traefik ingress and ServiceLB:

  • Traefik and k3s ServiceLB (Klipper) are disabled in /etc/rancher/k3s/config.yaml
  • MetalLB (L2 mode) allocates virtual IPs for LoadBalancer-type services
  • Calico provides pod networking using a VXLAN overlay
  • Gateway VIP: 192.168.3.200 (allocated by MetalLB)

All external traffic enters the cluster through the custom edd-gateway, which handles:

  • HTTP/1.1 request routing
  • TLS termination (HTTPS)
  • SSH tunneling (port 2222) for container access
  • WebSocket upgrades

Note: The gateway does not currently support HTTP/2 or gRPC pass-through.

External Domains

| Domain | Purpose | Backend Target |
| --- | --- | --- |
| cloud.eddisonso.com | Main dashboard | simple-file-share-frontend:80 |
| auth.cloud.eddisonso.com | Authentication API | auth-service:80 |
| storage.cloud.eddisonso.com | Storage API | simple-file-share-backend:80 |
| compute.cloud.eddisonso.com | Compute API | edd-compute:80 |
| health.cloud.eddisonso.com | Health/Monitoring API, Log streaming | cluster-monitor:80, log-service:80 |
| notifications.cloud.eddisonso.com | Notification API | notification-service:80 |
| docs.cloud.eddisonso.com | Documentation | edd-cloud-docs:80 |
| registry.cloud.eddisonso.com | Container Registry (OCI v2) | edd-registry:80 |

cloud-api.eddisonso.com is DEPRECATED. Legacy routes on this domain still exist in the gateway configuration, but all new development uses the *.cloud.eddisonso.com subdomains.
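
The domain table above amounts to a host-based routing table. The following is a minimal sketch of that lookup, not the gateway's actual code: `ROUTES` and `resolve_backends` are hypothetical names, and because this page does not describe how the gateway chooses between the two backends behind health.cloud.eddisonso.com, the sketch simply returns both candidates.

```python
# Hypothetical routing table mirroring the External Domains table.
# Illustrative only; the real gateway's data structures are not shown on this page.
ROUTES = {
    "cloud.eddisonso.com": ["simple-file-share-frontend:80"],
    "auth.cloud.eddisonso.com": ["auth-service:80"],
    "storage.cloud.eddisonso.com": ["simple-file-share-backend:80"],
    "compute.cloud.eddisonso.com": ["edd-compute:80"],
    "health.cloud.eddisonso.com": ["cluster-monitor:80", "log-service:80"],
    "notifications.cloud.eddisonso.com": ["notification-service:80"],
    "docs.cloud.eddisonso.com": ["edd-cloud-docs:80"],
    "registry.cloud.eddisonso.com": ["edd-registry:80"],
}


def resolve_backends(host_header: str) -> list:
    """Return the candidate backends for a Host header, or [] if the host is unknown."""
    # The Host header may carry a port ("name:443") and arbitrary casing.
    host = host_header.split(":", 1)[0].strip().lower()
    return ROUTES.get(host, [])
```

For example, `resolve_backends("storage.cloud.eddisonso.com")` yields the storage backend, while an unknown host yields an empty list that a caller could turn into a 404.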

Internal Services

| Service | Type | Ports | Protocol |
| --- | --- | --- | --- |
| gateway | LoadBalancer | 80, 443, 2222, 8000-8999 | HTTP/HTTPS/SSH/Container ingress |
| auth-service | ClusterIP | 80 | HTTP |
| simple-file-share-backend | ClusterIP | 80 | HTTP |
| simple-file-share-frontend | ClusterIP | 80 | HTTP |
| edd-compute | ClusterIP | 80 | HTTP |
| cluster-monitor | ClusterIP | 80 | HTTP |
| log-service | ClusterIP | 50051, 80 | gRPC, HTTP |
| notification-service | ClusterIP | 80 | HTTP, WebSocket |
| alerting-service | ClusterIP | 80 | HTTP (health checks), NATS consumer |
| gfs-master | ClusterIP | 9000 | gRPC |
| gfs-chunkserver-N | hostNetwork | 9080, 9081 | TCP (client), TCP (replication) |
| postgres | ClusterIP | 5432 | PostgreSQL |
| haproxy | ClusterIP | 5432 | PostgreSQL |
| edd-registry | ClusterIP | 80 | HTTP (OCI v2) |
| nats | ClusterIP | 4222, 8222 | NATS, HTTP |

DNS

Services communicate internally via Kubernetes DNS:

<service>.<namespace>.svc.cluster.local

Examples:

  • postgres.core.svc.cluster.local:5432
  • gfs-master.core.svc.cluster.local:9000
  • nats.core.svc.cluster.local:4222
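
As a small illustration of the naming pattern above, a helper like the following could assemble in-cluster addresses. `service_dns` is a hypothetical name introduced for this sketch; the `core` namespace is taken from the examples above.

```python
from typing import Optional


def service_dns(service: str, namespace: str, port: Optional[int] = None) -> str:
    """Build <service>.<namespace>.svc.cluster.local, optionally with a port."""
    name = f"{service}.{namespace}.svc.cluster.local"
    return f"{name}:{port}" if port is not None else name


print(service_dns("postgres", "core", 5432))  # postgres.core.svc.cluster.local:5432
```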

Request Flows

Storage Request Flow

  1. Client makes HTTPS request to storage.cloud.eddisonso.com
  2. Gateway terminates TLS and routes to Storage API
  3. Storage API authenticates via JWT
  4. For file operations:
    • Write: Storage API -> GFS Master (allocate chunk) -> Chunkservers (2PC write)
    • Read: Storage API -> GFS Master (get locations) -> Chunkserver (read data)
  5. Response returned to client

Compute Request Flow

  1. Client makes HTTPS request to compute.cloud.eddisonso.com
  2. Gateway terminates TLS and routes to Compute API
  3. Compute API authenticates via JWT
  4. Compute API interacts with Kubernetes API for container operations
  5. Container status updates streamed via WebSocket

SSH Access Flow

  1. Client connects to cloud.eddisonso.com:22 via SSH (port 22 maps to gateway's internal port 2222)
  2. Gateway accepts the SSH connection and authenticates using user-uploaded SSH keys
  3. Gateway proxies the SSH session to the target user container pod

Data Persistence

GFS (Distributed File System)

A custom Go implementation of a distributed file system:

  • Chunk Size: 64MB
  • Replication Factor: 3
  • Consistency: Two-Phase Commit (2PC)
  • Write Quorum: 2 of 3 replicas
  • Master: Runs on s0, manages metadata and chunk placement
  • Chunkservers: Run on s1, s2, s3 using hostNetwork for direct data transfer
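
The parameters above can be illustrated with a small in-memory sketch. This is not the GFS code (which is written in Go); it only shows how a file size maps to chunks and how a two-phase write might succeed once 2 of 3 replicas have prepared. The `Replica` class, its `prepare`/`commit`/`abort` methods, and the abort-on-failure behaviour are assumptions made for illustration, as is reading "64MB" as 64 MiB.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: "64MB" is interpreted as 64 MiB
REPLICATION_FACTOR = 3
WRITE_QUORUM = 2


def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


class Replica:
    """Hypothetical in-memory stand-in for a chunkserver."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.staged = {}     # chunk_id -> data awaiting commit
        self.committed = {}  # chunk_id -> durable data

    def prepare(self, chunk_id: str, data: bytes) -> bool:
        # Phase 1: stage the data; an unhealthy replica votes "no".
        if not self.healthy:
            return False
        self.staged[chunk_id] = data
        return True

    def commit(self, chunk_id: str) -> None:
        # Phase 2: make the staged data visible.
        self.committed[chunk_id] = self.staged.pop(chunk_id)

    def abort(self, chunk_id: str) -> None:
        # Discard staged data if the write did not reach quorum.
        self.staged.pop(chunk_id, None)


def write_chunk(replicas, chunk_id: str, data: bytes) -> bool:
    """Two-phase write that commits only if WRITE_QUORUM replicas prepared."""
    prepared = [r for r in replicas if r.prepare(chunk_id, data)]
    if len(prepared) < WRITE_QUORUM:
        for r in prepared:
            r.abort(chunk_id)
        return False
    for r in prepared:
        r.commit(chunk_id)
    return True
```

With one of three replicas down, `write_chunk` still succeeds; with two down, the single prepared replica is rolled back and the write fails.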

Garbage Collection

GFS implements automatic cleanup of orphaned chunks (chunks on disk not tracked by the master):

  1. Chunk Reporting: Chunkservers report all their chunks during registration and periodic heartbeats
  2. Orphan Detection: Master checks each reported chunk against its metadata
  3. Grace Period: Unknown chunks are tracked for 1 hour before deletion (prevents removing in-flight data)
  4. Scheduled Deletion: After grace period, chunks are added to pendingDeletes
  5. Cleanup: On next heartbeat, master returns pending deletes and chunkserver removes the files

This handles scenarios like:

  • Master restart losing in-memory metadata (WAL recovery may miss recent chunks)
  • Partial writes that never committed
  • Manual file deletions that didn't propagate
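
The garbage collection steps above can be sketched as master-side bookkeeping. This is an illustrative model under stated assumptions, not the actual implementation: `OrphanTracker`, `on_heartbeat`, and the field names other than `pendingDeletes` (rendered here as `pending_deletes`) are hypothetical, and time is passed in as plain seconds to keep the example deterministic.

```python
GRACE_PERIOD_SECONDS = 60 * 60  # 1 hour grace period


class OrphanTracker:
    """Hypothetical master-side tracking of chunks unknown to the metadata."""

    def __init__(self, known_chunks):
        self.known_chunks = set(known_chunks)  # chunks present in master metadata
        self.first_seen = {}                   # (server, chunk) -> first report time
        self.pending_deletes = {}              # server -> set of chunk ids to delete

    def on_heartbeat(self, server, reported_chunks, now):
        """Process one heartbeat; return the chunks this server should delete."""
        # Step 5: hand back deletions scheduled by earlier heartbeats.
        to_delete = self.pending_deletes.pop(server, set())
        reported = set(reported_chunks)

        for chunk in reported:
            if chunk in to_delete:
                continue  # already being deleted in this round
            key = (server, chunk)
            if chunk in self.known_chunks:
                self.first_seen.pop(key, None)  # not an orphan
                continue
            # Steps 2-3: remember when the unknown chunk was first reported.
            started = self.first_seen.setdefault(key, now)
            # Step 4: schedule deletion once the grace period has elapsed.
            if now - started >= GRACE_PERIOD_SECONDS:
                self.pending_deletes.setdefault(server, set()).add(chunk)
                del self.first_seen[key]

        # Forget orphans this server no longer reports.
        stale = [k for k in self.first_seen if k[0] == server and k[1] not in reported]
        for key in stale:
            del self.first_seen[key]
        return to_delete
```

In this model an orphan first reported at time 0 is scheduled at the heartbeat on or after the one-hour mark and is returned to the chunkserver on the heartbeat after that.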

Service Databases

PostgreSQL runs in a primary-replica configuration with streaming replication:

  • Primary: s0 (postgres deployment)
  • Replica: rp1 (postgres-replica deployment, streaming from s0)
  • HAProxy: Runs on s0 (co-located with primary), provides connection pooling and automatic failover

Each service owns its own database for loose coupling:

| Service | Database | Data Stored |
| --- | --- | --- |
| Auth | auth_db | Users, sessions, service accounts |
| SFS | sfs_db | Namespaces, file metadata |
| Compute | compute_db | Containers, SSH keys, ingress rules |
| Notifications | notifications_db | User notifications |
| Registry | registry_db | Repositories, manifests, tags, blobs, upload sessions |
| Gateway | gateway_db | Static routes |

Event-Driven Communication

Services communicate asynchronously via NATS JetStream:

Event Subjects

| Subject Pattern | Description |
| --- | --- |
| auth.user.{id}.created | User created |
| auth.user.{id}.deleted | User deleted |
| auth.user.{id}.updated | User profile updated |
| notify.{user_id} | Push notification to user |
| cluster.metrics | Node CPU, memory, disk metrics |
| cluster.pods | Pod restart count, OOM status |
| log.error.{source} | ERROR+ level logs from services |

NATS JetStream provides durable subscriptions, ensuring events are not lost if a consumer is temporarily offline.
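
Because the subjects above are dot-separated tokens, consumers can subscribe with NATS wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. The sketch below re-implements that matching locally to show which subscriptions would receive which events. It does not use the NATS client library, and `user_event_subject` and `subject_matches` are hypothetical helper names.

```python
def user_event_subject(user_id: str, event: str) -> str:
    """Build an auth.user.{id}.{event} subject, e.g. auth.user.42.created."""
    return f"auth.user.{user_id}.{event}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" must be the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # subject is shorter than the pattern
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without ">", pattern and subject must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)
```

For instance, a consumer interested in all user lifecycle events could use `auth.user.*.*` or `auth.>`, while `log.error.*` would cover error logs from every source.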

See Event-Driven Architecture for details.

Security

Authentication

  • JWT-based authentication for all API requests
  • Tokens issued on login, stored in localStorage
  • Token passed via Authorization: Bearer header or query param for SSE/WebSocket
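
A service could read the token from either location along the following lines. This is a sketch only: `extract_token` is a hypothetical helper, and the query parameter name `token` is an assumption, since this page does not name it.

```python
from typing import Mapping, Optional
from urllib.parse import parse_qs, urlsplit


def extract_token(
    headers: Mapping[str, str], url: str, query_param: str = "token"
) -> Optional[str]:
    """Return the JWT from the Authorization header, else from the query string."""
    # Preferred: "Authorization: Bearer <jwt>".
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    # Fallback for SSE/WebSocket clients that cannot set custom headers.
    # The parameter name is an assumption made for this sketch.
    values = parse_qs(urlsplit(url).query).get(query_param)
    return values[0] if values else None
```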

Service Account Tokens

  • Service accounts support programmatic API access
  • Tokens use hierarchical scopes: service.user_id.resource.id
  • Permissions are pushed to backend services via NATS events, so each service validates tokens locally without a round trip to the auth service
  • Example scopes: compute.uid.containers: ["create", "read", "delete"], storage.uid.files: ["read"]
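
One way a backend service might evaluate these scopes locally is sketched below. It assumes that a scope granted on a parent path (for example `compute.uid.containers`) also covers more specific paths beneath it (for example `compute.uid.containers.c1`); that inheritance rule and the `is_allowed` name are assumptions for illustration, not confirmed behaviour.

```python
def is_allowed(scopes: dict, path: str, action: str) -> bool:
    """Check whether any scope on path or one of its parents grants action."""
    tokens = path.split(".")
    # Walk from the most specific path up toward the service root.
    for end in range(len(tokens), 0, -1):
        prefix = ".".join(tokens[:end])
        if action in scopes.get(prefix, ()):
            return True
    return False


# Scopes taken from the example above.
scopes = {
    "compute.uid.containers": ["create", "read", "delete"],
    "storage.uid.files": ["read"],
}
print(is_allowed(scopes, "compute.uid.containers.c1", "delete"))  # True
print(is_allowed(scopes, "storage.uid.files", "write"))           # False
```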

TLS

  • All external traffic encrypted with TLS 1.2+
  • Certificates managed by cert-manager with Let's Encrypt (Cloudflare DNS-01 challenge)
  • Wildcard certificates cover *.eddisonso.com and *.cloud.eddisonso.com

Secrets Management

  • All secrets are stored as Kubernetes Secrets (never hard-coded as plain-text environment variables in manifests)
  • Secrets are mounted as files or injected via secretKeyRef in pod specs

| Secret | Purpose |
| --- | --- |
| postgres-credentials | PostgreSQL admin and replication passwords |
| compute-db-credentials | Compute service database access |
| auth-db-credentials | Auth service database access |
| sfs-db-credentials | File sharing service database access |
| notification-db-credentials | Notification service database access |
| edd-cloud-auth | JWT_SECRET, default credentials |
| edd-cloud-admin | Admin username (shared across services) |
| service-api-key | Inter-service authentication key |
| gfs-jwt-secret | GFS JWT signing secret |
| discord-webhook-url | Discord webhook for alerting |
| eddisonso-wildcard-tls | Wildcard TLS certificate |
| regcred | Docker registry credentials |

CORS

  • Each service implements CORS middleware
  • Origin header reflected for cross-domain requests
  • Credentials allowed for authenticated requests
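
The origin is reflected rather than answered with a wildcard because browsers reject `Access-Control-Allow-Origin: *` on credentialed requests. A minimal sketch of such middleware logic follows; `cors_headers` is a hypothetical name, and the `Vary: Origin` header is an addition made here (so caches keep per-origin responses apart), not something this page states the services send.

```python
def cors_headers(request_headers: dict) -> dict:
    """Return CORS response headers that reflect the request's Origin."""
    origin = request_headers.get("Origin")
    if not origin:
        return {}  # same-origin or non-browser request: no CORS headers needed
    return {
        "Access-Control-Allow-Origin": origin,       # reflect the caller's origin
        "Access-Control-Allow-Credentials": "true",  # allow cookies/auth headers
        "Vary": "Origin",                            # assumption: added for cache safety
    }
```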

CI/CD

Deployments are automated via GitHub Actions:

  1. Push to main branch triggers a workflow
  2. GitHub Actions builds the Docker image for the changed service
  3. Image is pushed to Docker Hub
  4. Kubernetes deployment is updated via kubectl set image
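
The final step can be sketched as a helper that assembles the `kubectl set image` invocation. The function name, the image name in the usage line, and the idea of passing an explicit namespace are illustrative assumptions; the actual workflow files are not shown on this page.

```python
from typing import List, Optional


def set_image_command(
    deployment: str, container: str, image: str, namespace: Optional[str] = None
) -> List[str]:
    """Build the argument list for `kubectl set image deployment/<name> <container>=<image>`."""
    cmd = ["kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"]
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


# Hypothetical image reference, for illustration only.
print(" ".join(set_image_command("auth-service", "auth-service", "example/auth-service:abc123")))
```

Returning an argument list rather than a shell string lets a caller pass it directly to `subprocess.run` without shell quoting concerns.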

Workflow runs are visible at: https://github.com/EddisonSo/cloud/actions