Roadmap
Gateway Improvements
- IP-based filtering - Allow/block traffic based on IP address or CIDR range
- Radix tree routing - Replace linear route matching with radix tree for O(log n) lookups
- L4 load balancer pre-ingress - Add TCP/UDP load balancer layer for distributed gateway deployment
- Connection pooling - Reuse backend connections to reduce latency
- HTTP/2 support - Enable gRPC and improved multiplexing
- Distributed gateway - Run multiple gateway replicas across backend nodes with MetalLB load balancing. Removes gateway dependency on s0
- Migrate to C++ - Rewrite gateway in C++ for deterministic memory deallocation; Go's GC does not reclaim connection buffers fast enough under high connection churn, causing heap growth even with bounded allocations
Distributed Gateway Architecture (Future)
Storage Improvements
- Chunk garbage collection - Clean up orphaned chunks (1-hour grace period, via heartbeat)
- Namespace visibility levels - Private (owner only), visible (discoverable), public (read-only access)
- Chunk corruption recovery - Detect corrupted chunks via checksums and re-replicate from healthy replicas
- Tiered storage - Hot/cold data separation
- Storage load balancing - Distribute chunks based on capacity, I/O load, memory, and CPU heuristics
- Programmatic API - Upload and download files via REST API for external integrations
- Chunkserver advertise-host flag - Separate bind address from advertise address to enable Kubernetes deployments without hostNetwork
- Distributed GFS master - Run multiple GFS master replicas with leader election and WAL replication for metadata high availability. Eliminates the single point of failure on s0
Compute Improvements
- Container access control - Users can only access their own containers (SSH, logs, management)
- Container resource limits - CPU/memory quotas per user
- Container networking - Private networks between user containers
- Persistent volumes - User-attached storage volumes
- True VMs - Full virtual machines via Type 1 hypervisor (KVM) for stronger isolation
- Multi-architecture support - Provision compute on different architectures (amd64, arm64)
Infrastructure
- Migrate control plane to s1/s2/s3 - HA control plane with embedded etcd on amd64 nodes
- Multi-master HA - 3-node control plane for high availability
High Availability / Disaster Recovery
Added after 2026-01-25 s0 node failure incident
- GFS master WAL replication - Replicate chunk metadata to standby masters for redundancy; current single-master design means metadata loss on node failure
- GFS metadata in distributed key-value store - Store file/chunk mappings in etcd or FoundationDB instead of local storage for automatic replication and HA
- Automatic database failover - Implement Patroni or similar for automatic PostgreSQL primary promotion
- Service health monitoring - External uptime checks for critical endpoints
- Pod anti-affinity rules - Spread critical service replicas across nodes
- CI/CD deployment failure alerts - Remove silent
|| truefrom kubectl commands
Monitoring
- Distributed tracing - Request tracing across services
- Alerting - Automated alerts for service health (standalone alerting-service, NATS + protobuf, Discord webhooks)
- Log aggregation - Searchable log storage
- Distributed logging aggregation - Replicate log entries across log service replicas so all instances share the same ring buffers and subscribers; enables horizontal scaling without splitting logs across pods
- Delta updates for SSE - Send only changes instead of full state to reduce bandwidth
- Graph-based observability platform - Visualize service dependencies and time-series metrics with graph-based relationships
Auth
- Token-based RBAC - Users can create per-user tokens scoped to specific resource permissions (e.g.
compute.uid.containers: ["create", "read", "delete"],storage.uid.files: ["read"]) for programmatic API access; tokens are tied to the creating user via service accounts - Distributed identity permission store - Auth service pushes service account permissions to compute/storage services via NATS events; services maintain local permission stores for zero-latency token validation without synchronous auth service dependency
- OAuth2 / OpenID Connect - External identity provider integration
Frontend
- Mobile UI - Responsive layout and usability on mobile devices
Future Services
- Message Queue - Pub/sub messaging for async communication between services
- Datastore - NoSQL database for flexible document/key-value storage
- Image Artifactory - Container image registry for storing and distributing user container images