Cluster Monitor

The Cluster Monitor service provides real-time metrics and health information for the Kubernetes cluster.

Features

  • Node Metrics: CPU, memory, disk usage per node
  • Pod Metrics: Resource usage per pod
  • Health Status: Node conditions and pressure indicators
  • Real-time Streaming: SSE-based metrics updates
  • Event Publishing: Publishes cluster metrics to NATS for alerting-service consumption

Architecture

Authentication and Access Control

Cluster Monitor enforces JWT-based authentication and role-based access on all metrics endpoints:

  • Admin-only endpoints — require a valid JWT with IsAdmin: true (issued by the auth service). Non-admin tokens receive 403. Unauthenticated requests receive 401.
  • Authenticated endpoints (own pods) — require a valid JWT. Results are automatically filtered to the caller's own containers (compute-{userID}-* namespaces). The core system namespace is excluded for non-admins.
  • /healthz — unauthenticated health probe only; no metrics data.

The IsAdmin claim is set by the auth service at login time based on the ADMIN_USERNAME environment variable. Admin tokens must be reissued (re-login) to pick up this claim.
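
The access rules above can be sketched as two small functions. This is an illustration only, not the service's actual code: the decoded claim object shape (`UserID`, `IsAdmin`), the helper names, and the use of a simple prefix match for `compute-{userID}-*` are assumptions.

```javascript
// Hypothetical decoded JWT claims: { UserID: string, IsAdmin: boolean },
// or null when the request carries no valid token.

// Returns the HTTP status the rules above imply for an endpoint.
function authorize(claims, adminOnly) {
  if (!claims) return 401; // unauthenticated
  if (adminOnly && !claims.IsAdmin) return 403; // valid token, not an admin
  return 200;
}

// Admins see every namespace; other callers see only compute-{userID}-*.
// Assumes user IDs cannot themselves be a prefix of another ID plus "-".
function canSeeNamespace(claims, namespace) {
  if (claims.IsAdmin) return true;
  return namespace.startsWith(`compute-${claims.UserID}-`);
}

// Applies the namespace rule to a list of pod metric entries.
function filterPods(claims, pods) {
  return pods.filter((pod) => canSeeNamespace(claims, pod.namespace));
}
```

With claims for user `usr_abc123`, a pod in `compute-usr_abc123-main` is kept while pods in the core system namespace or in another user's namespace are dropped.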

API Endpoints

REST

| Endpoint | Auth | Description |
|---|---|---|
| GET /cluster-info | Admin only | Current node metrics for all cluster nodes (JSON) |
| GET /pod-metrics | JWT required | Pod metrics filtered to the caller's own containers |
| GET /api/metrics/nodes | Admin only | Node metrics (same data as /cluster-info, REST path) |
| GET /api/metrics/pods | JWT required | Pod metrics; namespace query param honored only for the caller's own namespaces |
| GET /api/graph/dependencies | Admin only | Cluster service dependency graph |
| GET /healthz | None | Liveness/readiness health check |
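
A client call to one of the JWT-protected endpoints might look like the sketch below. This page does not say how the token is transmitted, so the `Authorization: Bearer` header is an assumption, as are the helper names and the base URL you pass in. It relies on the global `fetch` and `URL` available in browsers and Node 18+.

```javascript
// Builds the URL and fetch options for GET /api/metrics/pods.
// `namespace` is optional; the server honors it only for the caller's own namespaces.
function buildPodMetricsRequest(baseUrl, token, namespace) {
  const url = new URL("/api/metrics/pods", baseUrl);
  if (namespace) url.searchParams.set("namespace", namespace);
  return {
    url: url.toString(),
    // Assumption: the JWT travels as a standard Bearer token.
    options: { headers: { Authorization: `Bearer ${token}` } },
  };
}

// Performs the request and maps the documented status codes to errors.
async function fetchPodMetrics(baseUrl, token, namespace) {
  const { url, options } = buildPodMetricsRequest(baseUrl, token, namespace);
  const res = await fetch(url, options);
  if (res.status === 401) throw new Error("missing or invalid JWT");
  if (res.status === 403) throw new Error("token is not allowed to use this endpoint");
  if (!res.ok) throw new Error(`unexpected status ${res.status}`);
  return res.json();
}
```

For example, `fetchPodMetrics("http://cluster-monitor:8080", token, "compute-usr_abc123-main")` would request that one namespace; the host name here is hypothetical.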

SSE (Real-time)

| Endpoint | Auth | Description |
|---|---|---|
| GET /sse/health | JWT required | Combined cluster + pod metrics stream (pods filtered to caller's own) |
| GET /sse/cluster-info | Admin only | Node metrics stream for all cluster nodes |
| GET /sse/pod-metrics | JWT required | Pod metrics stream filtered to the caller's own containers |

WebSocket (Legacy)

| Endpoint | Auth | Description |
|---|---|---|
| WS /ws/cluster-info | Admin only | Node metrics WebSocket (all nodes) |
| WS /ws/pod-metrics | JWT required | Pod metrics WebSocket filtered to the caller's own containers |

Metrics

Node Metrics

{
  "timestamp": "2024-01-19T12:34:56Z",
  "nodes": [
    {
      "name": "s0",
      "cpu_usage": "500m",
      "cpu_capacity": "4",
      "cpu_percent": 12.5,
      "memory_usage": "2Gi",
      "memory_capacity": "8Gi",
      "memory_percent": 25.0,
      "disk_usage": 10737418240,
      "disk_capacity": 107374182400,
      "disk_percent": 10.0,
      "conditions": [
        {"type": "MemoryPressure", "status": "False"},
        {"type": "DiskPressure", "status": "False"}
      ]
    }
  ]
}
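
The CPU and memory fields in this sample use Kubernetes-style quantity strings ("500m" is 500 millicores, "Gi" is a binary gibibyte suffix). The sketch below shows how a client could turn them into numbers and reproduce the percentages in the sample. The helper names are hypothetical, and the parser handles only the suffixes listed, not the full quantity grammar.

```javascript
const BINARY_SUFFIXES = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30, Ti: 2 ** 40 };

// CPU quantity -> cores: "500m" is 0.5 cores, "4" is 4 cores.
function parseCpu(quantity) {
  return quantity.endsWith("m")
    ? Number(quantity.slice(0, -1)) / 1000
    : Number(quantity);
}

// Memory quantity -> bytes: "2Gi" is 2 * 2^30 bytes; a bare number is bytes.
function parseMemory(quantity) {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?$/.exec(quantity);
  if (!match) throw new Error(`unsupported quantity: ${quantity}`);
  return Number(match[1]) * (match[2] ? BINARY_SUFFIXES[match[2]] : 1);
}

// Usage as a percentage of capacity.
function percent(used, capacity) {
  return (used * 100) / capacity;
}

const node = {
  cpu_usage: "500m", cpu_capacity: "4",
  memory_usage: "2Gi", memory_capacity: "8Gi",
  disk_usage: 10737418240, disk_capacity: 107374182400,
};

console.log(percent(parseCpu(node.cpu_usage), parseCpu(node.cpu_capacity))); // 12.5
console.log(percent(parseMemory(node.memory_usage), parseMemory(node.memory_capacity))); // 25
console.log(percent(node.disk_usage, node.disk_capacity)); // 10
```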

Pod Metrics

Non-admin callers receive only pods in their own compute-{userID}-* namespace(s). The core system namespace is visible to admins only.

{
  "timestamp": "2024-01-19T12:34:56Z",
  "pods": [
    {
      "name": "myapp-abc123",
      "namespace": "compute-usr_abc123-main",
      "node": "rp2",
      "cpu_usage": 50000000,
      "cpu_capacity": 4000000000,
      "memory_usage": 67108864,
      "memory_capacity": 8589934592,
      "disk_usage": 1048576,
      "disk_capacity": 107374182400
    }
  ]
}
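
Unlike the node payload, pod values are plain integers. The units are not stated on this page; the sketch below assumes CPU is in nanocores (so 4000000000 is 4 cores) and memory and disk are in bytes, which is consistent with the sample but should be confirmed against the service. `summarizePod` is a hypothetical helper.

```javascript
// Converts one pod entry into human-friendly numbers.
// Assumption: cpu_* are nanocores, memory_* and disk_* are bytes.
function summarizePod(pod) {
  return {
    name: pod.name,
    cpuMillicores: pod.cpu_usage / 1e6,
    cpuPercent: (pod.cpu_usage * 100) / pod.cpu_capacity,
    memoryMiB: pod.memory_usage / 2 ** 20,
    memoryPercent: (pod.memory_usage * 100) / pod.memory_capacity,
  };
}

const summary = summarizePod({
  name: "myapp-abc123",
  cpu_usage: 50000000,
  cpu_capacity: 4000000000,
  memory_usage: 67108864,
  memory_capacity: 8589934592,
});

console.log(summary.cpuMillicores); // 50
console.log(summary.cpuPercent); // 1.25
console.log(summary.memoryMiB); // 64
console.log(summary.memoryPercent); // 0.78125
```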

Combined Health Stream

The /sse/health endpoint carries both metric types on a single stream, so a client needs only one connection. Each message has a type ("cluster" or "pods") and a payload:

const eventSource = new EventSource('/sse/health');

eventSource.onmessage = (event) => {
  const { type, payload } = JSON.parse(event.data);

  if (type === 'cluster') {
    // Node metrics
    updateNodes(payload.nodes);
  } else if (type === 'pods') {
    // Pod metrics
    updatePods(payload.pods);
  }
};
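
A slightly more defensive variant of the handler is sketched below: it tolerates malformed JSON and unknown message types instead of throwing inside `onmessage`. The function names and the `handlers` object are hypothetical; the `{ type, payload }` envelope is taken from the example above.

```javascript
// Routes one SSE data string to the matching handler.
// Returns true if a handler ran, false if the message was ignored.
function dispatchHealthEvent(data, handlers) {
  let message;
  try {
    message = JSON.parse(data);
  } catch {
    return false; // malformed JSON: skip this message
  }
  const type = message && message.type;
  if (typeof type !== "string" || !Object.hasOwn(handlers, type)) {
    return false; // unknown or missing type
  }
  handlers[type](message.payload);
  return true;
}

// Wiring for a browser page; not invoked here.
function connectHealthStream(handlers) {
  const source = new EventSource("/sse/health");
  source.onmessage = (event) => dispatchHealthEvent(event.data, handlers);
  return source;
}
```

A caller would pass something like `{ cluster: (p) => updateNodes(p.nodes), pods: (p) => updatePods(p.pods) }` as `handlers`.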

Refresh Interval

Metrics are fetched from the Kubernetes API every 5 seconds by default. The interval is configurable with the -refresh flag (see Configuration).

Event Publishing

Cluster Monitor publishes metrics to NATS JetStream for consumption by the alerting-service:

Published Subjects

| Subject | Description | Publish Frequency |
|---|---|---|
| cluster.metrics | Node CPU, memory, disk, conditions | Every 5s |
| cluster.pods | Pod restart count, OOM status | Every 5s |

Events are serialized as Protocol Buffers (see proto/cluster/events.proto).

NATS Stream Configuration

Stream: CLUSTER
Subjects: cluster.>
Retention: LimitsPolicy
MaxMsgs: 1,000,000
MaxBytes: 1 GB
MaxAge: 7 days
Storage: FileStorage

Cluster Monitor creates the CLUSTER stream on startup if it doesn't exist. Publishing is non-blocking, so metrics continue to be served even if NATS is unavailable.
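
To get a feel for which stream limit applies first, the arithmetic below estimates how many messages accumulate within MaxAge. It assumes one message per subject per refresh, which this page does not confirm; if the service publishes one message per node or per pod, raise `messagesPerPublish` accordingly.

```javascript
const MAX_MSGS = 1_000_000;

// Messages published within the retention window.
function messagesWithinMaxAge(subjects, refreshSeconds, maxAgeDays, messagesPerPublish = 1) {
  const publishesPerDay = (24 * 60 * 60) / refreshSeconds;
  return subjects * messagesPerPublish * publishesPerDay * maxAgeDays;
}

// Two subjects, 5s refresh, 7-day MaxAge, one message per publish (assumed).
const retained = messagesWithinMaxAge(2, 5, 7);
console.log(retained); // 241920
console.log(retained < MAX_MSGS); // true
```

Under that assumption the 7-day MaxAge (or MaxBytes, depending on message size) is reached before MaxMsgs. At five messages per publish the same window would hold 1,209,600 messages, and MaxMsgs would become the limit that trims the stream.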

Configuration

| Flag | Description | Default |
|---|---|---|
| -addr | Listen address | :8080 |
| -refresh | Metrics refresh interval | 5s |
| -api-server | Kubernetes API server address | - |
| -log-service | Log service gRPC address for structured logging | - |
| -log-source | Log source name (pod name) | cluster-monitor |
| -nats | NATS server URL | nats://nats:4222 |