Cluster Monitor

The Cluster Monitor service provides real-time metrics and health information for the Kubernetes cluster.

Features

  • Node Metrics: CPU, memory, disk usage per node
  • Pod Metrics: Resource usage per pod
  • Health Status: Node conditions and pressure indicators
  • Real-time Streaming: SSE-based metrics updates
  • Event Publishing: Publishes cluster metrics to NATS for alerting-service consumption

Architecture

API Endpoints

REST

Endpoint             Description
GET /cluster-info    Current node metrics (JSON)
GET /pod-metrics     Current pod metrics (JSON)
GET /healthz         Health check
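
As a rough illustration of calling the REST endpoints above, the sketch below fetches one snapshot and parses the JSON body. The helper name getJson, the base URL, and the injectable fetchImpl parameter are assumptions made for this example; they are not part of the service.

```javascript
// Hypothetical helper: fetch one JSON snapshot from a Cluster Monitor endpoint.
// baseUrl is an assumption (wherever the -addr listener is reachable).
// fetchImpl defaults to the global fetch (browsers, Node 18+) and can be
// replaced with a stub for testing.
async function getJson(baseUrl, path, fetchImpl = fetch) {
  const url = new URL(path, baseUrl).toString();
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`GET ${url} failed with status ${response.status}`);
  }
  return response.json();
}
```

A call such as getJson('http://cluster-monitor:8080', '/cluster-info') would then resolve to a node metrics object, assuming the service is reachable under that (hypothetical) hostname.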

SSE (Real-time)

Endpoint                 Description
GET /sse/health          Combined cluster + pod metrics stream
GET /sse/cluster-info    Node metrics stream
GET /sse/pod-metrics     Pod metrics stream

WebSocket (Legacy)

Endpoint               Description
WS /ws/cluster-info    Node metrics WebSocket
WS /ws/pod-metrics     Pod metrics WebSocket

Metrics

Node Metrics

{
  "timestamp": "2024-01-19T12:34:56Z",
  "nodes": [
    {
      "name": "s0",
      "cpu_usage": "500m",
      "cpu_capacity": "4",
      "cpu_percent": 12.5,
      "memory_usage": "2Gi",
      "memory_capacity": "8Gi",
      "memory_percent": 25.0,
      "disk_usage": 10737418240,
      "disk_capacity": 107374182400,
      "disk_percent": 10.0,
      "conditions": [
        {"type": "MemoryPressure", "status": "False"},
        {"type": "DiskPressure", "status": "False"}
      ]
    }
  ]
}
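
The cpu_usage and memory_usage strings in this payload look like Kubernetes resource quantities. The sketch below converts them to plain numbers and lists any pressure condition that is currently raised. It assumes that notation ("m" for millicores; Ki, Mi, Gi, Ti as binary suffixes) and handles only those suffixes; the function names are hypothetical.

```javascript
// Binary suffixes used by Kubernetes memory quantities (assumed notation).
const BINARY_UNITS = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30, Ti: 2 ** 40 };

// "500m" -> 0.5 cores, "4" -> 4 cores.
function parseCpu(quantity) {
  return quantity.endsWith('m')
    ? Number(quantity.slice(0, -1)) / 1000
    : Number(quantity);
}

// "2Gi" -> 2147483648 bytes; a bare number is taken as bytes.
function parseMemory(quantity) {
  const suffix = quantity.slice(-2);
  return Object.prototype.hasOwnProperty.call(BINARY_UNITS, suffix)
    ? Number(quantity.slice(0, -2)) * BINARY_UNITS[suffix]
    : Number(quantity);
}

// Names of "...Pressure" conditions whose status is "True".
function pressureConditions(node) {
  return node.conditions
    .filter((c) => c.type.endsWith('Pressure') && c.status === 'True')
    .map((c) => c.type);
}
```

For the sample node, parseCpu('500m') / parseCpu('4') * 100 gives 12.5, which agrees with the reported cpu_percent, and pressureConditions returns an empty array.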

Pod Metrics

{
  "timestamp": "2024-01-19T12:34:56Z",
  "pods": [
    {
      "name": "gateway-abc123",
      "namespace": "core",
      "node": "s0",
      "cpu_usage": 50000000,
      "cpu_capacity": 4000000000,
      "memory_usage": 67108864,
      "memory_capacity": 8589934592,
      "disk_usage": 1048576,
      "disk_capacity": 107374182400
    }
  ]
}
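
Unlike the node payload, the pod payload carries raw integers and no precomputed percentages. The sample values suggest nanocores for CPU and bytes for memory and disk, but that is an inference from the numbers rather than something stated here. The sketch below only needs usage and capacity to share a unit; podUtilization and percentOf are hypothetical names.

```javascript
// Percentage of capacity used, or null when capacity is missing or zero.
function percentOf(usage, capacity) {
  return capacity > 0 ? (usage * 100) / capacity : null;
}

// Derives per-resource utilization for one entry of the "pods" array.
function podUtilization(pod) {
  return {
    name: `${pod.namespace}/${pod.name}`,
    cpuPercent: percentOf(pod.cpu_usage, pod.cpu_capacity),
    memoryPercent: percentOf(pod.memory_usage, pod.memory_capacity),
    diskPercent: percentOf(pod.disk_usage, pod.disk_capacity),
  };
}
```

With the sample pod above, this yields a cpuPercent of 1.25 and a memoryPercent of 0.78125.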

Combined Health Stream

The /sse/health endpoint combines both metrics types to reduce connections:

// updateNodes and updatePods are application-defined callbacks.
const eventSource = new EventSource('/sse/health');

eventSource.onmessage = (event) => {
  const { type, payload } = JSON.parse(event.data);

  if (type === 'cluster') {
    // Node metrics
    updateNodes(payload.nodes);
  } else if (type === 'pods') {
    // Pod metrics
    updatePods(payload.pods);
  }
};
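
The handler above can also be factored into a standalone function, so that a malformed frame or an unknown type is ignored instead of throwing inside onmessage. This is a sketch: the { type, payload } envelope comes from the example above, while createHealthDispatcher and the shape of its handlers argument are assumptions.

```javascript
// Builds a handler for /sse/health frames.
// handlers maps an envelope type ("cluster", "pods") to a callback.
// The returned function reports whether the frame was dispatched.
function createHealthDispatcher(handlers) {
  return function dispatch(data) {
    let envelope;
    try {
      envelope = JSON.parse(data);
    } catch (err) {
      return false; // not valid JSON: ignore the frame
    }
    if (envelope === null || typeof envelope !== 'object') return false;
    const known = Object.prototype.hasOwnProperty.call(handlers, envelope.type);
    if (!known || typeof handlers[envelope.type] !== 'function') return false;
    handlers[envelope.type](envelope.payload);
    return true;
  };
}

// Browser usage (updateNodes/updatePods are application-defined):
//   const dispatch = createHealthDispatcher({
//     cluster: (payload) => updateNodes(payload.nodes),
//     pods: (payload) => updatePods(payload.pods),
//   });
//   eventSource.onmessage = (event) => dispatch(event.data);
```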

Refresh Interval

Metrics are fetched from the Kubernetes API every 5 seconds by default; the interval can be changed with the -refresh flag.

Event Publishing

Cluster Monitor publishes metrics to NATS JetStream for consumption by the alerting-service.

Published Subjects

Subject            Description                           Publish Frequency
cluster.metrics    Node CPU, memory, disk, conditions    Every 5s
cluster.pods       Pod restart count, OOM status         Every 5s

Events are serialized as Protocol Buffers (see proto/cluster/events.proto).

NATS Stream Configuration

Stream: CLUSTER
Subjects: cluster.>
Retention: LimitsPolicy
MaxMsgs: 1,000,000
MaxBytes: 1 GB
MaxAge: 7 days
Storage: FileStorage

Cluster Monitor creates the CLUSTER stream on startup if it doesn't exist. Publishing is non-blocking, so the service keeps serving metrics even if NATS is unavailable.
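
The CLUSTER stream captures every subject that matches cluster.>. In NATS subject syntax, * matches exactly one token and > matches one or more trailing tokens. The matcher below is a small illustration of those rules, not code taken from Cluster Monitor; subjectMatches is a hypothetical name.

```javascript
// Returns true when a dot-separated NATS subject matches a pattern that may
// contain "*" (exactly one token) or a trailing ">" (one or more tokens).
function subjectMatches(pattern, subject) {
  const patternTokens = pattern.split('.');
  const subjectTokens = subject.split('.');
  for (let i = 0; i < patternTokens.length; i += 1) {
    const token = patternTokens[i];
    if (token === '>') {
      // ">" must be the last token and needs at least one subject token left.
      return i === patternTokens.length - 1 && subjectTokens.length > i;
    }
    if (i >= subjectTokens.length) return false;
    if (token !== '*' && token !== subjectTokens[i]) return false;
  }
  return patternTokens.length === subjectTokens.length;
}
```

Under these rules both published subjects, cluster.metrics and cluster.pods, fall inside the stream, while the bare subject cluster does not.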

Configuration

Flag            Description                                        Default
-addr           Listen address                                     :8080
-refresh        Metrics refresh interval                           5s
-api-server     Kubernetes API server address                      -
-log-service    Log service gRPC address for structured logging    -
-log-source     Log source name (pod name)                         cluster-monitor
-nats           NATS server URL                                    nats://nats:4222