# Alerting Service
The Alerting Service monitors cluster health and service logs, sending Discord notifications for critical conditions.
## Features
- **Cluster Monitoring**: Evaluates node CPU, memory, disk, and pod health
- **Log Monitoring**: Detects critical keywords and error bursts
- **Cooldown Management**: Prevents alert spam with configurable cooldowns
- **Event-Driven**: Consumes NATS events from cluster-monitor and log-service
- **Discord Webhooks**: Sends rich embedded alerts to Discord channels
## Architecture

The service consumes Protocol Buffer events from cluster-monitor and log-service over NATS, evaluates alert rules against them, and posts matching alerts to Discord via webhook.
## Alert Rules
### Cluster Alerts
| Alert Type | Trigger Condition | Severity | Cooldown |
|---|---|---|---|
| High CPU | Node CPU > 90% for 2 consecutive checks | Critical | 5 minutes |
| High Memory | Node memory > 85% | Warning | 5 minutes |
| High Disk | Node disk > 90% | Warning | 15 minutes |
| Node Condition | Node has MemoryPressure/DiskPressure | Critical | 5 minutes |
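The two-consecutive-checks condition means a single CPU spike never fires an alert on its own. A minimal sketch of how such a rule can be evaluated (the type and names here are illustrative, not the actual implementation):

```go
// cpuRule fires only when a node exceeds the threshold on two
// back-to-back metric evaluations.
type cpuRule struct {
	threshold float64        // e.g. 90.0 percent
	breaches  map[string]int // node name -> consecutive breach count
}

func newCPURule(threshold float64) *cpuRule {
	return &cpuRule{threshold: threshold, breaches: make(map[string]int)}
}

// Evaluate records one metrics check and reports whether to alert.
func (r *cpuRule) Evaluate(node string, cpuPercent float64) bool {
	if cpuPercent <= r.threshold {
		r.breaches[node] = 0 // a healthy reading resets the streak
		return false
	}
	r.breaches[node]++
	return r.breaches[node] >= 2 // alert from the second consecutive breach
}
```

Whether an alert actually reaches Discord is then additionally subject to the cooldowns described below.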
### Pod Alerts
| Alert Type | Trigger Condition | Severity | Cooldown |
|---|---|---|---|
| OOMKilled | Container terminated with OOMKilled | Critical | Per-event (tracks restart count) |
| Pod Restart | Pod restart count increased | Warning | Per-event (tracks restart count) |
### Log Alerts
| Alert Type | Trigger Condition | Severity | Cooldown |
|---|---|---|---|
| Critical Log | Logs contain "panic", "fatal", "crash" | Critical | 5 minutes |
| Error Burst | 5+ errors in 30s window | Warning | 5 minutes |
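Error-burst detection is a sliding-window count. A sketch of one way to implement it, assuming errors are tracked per service (names hypothetical):

```go
import "time"

// errorBurstDetector counts recent ERROR-level log events per service and
// flags a burst when the count inside the window reaches the threshold.
type errorBurstDetector struct {
	window    time.Duration          // 30 * time.Second
	threshold int                    // 5
	recent    map[string][]time.Time // service -> recent error timestamps
}

func newErrorBurstDetector() *errorBurstDetector {
	return &errorBurstDetector{
		window:    30 * time.Second,
		threshold: 5,
		recent:    make(map[string][]time.Time),
	}
}

// Observe records one error and reports whether a burst is in progress.
func (d *errorBurstDetector) Observe(service string, ts time.Time) bool {
	cutoff := ts.Add(-d.window)
	kept := d.recent[service][:0]
	for _, t := range d.recent[service] {
		if t.After(cutoff) { // drop timestamps older than the window
			kept = append(kept, t)
		}
	}
	kept = append(kept, ts)
	d.recent[service] = kept
	return len(kept) >= d.threshold
}
```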
## NATS Subscriptions
The service subscribes to:
| Subject Pattern | Description |
|---|---|
| `cluster.metrics` | Node CPU, memory, disk metrics |
| `cluster.pods` | Pod restart count, OOM status |
| `log.error.>` | ERROR+ level logs from all services |
All events are Protocol Buffer encoded (see `proto/cluster/events.proto` and `proto/log/events.proto`).
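A sketch of the subscription side using the nats.go client; `clusterpb.NodeMetrics` and `evaluateNodeRules` stand in for the generated proto type and the rule-evaluation entry point, which this README does not spell out:

```go
import (
	"log"

	"github.com/nats-io/nats.go"
	"google.golang.org/protobuf/proto"
)

func subscribeMetrics(nc *nats.Conn) error {
	// Each cluster.metrics message is a protobuf-encoded node snapshot.
	_, err := nc.Subscribe("cluster.metrics", func(m *nats.Msg) {
		var ev clusterpb.NodeMetrics // hypothetical generated type
		if err := proto.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("decode cluster.metrics: %v", err)
			return
		}
		evaluateNodeRules(&ev) // hypothetical rule-evaluation hook
	})
	return err
}
```

The `cluster.pods` and `log.error.>` subjects would be handled the same way with their respective message types.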
## Discord Webhook Setup
Configure the Discord webhook URL as a Kubernetes secret:
```sh
kubectl create secret generic discord-webhook-url \
  --from-literal=WEBHOOK_URL='https://discord.com/api/webhooks/...'
```
The alerting-service Deployment injects this secret into the container environment via `secretKeyRef`.
## Alert Message Format
Discord alerts are sent as rich embeds:
```json
{
  "embeds": [{
    "title": "High CPU: s0",
    "description": "Node s0 CPU at 95.2% (threshold: 90%)",
    "color": 16711680,
    "timestamp": "2026-02-14T12:34:56Z"
  }]
}
```
### Severity Colors
| Severity | Color | Hex |
|---|---|---|
| Critical | Red | 0xFF0000 |
| Warning | Orange | 0xFFA500 |
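Discord expects the embed `color` as a decimal integer, which is why Critical's 0xFF0000 appears as 16711680 in the payload above. A sketch of sending such an embed with `net/http` (the payload shape matches the example; the function itself is illustrative):

```go
import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Severity colors from the table above, as decimal-valued ints.
var severityColors = map[string]int{
	"critical": 0xFF0000, // red
	"warning":  0xFFA500, // orange
}

type embed struct {
	Title       string `json:"title"`
	Description string `json:"description"`
	Color       int    `json:"color"`
	Timestamp   string `json:"timestamp"`
}

func sendAlert(webhookURL, severity, title, description string) error {
	payload, err := json.Marshal(map[string][]embed{"embeds": {{
		Title:       title,
		Description: description,
		Color:       severityColors[severity],
		Timestamp:   time.Now().UTC().Format(time.RFC3339),
	}}})
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("discord webhook returned %s", resp.Status)
	}
	return nil
}
```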
## Cooldown Mechanism
The cooldown tracker prevents duplicate alerts within a time window:
1. An alert fires with a key, e.g. `cpu:s0`
2. The cooldown tracker records `lastFired[cpu:s0] = now`
3. Subsequent alerts with the same key are blocked until the cooldown expires
4. After the cooldown (default 5 minutes), alerts are allowed again
Different alert types have different cooldowns:
- Disk alerts: 15 minutes (disk usage changes slowly)
- All other alerts: 5 minutes
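A minimal sketch of such a tracker; the actual field and method names in the service may differ:

```go
import (
	"sync"
	"time"
)

// cooldownTracker suppresses repeat alerts for the same key until the
// per-alert-type cooldown has elapsed.
type cooldownTracker struct {
	mu        sync.Mutex
	lastFired map[string]time.Time
}

func newCooldownTracker() *cooldownTracker {
	return &cooldownTracker{lastFired: make(map[string]time.Time)}
}

// Allow reports whether an alert with this key (e.g. "cpu:s0") may fire,
// recording the firing time when it does.
func (c *cooldownTracker) Allow(key string, cooldown time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastFired[key]; ok && time.Since(last) < cooldown {
		return false // still cooling down
	}
	c.lastFired[key] = time.Now()
	return true
}
```

A disk alert would then call `Allow("disk:s0", 15*time.Minute)`, while other alert types pass the 5-minute default.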
## Restart Count Deduplication
Both OOMKilled and Pod Restart alerts use restart count tracking instead of time-based cooldown:
- When a restart is detected, record the pod's restart count
- Subsequent snapshots with the same or lower restart count are ignored (same event)
- Only fire again when restart count increases (new restart event)
This prevents duplicate alerts when:
- The same OOM event persists in the pod's `LastTerminationState` indefinitely
- Multiple cluster-monitor replicas publish NATS events with interleaved timestamps, causing older snapshots (with lower restart counts) to arrive after newer ones
The stored restart count is only updated upward, never downward, preventing the baseline from resetting.
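Sketched out, the monotonic baseline amounts to a few lines (names illustrative):

```go
// restartTracker deduplicates OOMKilled and restart alerts by restart
// count rather than by time.
type restartTracker struct {
	lastCount map[string]int32 // "namespace/pod" -> highest restart count seen
}

func newRestartTracker() *restartTracker {
	return &restartTracker{lastCount: make(map[string]int32)}
}

// ShouldAlert returns true only when the observed restart count exceeds
// the highest count recorded for this pod; the baseline never decreases.
func (r *restartTracker) ShouldAlert(pod string, restarts int32) bool {
	if restarts <= r.lastCount[pod] {
		return false // same or stale snapshot; ignore
	}
	r.lastCount[pod] = restarts
	return true
}
```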
## Configuration
| Flag | Description | Default |
|---|---|---|
| `-nats` | NATS server URL | `nats://nats:4222` |
| `-discord-webhook` | Discord webhook URL for alerts | - |
| `-alert-cooldown` | Default alert cooldown duration | `5m` |
| `-log-service-grpc` | Log service gRPC address for structured logging | - |
| `-log-source` | Log source name (pod name) | `alerting-service` |
Alert thresholds are hardcoded: CPU 90%, memory 85%, disk 90%, error burst 5 errors in 30s.
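Given the table above, the flags could be declared with Go's standard `flag` package roughly as follows (a sketch, not the verbatim source):

```go
import (
	"flag"
	"time"
)

var (
	natsURL        = flag.String("nats", "nats://nats:4222", "NATS server URL")
	discordWebhook = flag.String("discord-webhook", "", "Discord webhook URL for alerts")
	alertCooldown  = flag.Duration("alert-cooldown", 5*time.Minute, "default alert cooldown duration")
	logServiceGRPC = flag.String("log-service-grpc", "", "log service gRPC address")
	logSource      = flag.String("log-source", "alerting-service", "log source name (pod name)")
)
```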
## Deployment
The service runs as a single-replica deployment on backend nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alerting-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alerting-service
  template:
    metadata:
      labels:
        app: alerting-service
    spec:
      nodeSelector:
        backend: "true"
      containers:
        - name: alerting-service
          image: docker.io/eddisonso/alerting-service:latest
          args:
            - -nats
            - nats://nats:4222
            - -discord-webhook
            - $(DISCORD_WEBHOOK_URL)
          env:
            - name: DISCORD_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: discord-webhook-url
                  key: WEBHOOK_URL
```
## Separation of Concerns
Alerting was extracted from cluster-monitor to improve modularity:
| Service | Responsibility |
|---|---|
| cluster-monitor | Collect and serve metrics |
| log-service | Collect and serve logs |
| alerting-service | Evaluate rules and send alerts |
This allows:
- Independent scaling (alerting doesn't affect metric collection)
- Cleaner separation of data collection vs. notification logic
- Easier testing of alert rules