Alerting Service

The Alerting Service monitors cluster health and service logs, sending Discord notifications when alert conditions are met.

Features

  • Cluster Monitoring: Evaluates node CPU, memory, disk, and pod health
  • Log Monitoring: Detects critical keywords and error bursts
  • Cooldown Management: Prevents alert spam with configurable cooldowns
  • Event-Driven: Consumes NATS events from cluster-monitor and log-service
  • Discord Webhooks: Sends rich embedded alerts to Discord channels

Architecture

The service consumes Protocol Buffer events from cluster-monitor and log-service over NATS, evaluates alert rules against them, and delivers any resulting alerts to Discord via webhook.

Alert Rules

Cluster Alerts

| Alert Type     | Trigger Condition                       | Severity | Cooldown   |
|----------------|-----------------------------------------|----------|------------|
| High CPU       | Node CPU > 90% for 2 consecutive checks | Critical | 5 minutes  |
| High Memory    | Node memory > 85%                       | Warning  | 5 minutes  |
| High Disk      | Node disk > 90%                         | Warning  | 15 minutes |
| Node Condition | Node has MemoryPressure/DiskPressure    | Critical | 5 minutes  |
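
The "2 consecutive checks" condition for High CPU means a single spike never alerts on its own. A minimal sketch of how such a rule could track consecutive breaches; the names and structure are illustrative, not taken from the actual implementation:

```go
package rules

// cpuRule tracks how many consecutive snapshots a node has exceeded
// the CPU threshold; the alert fires only from the second breach on.
type cpuRule struct {
	threshold float64        // e.g. 90.0
	breaches  map[string]int // node name -> consecutive breach count
}

func newCPURule(threshold float64) *cpuRule {
	return &cpuRule{threshold: threshold, breaches: make(map[string]int)}
}

// Evaluate returns true when the node should alert: CPU above the
// threshold for two consecutive checks.
func (r *cpuRule) Evaluate(node string, cpuPercent float64) bool {
	if cpuPercent <= r.threshold {
		r.breaches[node] = 0 // reset on any healthy reading
		return false
	}
	r.breaches[node]++
	return r.breaches[node] >= 2
}
```

Resetting the counter on any healthy reading is what makes the condition "consecutive" rather than cumulative.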

Pod Alerts

| Alert Type  | Trigger Condition                   | Severity | Cooldown                         |
|-------------|-------------------------------------|----------|----------------------------------|
| OOMKilled   | Container terminated with OOMKilled | Critical | Per-event (tracks restart count) |
| Pod Restart | Pod restart count increased         | Warning  | Per-event (tracks restart count) |

Log Alerts

| Alert Type   | Trigger Condition                      | Severity | Cooldown  |
|--------------|----------------------------------------|----------|-----------|
| Critical Log | Logs contain "panic", "fatal", "crash" | Critical | 5 minutes |
| Error Burst  | 5+ errors in 30s window                | Warning  | 5 minutes |
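
The error-burst rule is a sliding-window count. One way this could be implemented, assuming a 30-second window and a threshold of five errors (a sketch, not the service's actual code):

```go
package rules

import "time"

// errorBurst keeps timestamps of recent errors and reports a burst
// when at least `limit` errors fall inside the window.
// Example: burst := &errorBurst{window: 30 * time.Second, limit: 5}
type errorBurst struct {
	window time.Duration
	limit  int
	events []time.Time
}

// Record adds one error event and returns true if the burst
// condition (limit errors within window) is now met.
func (b *errorBurst) Record(now time.Time) bool {
	// Drop events that have slid out of the window.
	cutoff := now.Add(-b.window)
	kept := b.events[:0]
	for _, t := range b.events {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	b.events = append(kept, now)
	return len(b.events) >= b.limit
}
```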

NATS Subscriptions

The service subscribes to:

| Subject Pattern | Description                         |
|-----------------|-------------------------------------|
| cluster.metrics | Node CPU, memory, disk metrics      |
| cluster.pods    | Pod restart count, OOM status       |
| log.error.>     | ERROR+ level logs from all services |

All events are Protocol Buffer encoded (see proto/cluster/events.proto and proto/log/events.proto).
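
A sketch of what the subscription side could look like with the nats.go client. The import path and the clusterpb.NodeMetrics message name are assumptions, since the generated protobuf types are not shown here:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
	"google.golang.org/protobuf/proto"

	clusterpb "example.com/proto/cluster" // hypothetical import path for generated code
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// cluster.metrics carries node CPU/memory/disk snapshots.
	nc.Subscribe("cluster.metrics", func(m *nats.Msg) {
		var ev clusterpb.NodeMetrics // hypothetical message name
		if err := proto.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("decode cluster.metrics: %v", err)
			return
		}
		// ... evaluate cluster alert rules against ev ...
	})

	// log.error.> matches ERROR+ logs from every service.
	nc.Subscribe("log.error.>", func(m *nats.Msg) {
		// ... decode log event and feed critical-keyword / burst rules ...
	})

	select {} // block forever
}
```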

Discord Webhook Setup

Configure the Discord webhook URL as a Kubernetes secret:

```bash
kubectl create secret generic discord-webhook-url \
  --from-literal=WEBHOOK_URL='https://discord.com/api/webhooks/...'
```

The alerting-service deployment exposes this secret to the container as an environment variable via secretKeyRef (see the Deployment manifest below).

Alert Message Format

Discord alerts are sent as rich embeds:

```json
{
  "embeds": [{
    "title": "High CPU: s0",
    "description": "Node s0 CPU at 95.2% (threshold: 90%)",
    "color": 16711680,
    "timestamp": "2026-02-14T12:34:56Z"
  }]
}
```
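
Posting such an embed needs only Go's standard library. A hedged sketch; the embed struct and send function are illustrative names, not the service's actual API:

```go
package alert

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// embed mirrors the Discord rich-embed fields shown above.
type embed struct {
	Title       string `json:"title"`
	Description string `json:"description"`
	Color       int    `json:"color"` // 0xFF0000 for Critical, 0xFFA500 for Warning
	Timestamp   string `json:"timestamp"` // RFC 3339, e.g. "2026-02-14T12:34:56Z"
}

// send posts one embed to the configured webhook URL.
func send(webhookURL string, e embed) error {
	body, err := json.Marshal(map[string][]embed{"embeds": {e}})
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("discord webhook returned %s", resp.Status)
	}
	return nil
}
```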

Severity Colors

| Severity | Color  | Hex      |
|----------|--------|----------|
| Critical | Red    | 0xFF0000 |
| Warning  | Orange | 0xFFA500 |

Cooldown Mechanism

The cooldown tracker prevents duplicate alerts within a time window:

  1. Alert fires with key cpu:s0
  2. Cooldown tracker records lastFired[cpu:s0] = now
  3. Subsequent alerts with the same key are blocked until cooldown expires
  4. After cooldown (default 5 minutes), alerts are allowed again

Different alert types have different cooldowns:

  • Disk alerts: 15 minutes (disk usage changes slowly)
  • All other alerts: 5 minutes
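
A minimal sketch of such a tracker, assuming a mutex-guarded map keyed by alert key (e.g. cpu:s0); disk alerts would pass a 15-minute cooldown, everything else 5 minutes:

```go
package alert

import (
	"sync"
	"time"
)

// cooldownTracker suppresses repeat alerts for the same key
// (e.g. "cpu:s0") until the per-alert cooldown has elapsed.
type cooldownTracker struct {
	mu        sync.Mutex
	lastFired map[string]time.Time
}

func newCooldownTracker() *cooldownTracker {
	return &cooldownTracker{lastFired: make(map[string]time.Time)}
}

// Allow reports whether an alert with this key may fire now, and if
// so records the firing time.
func (c *cooldownTracker) Allow(key string, cooldown time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastFired[key]; ok && time.Since(last) < cooldown {
		return false // still cooling down
	}
	c.lastFired[key] = time.Now()
	return true
}
```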

Restart Count Deduplication

Both OOMKilled and Pod Restart alerts use restart count tracking instead of time-based cooldown:

  1. When a restart is detected, record the pod's restart count
  2. Subsequent snapshots with the same or lower restart count are ignored (same event)
  3. Only fire again when restart count increases (new restart event)

This prevents duplicate alerts when:

  • The same OOM event persists in the pod's LastTerminationState indefinitely
  • Multiple cluster-monitor replicas publish NATS events with interleaved timestamps, causing older snapshots (lower restart count) to arrive after newer ones

The stored restart count is only updated upward, never downward, preventing the baseline from resetting.
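
A sketch of this upward-only baseline, assuming per-pod keys and int32 restart counts as in the Kubernetes API; the names are illustrative:

```go
package alert

// restartTracker fires only when a pod's restart count moves above
// the highest count already seen, so stale or repeated snapshots
// (same or lower count) never re-alert.
type restartTracker struct {
	seen map[string]int32 // pod key -> highest restart count observed
}

func newRestartTracker() *restartTracker {
	return &restartTracker{seen: make(map[string]int32)}
}

// ShouldAlert returns true only for a genuinely new restart event.
func (r *restartTracker) ShouldAlert(pod string, restarts int32) bool {
	if restarts <= r.seen[pod] {
		return false // same event, or an older snapshot arriving late
	}
	r.seen[pod] = restarts // baseline only moves upward, never resets
	return true
}
```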

Configuration

| Flag              | Description                                     | Default          |
|-------------------|-------------------------------------------------|------------------|
| -nats             | NATS server URL                                 | nats://nats:4222 |
| -discord-webhook  | Discord webhook URL for alerts                  | -                |
| -alert-cooldown   | Default alert cooldown duration                 | 5m               |
| -log-service-grpc | Log service gRPC address for structured logging | -                |
| -log-source       | Log source name (pod name)                      | alerting-service |

Alert thresholds are hardcoded: CPU 90%, memory 85%, disk 90%, error burst 5 errors in 30s.
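
For illustration, the flag table above could correspond to standard-library flag definitions like the following; the variable names are assumptions, not the service's actual identifiers:

```go
package main

import (
	"flag"
	"time"
)

// Hypothetical flag definitions matching the configuration table.
var (
	natsURL        = flag.String("nats", "nats://nats:4222", "NATS server URL")
	discordWebhook = flag.String("discord-webhook", "", "Discord webhook URL for alerts")
	alertCooldown  = flag.Duration("alert-cooldown", 5*time.Minute, "default alert cooldown duration")
	logServiceGRPC = flag.String("log-service-grpc", "", "log service gRPC address for structured logging")
	logSource      = flag.String("log-source", "alerting-service", "log source name (pod name)")
)

func main() {
	flag.Parse()
	// ... wire the flags into NATS, Discord, and logging setup ...
}
```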

Deployment

The service runs as a single-replica deployment on backend nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alerting-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alerting-service
  template:
    metadata:
      labels:
        app: alerting-service
    spec:
      nodeSelector:
        backend: "true"
      containers:
        - name: alerting-service
          image: docker.io/eddisonso/alerting-service:latest
          args:
            - -nats
            - nats://nats:4222
            - -discord-webhook
            - $(DISCORD_WEBHOOK_URL)
          env:
            - name: DISCORD_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: discord-webhook-url
                  key: WEBHOOK_URL
```

Separation of Concerns

Alerting was extracted from cluster-monitor to improve modularity:

| Service          | Responsibility                 |
|------------------|--------------------------------|
| cluster-monitor  | Collect and serve metrics      |
| log-service      | Collect and serve logs         |
| alerting-service | Evaluate rules and send alerts |

This allows:

  • Independent scaling (alerting doesn't affect metric collection)
  • Cleaner separation of data collection vs. notification logic
  • Easier testing of alert rules