Alerting Service

The Alerting Service monitors cluster health and service logs, sending Discord notifications when alert conditions are met.

Features

  • Cluster Monitoring: Evaluates node CPU, memory, disk, and pod health
  • Log Monitoring: Detects critical keywords and error bursts
  • Cooldown Management: Prevents alert spam with configurable cooldowns
  • Event-Driven: Consumes NATS events from cluster-monitor and log-service
  • Discord Webhooks: Sends rich embedded alerts to Discord channels

Architecture

The service consumes Protocol Buffer events from cluster-monitor and log-service over NATS, evaluates alert rules against them, and delivers any resulting alerts to Discord via webhook.

Alert Rules

Cluster Alerts

| Alert Type     | Trigger Condition                       | Severity | Cooldown   |
|----------------|-----------------------------------------|----------|------------|
| High CPU       | Node CPU > 90% for 2 consecutive checks | Critical | 5 minutes  |
| High Memory    | Node memory > 85%                       | Warning  | 5 minutes  |
| High Disk      | Node disk > 90%                         | Warning  | 15 minutes |
| Node Condition | Node has MemoryPressure/DiskPressure    | Critical | 5 minutes  |
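
The "2 consecutive checks" condition for High CPU means a single spike never alerts on its own. A minimal sketch of how such a rule could track consecutive breaches; the names and structure are illustrative, not taken from the actual implementation:

```go
package rules

// cpuRule tracks how many consecutive snapshots a node has exceeded
// the CPU threshold; the alert fires only from the second breach on.
type cpuRule struct {
	threshold float64        // e.g. 90.0
	breaches  map[string]int // node name -> consecutive breach count
}

func newCPURule(threshold float64) *cpuRule {
	return &cpuRule{threshold: threshold, breaches: make(map[string]int)}
}

// Evaluate returns true when the node should alert: CPU above the
// threshold for two consecutive checks.
func (r *cpuRule) Evaluate(node string, cpuPercent float64) bool {
	if cpuPercent <= r.threshold {
		r.breaches[node] = 0 // reset on any healthy reading
		return false
	}
	r.breaches[node]++
	return r.breaches[node] >= 2
}
```

Resetting the counter on any healthy reading is what makes the condition "consecutive" rather than cumulative.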

Pod Alerts

| Alert Type  | Trigger Condition                   | Severity | Cooldown                         |
|-------------|-------------------------------------|----------|----------------------------------|
| OOMKilled   | Container terminated with OOMKilled | Critical | Per-event (tracks restart count) |
| Pod Restart | Pod restart count increased         | Warning  | Per-event (tracks restart count) |

Log Alerts

| Alert Type   | Trigger Condition                      | Severity | Cooldown  |
|--------------|----------------------------------------|----------|-----------|
| Critical Log | Logs contain "panic", "fatal", "crash" | Critical | 5 minutes |
| Error Burst  | 5+ errors in 30s window                | Warning  | 5 minutes |
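
The error-burst rule is a sliding-window count. One way this could be implemented, assuming a 30-second window and a threshold of five errors (a sketch, not the service's actual code):

```go
package rules

import "time"

// errorBurst keeps timestamps of recent errors and reports a burst
// when at least `limit` errors fall inside the window.
// Example: burst := &errorBurst{window: 30 * time.Second, limit: 5}
type errorBurst struct {
	window time.Duration
	limit  int
	events []time.Time
}

// Record adds one error event and returns true if the burst
// condition (limit errors within window) is now met.
func (b *errorBurst) Record(now time.Time) bool {
	// Drop events that have slid out of the window.
	cutoff := now.Add(-b.window)
	kept := b.events[:0]
	for _, t := range b.events {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	b.events = append(kept, now)
	return len(b.events) >= b.limit
}
```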

NATS Subscriptions

The service subscribes to:

| Subject Pattern | Description                         |
|-----------------|-------------------------------------|
| cluster.metrics | Node CPU, memory, disk metrics      |
| cluster.pods    | Pod restart count, OOM status       |
| log.error.>     | ERROR+ level logs from all services |

All events are Protocol Buffer encoded (see proto/cluster/events.proto and proto/log/events.proto).
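
A sketch of what the subscription side could look like with the nats.go client. The import path and the clusterpb.NodeMetrics message name are assumptions, since the generated protobuf types are not shown here:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
	"google.golang.org/protobuf/proto"

	clusterpb "example.com/proto/cluster" // hypothetical import path for generated code
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// cluster.metrics carries node CPU/memory/disk snapshots.
	nc.Subscribe("cluster.metrics", func(m *nats.Msg) {
		var ev clusterpb.NodeMetrics // hypothetical message name
		if err := proto.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("decode cluster.metrics: %v", err)
			return
		}
		// ... evaluate cluster alert rules against ev ...
	})

	// log.error.> matches ERROR+ logs from every service.
	nc.Subscribe("log.error.>", func(m *nats.Msg) {
		// ... decode log event and feed critical-keyword / burst rules ...
	})

	select {} // block forever
}
```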

Discord Webhook Setup

Configure the Discord webhook URL as a Kubernetes secret:

```bash
kubectl create secret generic discord-webhook-url \
  --from-literal=WEBHOOK_URL='https://discord.com/api/webhooks/...'
```

The alerting-service deployment exposes this secret to the container as an environment variable via secretKeyRef (see the Deployment manifest below).

Alert Message Format

Discord alerts are sent as rich embeds:

```json
{
  "embeds": [{
    "title": "High CPU: s0",
    "description": "Node s0 CPU at 95.2% (threshold: 90%)",
    "color": 16711680,
    "timestamp": "2026-02-14T12:34:56Z"
  }]
}
```
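
Posting such an embed needs only Go's standard library. A hedged sketch; the embed struct and send function are illustrative names, not the service's actual API:

```go
package alert

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// embed mirrors the Discord rich-embed fields shown above.
type embed struct {
	Title       string `json:"title"`
	Description string `json:"description"`
	Color       int    `json:"color"` // 0xFF0000 for Critical, 0xFFA500 for Warning
	Timestamp   string `json:"timestamp"` // RFC 3339, e.g. "2026-02-14T12:34:56Z"
}

// send posts one embed to the configured webhook URL.
func send(webhookURL string, e embed) error {
	body, err := json.Marshal(map[string][]embed{"embeds": {e}})
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("discord webhook returned %s", resp.Status)
	}
	return nil
}
```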

Severity Colors

| Severity | Color  | Hex      |
|----------|--------|----------|
| Critical | Red    | 0xFF0000 |
| Warning  | Orange | 0xFFA500 |

Cooldown Mechanism

The cooldown tracker prevents duplicate alerts within a time window:

  1. Alert fires with key cpu:s0
  2. Cooldown tracker records lastFired[cpu:s0] = now
  3. Subsequent alerts with the same key are blocked until cooldown expires
  4. After cooldown (default 5 minutes), alerts are allowed again

Different alert types have different cooldowns:

  • Disk alerts: 15 minutes (disk usage changes slowly)
  • All other alerts: 5 minutes
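
A minimal sketch of such a tracker, assuming a mutex-guarded map keyed by alert key (e.g. cpu:s0); disk alerts would pass a 15-minute cooldown, everything else 5 minutes:

```go
package alert

import (
	"sync"
	"time"
)

// cooldownTracker suppresses repeat alerts for the same key
// (e.g. "cpu:s0") until the per-alert cooldown has elapsed.
type cooldownTracker struct {
	mu        sync.Mutex
	lastFired map[string]time.Time
}

func newCooldownTracker() *cooldownTracker {
	return &cooldownTracker{lastFired: make(map[string]time.Time)}
}

// Allow reports whether an alert with this key may fire now, and if
// so records the firing time.
func (c *cooldownTracker) Allow(key string, cooldown time.Duration) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastFired[key]; ok && time.Since(last) < cooldown {
		return false // still cooling down
	}
	c.lastFired[key] = time.Now()
	return true
}
```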

Restart Count Deduplication

Both OOMKilled and Pod Restart alerts use restart count tracking instead of time-based cooldown:

  1. When a restart is detected, record the pod's restart count
  2. Subsequent snapshots with the same or lower restart count are ignored (same event)
  3. Only fire again when restart count increases (new restart event)

This prevents duplicate alerts when:

  • The same OOM event persists in the pod's LastTerminationState indefinitely
  • Multiple cluster-monitor replicas publish NATS events with interleaved timestamps, causing older snapshots (lower restart count) to arrive after newer ones

The stored restart count is only updated upward, never downward, preventing the baseline from resetting.
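
A sketch of this upward-only baseline, assuming per-pod keys and int32 restart counts as in the Kubernetes API; the names are illustrative:

```go
package alert

// restartTracker fires only when a pod's restart count moves above
// the highest count already seen, so stale or repeated snapshots
// (same or lower count) never re-alert.
type restartTracker struct {
	seen map[string]int32 // pod key -> highest restart count observed
}

func newRestartTracker() *restartTracker {
	return &restartTracker{seen: make(map[string]int32)}
}

// ShouldAlert returns true only for a genuinely new restart event.
func (r *restartTracker) ShouldAlert(pod string, restarts int32) bool {
	if restarts <= r.seen[pod] {
		return false // same event, or an older snapshot arriving late
	}
	r.seen[pod] = restarts // baseline only moves upward, never resets
	return true
}
```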

Configuration

| Flag              | Description                                     | Default          |
|-------------------|-------------------------------------------------|------------------|
| -nats             | NATS server URL                                 | nats://nats:4222 |
| -discord-webhook  | Discord webhook URL for alerts                  | -                |
| -alert-cooldown   | Default alert cooldown duration                 | 5m               |
| -log-service-grpc | Log service gRPC address for structured logging | -                |
| -log-source       | Log source name (pod name)                      | alerting-service |

Alert thresholds are hardcoded: CPU 90%, memory 85%, disk 90%, error burst 5 errors in 30s.
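
For illustration, the flag table above could correspond to standard-library flag definitions like the following; the variable names are assumptions, not the service's actual identifiers:

```go
package main

import (
	"flag"
	"time"
)

// Hypothetical flag definitions matching the configuration table.
var (
	natsURL        = flag.String("nats", "nats://nats:4222", "NATS server URL")
	discordWebhook = flag.String("discord-webhook", "", "Discord webhook URL for alerts")
	alertCooldown  = flag.Duration("alert-cooldown", 5*time.Minute, "default alert cooldown duration")
	logServiceGRPC = flag.String("log-service-grpc", "", "log service gRPC address for structured logging")
	logSource      = flag.String("log-source", "alerting-service", "log source name (pod name)")
)

func main() {
	flag.Parse()
	// ... wire the flags into NATS, Discord, and logging setup ...
}
```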

Deployment

The service runs as a single-replica deployment on backend nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alerting-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alerting-service
  template:
    metadata:
      labels:
        app: alerting-service
    spec:
      nodeSelector:
        backend: "true"
      containers:
        - name: alerting-service
          image: docker.io/eddisonso/alerting-service:latest
          args:
            - -nats
            - nats://nats:4222
            - -discord-webhook
            - $(DISCORD_WEBHOOK_URL)
          env:
            - name: DISCORD_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: discord-webhook-url
                  key: WEBHOOK_URL
```

Separation of Concerns

Alerting was extracted from cluster-monitor to improve modularity:

| Service          | Responsibility                 |
|------------------|--------------------------------|
| cluster-monitor  | Collect and serve metrics      |
| log-service      | Collect and serve logs         |
| alerting-service | Evaluate rules and send alerts |

This allows:

  • Independent scaling (alerting doesn't affect metric collection)
  • Cleaner separation of data collection vs. notification logic
  • Easier testing of alert rules