Alert Rules

Alerts are composed of rules, mounts, global settings, and notification channels.

Built-in Rules

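| ID | Name | Metric | Condition | Cooldown |
|----|------|--------|-----------|----------|
| -1 | node_offline | node.offline | >= 1 | 0 |
| -2 | raid_failed | raid.failed | >= 1 | 30 minutes |
| -3 | smart_failed | disk.smart.failed | >= 1 | 30 minutes |
| -4 | smart_nvme_critical_warning | disk.smart.nvme.critical_warning | >= 1 | 30 minutes |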

Built-in rules are mounted by default. Rule mounts can disable or enable rules for specific nodes.

Supported Metrics

| Metric | Description | Supports core_plus |
|--------|-------------|--------------------|
| cpu.usage_ratio | CPU usage ratio 0..1 | No |
| cpu.load1 | 1-minute load | Yes |
| cpu.load5 | 5-minute load | Yes |
| cpu.load15 | 15-minute load | Yes |
| mem.used | Used memory bytes | No |
| mem.used_ratio | Memory usage ratio 0..1 | No |
| disk.usage.used_ratio | Main mount disk usage ratio | No |
| disk.smart.failed | Count of devices with health=failed | No |
| disk.smart.nvme.critical_warning | Count of NVMe devices with non-zero critical_warning | No |
| disk.smart.attribute_failing | Count of ATA SMART attributes currently in FAILING_NOW | No |
| disk.smart.max_temp_c | Max SMART device temperature in C | No |
| net.recv_bps | Receive rate in B/s | No |
| net.sent_bps | Send rate in B/s | No |
| conn.tcp | TCP connection count | No |
| raid.failed | Failed RAID members or unhealthy arrays | No |
| thermal.max_temp_c | Max thermal sensor temperature in C | No |

disk.usage.used_ratio reports usage for the / mount when it is present. If / is missing, it falls back to the first mount for compatibility.
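The mount selection can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name pick_usage_mount and the mount_point field are hypothetical, chosen only to show the "/ first, then first mount" rule.

```python
from typing import Optional


def pick_usage_mount(mounts: list[dict]) -> Optional[dict]:
    """Pick the mount that disk.usage.used_ratio would be read from.

    Assumes each mount is a dict with a hypothetical "mount_point" key.
    """
    # Prefer the root mount wherever it appears in the list.
    for mount in mounts:
        if mount.get("mount_point") == "/":
            return mount
    # Compatibility fallback: the first mount, or None if there are none.
    return mounts[0] if mounts else None
```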

disk.smart.failed counts only health=failed. Collection states such as no_cache, no_tool, unsupported, and stale are not counted as disk failures.

disk.smart.nvme.critical_warning counts only devices that report critical_warning with a non-zero value. If no device reports the field, the metric is not evaluated.

disk.smart.attribute_failing counts only failing_attrs[].when_failed=FAILING_NOW. If no failing attribute data is available, the metric is not evaluated.
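The counting rules for the three SMART metrics can be sketched like this. The field names health, critical_warning, failing_attrs, and when_failed come from the descriptions above; the device dict layout and the function names are assumptions, and None stands in for "the metric is not evaluated".

```python
from typing import Optional


def smart_failed(devices: list[dict]) -> int:
    # Only health == "failed" counts; no_cache, no_tool, unsupported,
    # and stale are collection states, not disk failures.
    return sum(1 for d in devices if d.get("health") == "failed")


def nvme_critical_warning(devices: list[dict]) -> Optional[int]:
    # Consider only devices that actually report the field.
    reported = [d["critical_warning"] for d in devices
                if d.get("critical_warning") is not None]
    if not reported:
        return None  # no device reports the field: not evaluated
    return sum(1 for value in reported if value != 0)


def attribute_failing(devices: list[dict]) -> Optional[int]:
    seen_data = False
    count = 0
    for d in devices:
        attrs = d.get("failing_attrs")
        if attrs is None:
            continue  # assumption: a missing key means no data for this device
        seen_data = True
        count += sum(1 for a in attrs if a.get("when_failed") == "FAILING_NOW")
    return count if seen_data else None
```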

Operators

  • >
  • >=
  • <
  • <=
  • ==
  • !=
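
A condition check over these six operators might look like the sketch below. The function name condition_matches is hypothetical; the comparison functions come from Python's standard operator module.

```python
import operator

# Map each supported operator string to its comparison function.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def condition_matches(value: float, op: str, threshold: float) -> bool:
    """Return True when `value <op> threshold` holds."""
    try:
        compare = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return compare(value, threshold)
```

For example, the built-in raid_failed rule corresponds to condition_matches(value, ">=", 1).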

Duration

Allowed values:

  • 0
  • 60
  • 300

Unit is seconds. Missing duration_sec defaults to 60 when creating a rule.
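
A sketch of how the default and the allowed values could be applied at rule creation. The function name resolve_duration_sec and the dict-shaped rule are assumptions; only the duration_sec field, the allowed values, and the default of 60 come from the text above.

```python
ALLOWED_DURATIONS_SEC = (0, 60, 300)
DEFAULT_DURATION_SEC = 60


def resolve_duration_sec(rule: dict) -> int:
    """Return the rule's duration in seconds, applying the creation default."""
    # A missing key falls back to the default of 60 seconds.
    value = rule.get("duration_sec", DEFAULT_DURATION_SEC)
    # Reject booleans explicitly, since False == 0 in Python.
    if isinstance(value, bool) or value not in ALLOWED_DURATIONS_SEC:
        raise ValueError(
            f"duration_sec must be one of {ALLOWED_DURATIONS_SEC}, got {value!r}"
        )
    return value
```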

Threshold Mode

| Mode | Description |
|------|-------------|
| static | Use threshold directly |
| core_plus | Load metrics only; effective threshold is CPU cores + threshold + threshold_offset |

In static mode, threshold_offset must be 0.

The CPU core count is taken from logical cores first, falling back to physical cores.
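
The two modes can be sketched as a single function. The function name, its parameters, and the choice to raise an error when no core count is known are assumptions; the text above does not say what happens in that case.

```python
from typing import Optional

# Metrics marked "Supports core_plus: Yes" in the Supported Metrics table.
LOAD_METRICS = {"cpu.load1", "cpu.load5", "cpu.load15"}


def effective_threshold(
    metric: str,
    mode: str,
    threshold: float,
    threshold_offset: float = 0.0,
    logical_cores: Optional[int] = None,
    physical_cores: Optional[int] = None,
) -> float:
    """Compute the value a metric is compared against."""
    if mode == "static":
        if threshold_offset != 0:
            raise ValueError("threshold_offset must be 0 in static mode")
        return threshold
    if mode == "core_plus":
        if metric not in LOAD_METRICS:
            raise ValueError(f"core_plus is not supported for {metric}")
        # Logical cores first, then physical cores.
        cores = logical_cores or physical_cores
        if not cores:
            # Assumption: treat an unknown core count as an error.
            raise ValueError("CPU core count unavailable")
        return cores + threshold + threshold_offset
    raise ValueError(f"unknown threshold mode: {mode!r}")
```

With 8 logical cores, a core_plus rule on cpu.load5 with threshold 1.0 and threshold_offset 0.5 is compared against 9.5.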

Lifecycle

  1. A node report marks the node for alert evaluation.
  2. The alert service reads the hot snapshot.
  3. Enabled rules and node mount state are compiled.
  4. Matching conditions open events after duration is met.
  5. Non-matching conditions close events.
  6. If notification targets can be loaded, payloads are written to the notification outbox according to global settings.
  7. A leased worker retries notification delivery.

The alert service does not open new alert events during the first minute after startup.

If notification targets are unavailable, alert events and runtime state are still committed, but no notification outbox item is added.
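
Steps 4 to 6, the startup delay, and the unavailable-targets case can be sketched with in-memory state. Everything named here (AlertState, evaluate, the outbox list) is hypothetical. The sketch also assumes that the duration timer keeps running during the first minute and that both open and close transitions are queued; in the real service, what gets queued depends on global settings.

```python
from dataclasses import dataclass, field
from typing import Hashable, Optional

STARTUP_GRACE_SEC = 60  # no new events during the first minute after startup


@dataclass
class AlertState:
    started_at: float
    matching_since: dict = field(default_factory=dict)  # key -> first match time
    open_events: set = field(default_factory=set)
    outbox: list = field(default_factory=list)


def evaluate(state: AlertState, key: Hashable, matches: bool,
             duration_sec: int, now: float,
             targets: Optional[list]) -> Optional[str]:
    """Evaluate one (node, rule) pair; return "opened", "closed", or None."""
    if matches:
        since = state.matching_since.setdefault(key, now)
        if key in state.open_events:
            return None  # already open
        if now - since < duration_sec:
            return None  # duration not met yet
        if now - state.started_at < STARTUP_GRACE_SEC:
            return None  # still inside the startup delay
        state.open_events.add(key)
        transition = "opened"
    else:
        state.matching_since.pop(key, None)
        if key not in state.open_events:
            return None
        state.open_events.remove(key)
        transition = "closed"
    # Event state is committed above even when targets cannot be loaded;
    # only the outbox write is skipped in that case.
    if targets is not None:
        state.outbox.append((key, transition))
    return transition
```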