Alert Rules
Alerts are composed of rules, mounts, global settings, and notification channels.
Built-in Rules
| ID | Name | Metric | Condition | Cooldown |
|---|---|---|---|---|
-1 | node_offline | node.offline | >= 1 | 0 |
-2 | raid_failed | raid.failed | >= 1 | 30 minutes |
-3 | smart_failed | disk.smart.failed | >= 1 | 30 minutes |
-4 | smart_nvme_critical_warning | disk.smart.nvme.critical_warning | >= 1 | 30 minutes |
Built-in rules are mounted by default. Rule mounts can disable or enable rules for specific nodes.
Supported Metrics
| Metric | Description | Supports core_plus |
|---|---|---|
cpu.usage_ratio | CPU usage ratio 0..1 | No |
cpu.load1 | 1-minute load | Yes |
cpu.load5 | 5-minute load | Yes |
cpu.load15 | 15-minute load | Yes |
mem.used | Used memory bytes | No |
mem.used_ratio | Memory usage ratio 0..1 | No |
disk.usage.used_ratio | Main mount disk usage ratio | No |
disk.smart.failed | Count of devices with health=failed | No |
disk.smart.nvme.critical_warning | Count of NVMe devices with non-zero critical_warning | No |
disk.smart.attribute_failing | Count of ATA SMART attributes currently in FAILING_NOW | No |
disk.smart.max_temp_c | Max SMART device temperature in C | No |
net.recv_bps | Receive rate in B/s | No |
net.sent_bps | Send rate in B/s | No |
conn.tcp | TCP connection count | No |
raid.failed | Failed RAID members or unhealthy arrays | No |
thermal.max_temp_c | Max thermal sensor temperature in C | No |
disk.usage.used_ratio uses / first. If / is missing, it falls back to the first mount for compatibility.
disk.smart.failed counts only health=failed. Collection states such as no_cache, no_tool, unsupported, and stale are not counted as disk failures.
disk.smart.nvme.critical_warning counts only devices that report critical_warning with a non-zero value. If no device reports the field, the metric is not evaluated.
disk.smart.attribute_failing counts only failing_attrs[].when_failed=FAILING_NOW. If no failing attribute data is available, the metric is not evaluated.
Operators
>>=<<===!=
Duration
Allowed values:
060300
Unit is seconds. Missing duration_sec defaults to 60 when creating a rule.
Threshold Mode
| Mode | Description |
|---|---|
static | Use threshold directly |
core_plus | Load metrics only; threshold is CPU cores + threshold + threshold_offset |
In static mode, threshold_offset must be 0.
CPU core count uses logical cores first, then physical cores.
Lifecycle
- A node report marks the node for alert evaluation.
- The alert service reads the hot snapshot.
- Enabled rules and node mount state are compiled.
- Matching conditions open events after duration is met.
- Non-matching conditions close events.
- If notification targets can be loaded, payloads are written to the notification outbox according to global settings.
- A leased worker retries notification delivery.
The alert service does not open new alert events during the first minute after startup.
If notification targets are unavailable, alert events and runtime state are still committed, but no notification outbox item is added.