Monitoring and Self-Observation
What It Is
FortrOS monitors itself. No Prometheus, no Grafana, no external monitoring agents. Every maintainer collects health metrics from its host and shares them via gossip. The org knows its own health without any infrastructure beyond what it already has.
The interface is a topology-aware map: pannable, zoomable, showing the org's physical structure (regions, sites, racks, nodes) with color-coded health states and tiered notifications that suppress noise by correlating failures to root causes.
Why It Matters
Traditional monitoring is bolted on: install Prometheus, configure scrape targets, deploy Grafana, set up alerting rules, maintain the monitoring infrastructure alongside the thing it monitors. The monitoring system itself can fail, go stale, or disagree with reality.
FortrOS's maintainer already talks to every node via gossip. It already knows who's alive, who's unreachable, and what state the org is in. Making health metrics part of gossip is a natural extension, not a separate system.
How It Works
Self-Reporting
Each maintainer collects metrics from its own host at regular intervals (10 seconds to 5 minutes depending on the metric type):
| Category | Metrics |
|---|---|
| System | CPU per-core utilization, RAM (used/available/zram ratio), swap per-device breakdown, disk I/O |
| Hardware | SMART attributes, temperatures, fan speeds, GPU utilization |
| Services | VM states, service health, process supervision events (s6 restarts) |
| Org | Gossip round-trip times, CRDT sync state, cert validity, pending-confirmation counts |
Metrics are stored in ephemeral in-memory ring buffers on each node. On reboot, local history is lost -- but other nodes retain gossip-derived summaries. This is intentional: long-term storage is an optional org service, not baked into the monitoring layer.
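The per-node storage described above can be sketched as a fixed-capacity ring buffer plus a compact rollup for gossip. This is an illustrative sketch, not the FortrOS implementation -- the class name, capacity, and summary fields are assumptions:

```python
import time
from collections import deque

class MetricRing:
    """In-memory ring buffer for one metric series (hypothetical sketch)."""

    def __init__(self, capacity=360):
        # deque(maxlen=...) drops the oldest sample when full, giving
        # ring-buffer semantics without manual index arithmetic.
        self.samples = deque(maxlen=capacity)

    def record(self, value, ts=None):
        self.samples.append((ts if ts is not None else time.time(), value))

    def summary(self):
        """Compact rollup suitable for gossiping to peers."""
        values = [v for _, v in self.samples]
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "last": values[-1],
        }

# A 360-slot ring at a 10-second interval holds one hour of local history;
# on reboot it starts empty, but peers keep the last gossiped summary.
cpu = MetricRing(capacity=360)
for pct in (12.0, 55.0, 38.0):
    cpu.record(pct)
print(cpu.summary()["avg"])  # -> 35.0
```

The local ring is deliberately lossy: only the summary leaves the node via gossip, which keeps the gossip payload bounded regardless of sample rate.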
Topology-Aware Alerting
The Topology Map drives notification hierarchy. Alerts are tiered by scope and correlated by infrastructure:
| Tier | Scope | Example | Suppresses |
|---|---|---|---|
| Tier 1 | Org-wide | Network partition, multi-site outage | Everything below |
| Tier 2 | Site/rack | Switch group down, PDU failure | Individual host alerts behind the failed infrastructure |
| Tier 3 | Individual host | Disk SMART warning, service crash | Nothing |
Root cause correlation: If 3 hosts on the same switch go offline simultaneously, the alert is "switch-group-1 connectivity issue" -- not 3 separate "host unreachable" alerts. A service failure preceded by disk errors links the two: "service X failed due to disk errors on /dev/sdb."
This correlation uses the topology map: nodes that share a failure domain (same rack, same switch, same PDU) are correlated when they fail together. Individual failures are reported individually. Infrastructure failures are reported as infrastructure, with affected nodes listed underneath.
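A minimal sketch of this correlation, assuming a host-to-failure-domain mapping (all names here are illustrative, not actual FortrOS identifiers): hosts that fail together behind one switch collapse into a single Tier 2 alert, while a lone failure stays a Tier 3 host alert.

```python
from collections import defaultdict

# Hypothetical topology: host -> (region, site, rack, switch group).
TOPOLOGY = {
    "host-a": ("eu", "site-1", "rack-2", "switch-group-1"),
    "host-b": ("eu", "site-1", "rack-2", "switch-group-1"),
    "host-c": ("eu", "site-1", "rack-2", "switch-group-1"),
    "host-d": ("eu", "site-1", "rack-3", "switch-group-2"),
}

def correlate(failed_hosts, min_group=2):
    """Collapse co-failing hosts that share a failure domain into one alert."""
    by_switch = defaultdict(list)
    for host in failed_hosts:
        by_switch[TOPOLOGY[host][3]].append(host)

    alerts = []
    for switch, hosts in sorted(by_switch.items()):
        if len(hosts) >= min_group:
            # Tier 2: one infrastructure alert suppresses per-host alerts.
            alerts.append({"tier": 2, "cause": f"{switch} connectivity issue",
                           "affected": sorted(hosts)})
        else:
            # Tier 3: lone failures are reported individually.
            alerts.extend({"tier": 3, "cause": f"{h} unreachable",
                           "affected": [h]} for h in hosts)
    return alerts

alerts = correlate({"host-a", "host-b", "host-c", "host-d"})
# -> one "switch-group-1 connectivity issue" alert (3 hosts)
#    plus one individual alert for host-d
```

The same grouping can be applied at each topology level (PDU, rack, site) in turn, from broadest to narrowest, so the widest matching failure domain wins.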
The Map Interface
The admin interface is a visual topology map served by any maintainer (no dedicated monitoring server). The UI is web-based (WebSocket for live updates) and shows:
- Org level: Regions and sites, with aggregate health per site
- Site level: Racks and nodes, with color-coded health states (green = healthy, yellow = degraded, red = critical, gray = unreachable, pulsing = state change in progress)
- Node level: Click a node to see its metrics, services, VMs, disk health, and gossip state
- Drill-down: The admin interface asks the relevant node (or a peer in its zone) for detailed metrics. Detail stays where it's relevant, not replicated everywhere.
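The site-level color in the list above implies a roll-up rule. One plausible sketch -- an assumption, not the documented behavior -- is worst-of: a site inherits the most severe state among its nodes.

```python
# Severity order for the roll-up; higher is worse.
SEVERITY = {"healthy": 0, "degraded": 1, "critical": 2, "unreachable": 3}

def site_health(node_states):
    """Aggregate per-node states into one color for the site-level view."""
    if not node_states:
        return "unreachable"  # nothing reporting from this site
    return max(node_states, key=lambda s: SEVERITY[s])

print(site_health(["healthy", "degraded", "healthy"]))  # -> degraded
```

Worst-of keeps a single red node visible at every zoom level; a proportional scheme (e.g. yellow only past some fraction of unhealthy nodes) would trade that visibility for less flicker on large sites.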
No External Dependencies
The monitoring system requires nothing beyond what FortrOS already provides:
- Data collection: Maintainer reads local /proc, /sys, SMART, s6 state
- Data distribution: Gossip carries health summaries
- Alerting: Maintainer evaluates alert rules locally + correlates via topology
- UI: Any maintainer serves the web interface
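Local rule evaluation from the list above can be as simple as thresholds over the latest snapshot. The rules and metric names below are hypothetical placeholders:

```python
# Hypothetical local alert rules: (metric, threshold, level).
RULES = [
    ("cpu_pct", 90.0, "degraded"),
    ("disk_reallocated_sectors", 1, "critical"),
]

def evaluate(metrics):
    """Fire every rule whose metric meets or exceeds its threshold."""
    fired = []
    for metric, threshold, level in RULES:
        value = metrics.get(metric)
        if value is not None and value >= threshold:
            fired.append({"metric": metric, "value": value, "level": level})
    return fired

fired = evaluate({"cpu_pct": 97.5, "disk_reallocated_sectors": 0})
# -> one "degraded" alert for cpu_pct
```

Because each maintainer evaluates rules against its own metrics, alerting keeps working during a partition -- each side of the split still watches its own nodes.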
If the org wants long-term analytics (capacity planning, trend analysis), Prometheus can be deployed as a tier 2 org service that scrapes maintainers. But it's optional -- the built-in system handles real-time health and alerting without it.
How FortrOS Uses It
- Partition detection: If gossip splits into isolated groups, the alert shows the partition boundary, not just "hosts unreachable."
- Hardware lifecycle: SMART prediction surfaces "disk likely to fail within weeks" alerts. The placement service proactively re-replicates shards off degrading disks before they fail.
- Rolling upgrade tracking: During a rolling upgrade (see 10 Sustaining the Org), the map shows which nodes are upgraded, which are pending, and which are being drained. The admin sees the upgrade's progress geographically.
- Workload health: VM and container health (from the reconciler's observed state) is overlaid on the node map. A failed workload shows on the node that was running it.
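Partition detection, the first item above, reduces to finding connected components in the gossip reachability graph: more than one component means a split, and the component boundary is the partition boundary. A sketch under that assumption:

```python
def partitions(nodes, reachable_pairs):
    """Group nodes into gossip partitions (connected components).

    `reachable_pairs` is the set of node pairs still exchanging gossip;
    more than one returned group indicates a partition.
    """
    adj = {n: set() for n in nodes}
    for a, b in reachable_pairs:
        adj[a].add(b)
        adj[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(adj[n] - group)  # walk only unvisited neighbors
        seen |= group
        groups.append(sorted(group))
    return sorted(groups)

groups = partitions(
    ["a", "b", "c", "d"],
    [("a", "b"), ("c", "d")],  # link between {a,b} and {c,d} is down
)
# Two components => one partition alert naming the boundary,
# instead of each side raising "host unreachable" for the other.
```

In practice each side of a split only sees its own component, so the full picture emerges once gossip heals -- but even one-sided, "half the org vanished at once" is enough to report a partition rather than mass host failures.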
Links
- Topology Map -- The hierarchical failure domain model
- Gossip Protocols -- How health data propagates
- 08 Cluster Formation -- How nodes discover and track each other