source: docs/guide/chapters/10 Sustaining the Org.md

10 Sustaining the Org

The Problem

The org is running: nodes are booted, services are up, workloads are reconciled (09 Running Workloads). Now comes the hardest part: keeping it running.

Software needs updates. Security patches need deployment. Configuration needs to change. Nodes fail and need replacement. All of this must happen without downtime, without data loss, and without a human manually managing each machine.

This is the operational lifecycle -- the difference between a demo and a production system.

Rolling Upgrades

The Challenge

Upgrading a distributed system is dangerous. If you update all nodes at once and the update is broken, the entire org goes down. If you update one at a time, the org runs a mixed-version cluster during the transition.

FortrOS uses rolling upgrades with canary verification: update one node, verify it's healthy, then proceed to the next. If the canary fails, stop and roll back.

LUKS Keyslot Rotation

Recall from 04 Disk Encryption: the LUKS key is derived from preboot_secret + ca_pubkey + generation_id. A new generation has a new generation_id, which means a new LUKS key. The upgrade process manages this transition:

  1. prepare-upgrade (on the node): Derive the new generation's LUKS key and add it as a new keyslot (cryptsetup luksAddKey). Both old and new keyslots now exist. The node can boot either generation.
  2. Install new generation: Write the new kernel + initramfs to /persist. Set it as the current boot target.
  3. Reboot: The node reboots; the preboot selects the new generation and unlocks /persist with the new keyslot.
  4. Verify: The boot watchdog confirms the maintainer is healthy. The generation is marked "ok."
  5. cleanup-upgrade (on the node): Remove the old generation's keyslot (cryptsetup luksKillSlot). The node can now only boot the new generation.

During steps 1-4, the node has both keyslots. If the new generation fails, the node reboots into the old generation (which still has a valid keyslot). Only after explicit cleanup is the old path removed.
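The dual-keyslot window above hinges on the key being a pure function of the generation. A minimal sketch of that property -- the concrete KDF is unspecified in this chapter, so a plain SHA-256 stands in for it, and the function name is hypothetical:

```python
import hashlib

def derive_luks_key(preboot_secret: bytes, ca_pubkey: bytes, generation_id: int) -> bytes:
    """Derive a per-generation LUKS key (sketch; the real KDF is an assumption).

    The inputs mirror the chapter: preboot_secret + ca_pubkey + generation_id.
    Because generation_id is mixed in, a new generation yields a new key,
    which is why upgrades must rotate keyslots rather than reuse one.
    """
    material = preboot_secret + ca_pubkey + generation_id.to_bytes(8, "big")
    return hashlib.sha256(material).digest()

# During steps 1-4 of an upgrade, BOTH of these keys unlock /persist:
secret, ca = b"preboot-secret", b"ca-pubkey"
old_key = derive_luks_key(secret, ca, generation_id=7)
new_key = derive_luks_key(secret, ca, generation_id=8)
assert old_key != new_key  # new generation -> new key -> new keyslot
```

The same derivation run on either side of the reboot produces the same key, which is what lets `prepare-upgrade` add the new keyslot before the new generation has ever booted.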

The Rolling Sequence

For the entire org:

for each node (low-impact first):
  1. Drain: migrate workloads off this node
  2. prepare-upgrade: add new LUKS keyslot
  3. Reboot into new generation
  4. Verify OS: wait for boot watchdog + maintainer ready
  5. Hydrate: reconciler restores drained workloads on the upgraded node
  6. Verify workloads: wait for workload health checks to stabilize
  7. If healthy:
       cleanup-upgrade (remove old keyslot)
       proceed to next node
  8. If OS unhealthy:
       node reboots to old generation (still has old keyslot)
       STOP rolling upgrade, investigate
  9. If OS healthy but workloads break:
       drain the broken workloads back off
       roll back node to old generation
       STOP rolling upgrade, investigate
       (old generation's keyslot still exists until explicit cleanup)

Verification is two-stage, following the three-state pattern: the OS boots (untested -> ok via boot watchdog), then workloads are restored and tested (untested -> ok via health checks). An OS that boots but breaks workloads is caught at step 6 before the upgrade proceeds.

For major upgrades, the canary node can run a workload smoke test suite (standard test workloads that exercise the runtime before real workloads are restored). For incremental config changes, the normal workload health checks are sufficient.

This is the canary pattern: one node at a time, verify OS AND workloads before proceeding, stop on failure. The org is never fully down -- at most one node is being upgraded at a time.
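The sequence above can be sketched as a loop. Every operation name here is hypothetical (a `SimOps` stand-in for the real maintainer API); the point is the control flow: two verification gates per node, and a hard stop on the first failure.

```python
class SimOps:
    """Simulated per-node operations; every name is a stand-in, not a real API."""
    def __init__(self, os_failures=(), workload_failures=()):
        self.os_failures = set(os_failures)
        self.workload_failures = set(workload_failures)
        self.log = []

    def impact(self, node): return node          # pretend name order = impact order
    def drain(self, node): self.log.append(("drain", node))
    def prepare_upgrade(self, node): self.log.append(("add-keyslot", node))
    def reboot(self, node): self.log.append(("reboot", node))
    def os_healthy(self, node): return node not in self.os_failures
    def hydrate(self, node): self.log.append(("hydrate", node))
    def workloads_healthy(self, node): return node not in self.workload_failures
    def rollback(self, node): self.log.append(("rollback", node))
    def cleanup_upgrade(self, node): self.log.append(("kill-old-keyslot", node))

def rolling_upgrade(nodes, ops):
    """Canary rolling upgrade: one node at a time, stop on first failure."""
    upgraded = []
    for node in sorted(nodes, key=ops.impact):   # low-impact first
        ops.drain(node)                          # 1. migrate workloads off
        ops.prepare_upgrade(node)                # 2. add new LUKS keyslot
        ops.reboot(node)                         # 3. boot the new generation
        if not ops.os_healthy(node):             # 4. boot watchdog + maintainer
            return upgraded, f"stopped: {node} OS unhealthy"   # node fell back to old gen
        ops.hydrate(node)                        # 5. restore drained workloads
        if not ops.workloads_healthy(node):      # 6. workload health checks
            ops.drain(node)
            ops.rollback(node)                   # old keyslot still present
            return upgraded, f"stopped: {node} workloads unhealthy"
        ops.cleanup_upgrade(node)                # 7. remove old keyslot
        upgraded.append(node)
    return upgraded, "ok"
```

Note that the failure branches return without touching the remaining nodes: nodes already upgraded stay upgraded, and everything after the failed canary is left on the old generation.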

Configuration Changes

CRDT-Based Propagation

Org configuration changes are written to the OrgConfig CRDT (state tree 0x02) and propagate via gossip. No SSH, no configuration management tools, no push-based deployment. Write the change on any node, gossip carries it to all nodes, reconcilers converge.

This is the same level-triggered pattern as workload management: the desired state is declared, each node reads the current desired state, and converges toward it. If a node misses a gossip round, it catches up on the next TreeSync pull.

Conflict Resolution: Locality-Wins

When two partitions of the org make different config changes during a split, FortrOS uses locality-wins: the partition that was local to the affected resource wins.

The intuition: imagine two data centers lose connectivity. Both admins use the admin service to change a setting. Datacenter A's change propagates to its nodes and takes effect -- A can verify the result is correct. Datacenter B's change propagates to its nodes, but B can't see A's resources. When connectivity returns and the CRDTs merge, A's change to A's resources wins (it was enacted and verified), and B's change to B's resources wins (same reason). A node's own metadata is self-authoritative -- it knows its own status better than a remote partition that can't reach it.

For org-wide config changes (not per-node), the story depends on what changed:

  • Different config keys: Merge cleanly. Both changes apply. No conflict.
  • Same config key, different values: Precondition-based resolution. Each change carries a precondition: "set max_replicas to 5, but only if the current value is 3." After merge, neither precondition matches (the other side already changed the value), so both changes become explicitly conflicted -- surfaced to the admin via the three-state pattern (pending -> conflicted) rather than silently resolved.
  • Revocations: No conflict possible. GSet is grow-only. Both sides revoked a node? Still revoked. Different nodes? Both revoked.

In practice, two partitions making conflicting org-wide config changes simultaneously is rare and usually indicates someone should be coordinating.
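A sketch of the same-key case, with illustrative shapes (in this simplification each change implicitly carries the precondition "current value is the base value"; the real CRDT attaches preconditions per change):

```python
def merge_concurrent(base: dict, changes_a: dict, changes_b: dict):
    """Merge two partitions' org-wide config changes (illustrative shapes).

    changes_* map key -> new value. Different keys merge cleanly; the same
    key changed to different values on both sides means neither side's
    precondition can hold after merge, so both changes are surfaced as
    conflicted (pending -> conflicted) instead of silently resolved.
    """
    merged, conflicted = dict(base), {}
    for key in set(changes_a) | set(changes_b):
        a, b = changes_a.get(key), changes_b.get(key)
        if a is not None and b is not None and a != b:
            conflicted[key] = (a, b)     # surfaced to the admin, value unchanged
            continue
        merged[key] = a if a is not None else b
    return merged, conflicted
```

For the example in the text: both sides start from `max_replicas: 3`, one sets 5 and the other sets 7; the merge leaves the value at 3 and reports `(5, 7)` as conflicted.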

Config Changes During a Rolling Upgrade

What happens if a new config change is pushed via CRDT while a rolling upgrade is in progress? The level-triggered model handles this naturally:

  • Already-upgraded nodes see the config change via gossip and converge to the new desired state (they have the new generation + new config).
  • Not-yet-upgraded nodes also see the config change and converge (they have the old generation + new config).
  • The node currently being upgraded picks up the config change after reboot, during workload reconciliation.

No special coordination needed. The rolling upgrade is "reboot nodes one at a time with the new generation." The config change is "update desired state in the CRDT." These are independent operations that compose: each node converges to whatever the current desired state says, regardless of which generation it's running.

If the config change is incompatible with the old generation (requires the new generation's features), the not-yet-upgraded nodes can't converge -- the reconciler detects the incompatibility and surfaces it. The admin can either wait for those nodes to be upgraded (the rollout will get to them) or prioritize upgrading them next.
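One way to sketch that gate, assuming the desired config can declare a minimum generation (the `min_generation` key and the return shapes are hypothetical):

```python
def converge(node_generation: int, desired_config: dict):
    """Level-triggered convergence with a generation compatibility gate.

    If the desired config requires a generation the node doesn't run yet,
    the reconciler surfaces the incompatibility instead of half-applying;
    the node converges once the rolling upgrade reaches it.
    """
    required = desired_config.get("min_generation", 0)
    if node_generation < required:
        return ("incompatible", required)   # surfaced; retried next gossip cycle
    return ("converged", desired_config)
```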

Out-of-Band Management

When a node is unresponsive (kernel hang, hardware issue), the normal management path (gossip, SSH over overlay) doesn't work. Out-of-Band Management provides recovery through channels independent of the node's OS.

The OOB mechanism depends on where the node runs:

| Environment | OOB Channel | Capabilities |
| --- | --- | --- |
| Bare metal (Intel vPro) | AMT via Intel ME | Power cycle, remote console, PXE boot |
| Bare metal (AMD PRO) | AMD DASH | Power cycle, health monitoring |
| Bare metal (server) | IPMI / BMC (iLO, iDRAC) | Power cycle, serial console, virtual media |
| VPS | Provider API / web console | Hard reset, serial console, rescue mode, ISO mount |
| VM on Proxmox/vSphere | Hypervisor management API | Power cycle, console, snapshot/restore |
| Consumer hardware | Physical access | Power button |

The self-healing loop (when OOB is available):

  1. The org detects a node failure via SWIM gossip (no response to probes)
  2. The maintainer on another node sends a reset via the appropriate OOB channel (AMT for bare metal, provider API for VPS, hypervisor API for VMs)
  3. The failed node reboots from its local preboot UKI
  4. Normal boot chain: authenticate -> unlock /persist -> kexec -> rejoin org
  5. The org heals itself -- no human, no USB stick, no physical access

For nodes with no OOB channel (consumer hardware with an ignore or neutralize ME policy and no IPMI), recovery requires physical access.
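Step 2 of the loop is essentially a dispatch on the table above. A sketch with hypothetical channel names (`send_reset` stands in for the maintainer's actual transport):

```python
# Hypothetical channel names keyed by the environments in the table above.
OOB_CHANNELS = {
    "bare_metal_vpro": "amt",
    "bare_metal_server": "ipmi",
    "vps": "provider_api",
    "vm": "hypervisor_api",
    "consumer": None,                      # no OOB: physical access required
}

def recover(node_env: str, send_reset) -> str:
    """Self-healing step 2: pick the OOB channel and power-cycle the node.

    Returns the channel used, or "manual" when no OOB channel exists and
    a human with physical access is the only recovery path.
    """
    channel = OOB_CHANNELS.get(node_env)
    if channel is None:
        return "manual"
    send_reset(channel)                    # node reboots into its preboot UKI
    return channel
```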

Self-Healing Patterns

Node Replacement

A permanently failed node (dead hardware) is replaced by:

  1. Revoking the failed node's membership (removes from CRDT)
  2. Gossip propagates the revocation
  3. Other nodes remove the dead node's WireGuard peer
  4. A new machine enrolls with a new identity
  5. The org's storage layer re-replicates shards from the failed node's surviving copies to the new node

Workload Recovery

When a node fails, the reconcilers on surviving nodes detect that its workloads are no longer observed (WorkloadObserved tree 0x04 shows no heartbeat). The reconcilers schedule replacement workloads on available nodes per the placement constraints in WorkloadDesired.
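A sketch of that rescheduling step, with illustrative shapes (a placement constraint is reduced here to a set of allowed nodes; the real WorkloadDesired constraints are richer):

```python
def reschedule(desired: dict, observed: dict, live_nodes: list):
    """Reschedule workloads whose node stopped heartbeating (illustrative).

    desired:  workload -> set of nodes allowed by its placement constraint
    observed: workload -> node currently reporting a heartbeat, if any
    Returns workload -> replacement node for every unobserved workload
    that still has a live candidate; unplaceable workloads stay pending.
    """
    placements = {}
    for workload, allowed in desired.items():
        node = observed.get(workload)
        if node in live_nodes:
            continue                               # still running, nothing to do
        candidates = [n for n in live_nodes if n in allowed]
        if candidates:
            placements[workload] = candidates[0]   # naive first-fit placement
    return placements
```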

Split-Merge Recovery

When a network partition heals and two halves of the org reconnect:

  1. Gossip detects new peers (SWIM)
  2. TreeSync pulls trigger CRDT merges
  3. CRDTs resolve automatically (Orswot for membership, MVReg for metadata, GSet for revocations)
  4. Reconcilers on all nodes re-evaluate desired state and converge

If genuinely conflicting changes were made (both halves changed the same config in incompatible ways), the conflict is surfaced for human resolution. This is rare and requires deliberate concurrent org-wide changes on both sides of a partition.
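The reason step 3 needs no coordination is that every CRDT merge is commutative, associative, and idempotent, so the two halves converge to the same state no matter who syncs with whom first. The GSet used for revocations makes this easiest to see (plain Python sets stand in for the real type):

```python
def merge_gset(a: frozenset, b: frozenset) -> frozenset:
    """GSet merge is set union: grow-only, so nothing can be un-revoked."""
    return a | b

# The three laws that make split-merge automatic, checked on small examples:
x = frozenset({"node1"})
y = frozenset({"node2"})
z = frozenset({"node1", "node3"})
assert merge_gset(x, y) == merge_gset(y, x)                                # commutative
assert merge_gset(merge_gset(x, y), z) == merge_gset(x, merge_gset(y, z))  # associative
assert merge_gset(x, x) == x                                               # idempotent
```

Orswot and MVReg satisfy the same three laws with more complex merge functions, which is what lets membership and metadata resolve automatically while genuinely concurrent same-key writes are kept for human resolution.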

How Others Do It

Kubernetes: Rolling Deployment

Kubernetes rolling deployments update pods one at a time, checking readiness probes between steps. If a pod fails readiness, the rollout pauses. Rollback is kubectl rollout undo.

Strength: Mature, well-tested, handles complex rollout strategies (canary, blue-green). Weakness: Only manages application deployments, not the underlying OS or kernel.

Talos: Upgrade via API

Talos upgrades the OS by pulling a new container image and rebooting. Upgrades are triggered via gRPC API (talosctl upgrade). The upgrade process is atomic: new image is written, machine reboots, if it fails to come back healthy the old image is still on the other A/B partition.

Strength: Atomic, self-contained. Weakness: Requires manual trigger per node (no built-in rolling orchestration across the cluster).

NixOS: Rebuild and Switch

NixOS rebuilds the system from a declarative config (nixos-rebuild switch). Each rebuild creates a new boot generation. Rollback is selecting a previous generation from the boot menu.

Strength: Any generation is one reboot away. Full system reproducibility. Weakness: Requires network access to the Nix store during rebuild. No built-in cluster orchestration.

The Tradeoffs

| Feature | Kubernetes | Talos | NixOS | FortrOS |
| --- | --- | --- | --- | --- |
| What's upgraded | Containers | Entire OS image | Entire OS config | Kernel generation |
| Rollback | Deployment history | A/B partition | Boot menu generations | Generation health markers |
| Orchestration | Built-in rolling | Manual per-node | Manual per-node | Rolling with canary |
| LUKS key rotation | N/A | N/A | N/A | Yes (keyslot per generation) |
| Self-healing | Pod restart/reschedule | Reboot | Manual | AMT power-cycle + auto-rejoin |
| Split-merge | N/A (requires majority) | N/A (requires majority) | N/A (single node) | CRDT auto-merge |

Summary: The Full Boot-to-Org Chain

This guide has walked through the entire lifecycle:

| Stage | Chapter | What Happens |
| --- | --- | --- |
| Power on | 01 Power and Firmware | Firmware initializes hardware, loads boot target |
| Find OS | 02 Finding the OS | Firmware loads preboot UKI from ESP (or PXE) |
| Trust | 03 Trust and Identity | Preboot authenticates to org via TPM credentials |
| Encrypt | 04 Disk Encryption | Preboot derives LUKS key, unlocks /persist |
| Real OS | 05 Loading the Real OS | Preboot kexec's into generation kernel |
| Services | 06 Init and Services | s6-rc starts all services in dependency order |
| Network | 07 Overlay Networking | WireGuard mesh connects to other nodes |
| Cluster | 08 Cluster Formation | Gossip + CRDTs establish org membership |
| Workloads | 09 Running Workloads | Reconciler starts containers and VMs |
| Sustain | 10 Sustaining the Org | Rolling upgrades, config changes, self-healing |

Each stage has clean boundaries: explicit inputs, outputs, and "does NOT do" constraints. The preboot handles stages 1-5 (firmware through kexec). The main OS handles stages 6-10 (init through ongoing operations).

Disk Layout Management

The org controls disk layout as declarative state, just like config and workload placement. Each node's disk layout is desired state in the org's CRDTs, and the maintainer reconciles toward it.

First Boot: Deterministic Defaults

On first boot, the preboot has no org connection yet. It probes the hardware (how many disks, what type, how much RAM) and applies a deterministic default layout:

  • ESP on the fastest disk (512MB)
  • Hibernate partition sized to RAM (fastest disk)
  • /persist at 2GB (fastest disk)

These are the minimum partitions needed for the preboot to function. No org involvement needed. The sizes are computable from the hardware probe alone.
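Since the sizes are computable from the probe alone, the default layout is a pure function. A sketch with illustrative shapes (the probe tuple and partition dicts are hypothetical; the roles and sizes are the ones listed above):

```python
def default_layout(ram_bytes: int, disks: list) -> dict:
    """First-boot default layout, derived from the hardware probe alone.

    disks: (name, size_bytes, speed_rank) tuples, lower rank = faster.
    Only the fastest disk gets the minimum partitions the preboot needs;
    everything else waits for the org's desired layout after enrollment.
    """
    MiB, GiB = 1024 ** 2, 1024 ** 3
    fastest = min(disks, key=lambda d: d[2])[0]
    return {
        fastest: [
            {"role": "esp", "size": 512 * MiB},        # boot files
            {"role": "hibernate", "size": ram_bytes},  # sized to RAM
            {"role": "persist", "size": 2 * GiB},      # encrypted state
        ]
    }
```

Being deterministic matters: two boots on the same hardware must compute the same layout, or the preboot could not recognize its own partitions.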

Post-Enrollment: Org Takes Over

Once the node joins the org, it reports its hardware inventory via gossip. The org (via admin policy or automated placement logic) pushes a desired disk layout:

node/<pubkey>/disk_layout:
  nvme0n1:
    - role: esp, size: 512M
    - role: hibernate, size: 32G
    - role: persist, size: 2G
    - role: pool, size: rest
  sda:
    - role: swap, size: 50G
    - role: pool, size: rest
  pool_policy:
    scratch_limit: 200G
    shards: rest

The maintainer reads the desired layout, compares to the actual layout (from disk-probe), and reconciles. New partitions are created. Pool limits are adjusted. Swap devices are added or resized.

Live Rebalancing

The org can change disk layout at runtime:

  • Grow scratch for a database workload: Increase scratch_limit. The dm-thin pool adjusts immediately (thin provisioning = no data movement needed, just allow more allocation).
  • Shrink scratch to make room for shards: Decrease scratch_limit. If scratch usage exceeds the new limit, migrate VMs off this node first (level-triggered: reconciler drains workloads, then maintainer shrinks the limit).
  • Add a new disk: Admin plugs in a disk. disk-probe detects it, reports to org. Org pushes updated layout with the new disk. Maintainer partitions it (swap + pool). Shards begin distributing to the new storage.
  • Disk failure: Node reports missing disk to org. Org recalculates: "with this disk gone, these shards are at risk." Redistributes shards to other nodes. The failed disk's pool partitions are marked degraded.

This is the same level-triggered pattern as workload placement and config changes. The org says WHAT it wants. The node makes it so. Drift is detected and corrected on every gossip cycle.
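The reconcile step can be sketched as a diff between desired and actual layout. The shapes are illustrative (each layout maps disk -> role -> size), and a real maintainer would additionally order the actions safely, e.g. drain before any shrink:

```python
def layout_diff(desired: dict, actual: dict) -> list:
    """Compute the actions that converge the actual layout toward desired.

    Returns (action, disk, role, size) tuples: "create" for missing
    partitions, "resize" for partitions whose size differs. Drift in
    either direction is corrected the same way, every gossip cycle.
    """
    actions = []
    for disk, parts in desired.items():
        have = actual.get(disk, {})
        for role, size in parts.items():
            if role not in have:
                actions.append(("create", disk, role, size))
            elif have[role] != size:
                actions.append(("resize", disk, role, size))
    return actions
```

An empty diff means the node has converged; that is the steady state the maintainer idles in between desired-layout changes.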
