08 Cluster Formation
The Problem
Nodes have overlay connectivity (07 Overlay Networking). They can send encrypted packets to each other. But connectivity isn't coordination. The nodes need to:
- Discover each other: Who else is in the org? Who just joined? Who left?
- Agree on state: What's the org's configuration? Which nodes are healthy? What workloads should run where?
- Handle disagreement: What if two partitions of the org made different changes while disconnected?
This is the distributed consensus problem, and it's one of the hardest problems in computer science. The classic solutions (Raft, Paxos) require a majority of nodes to be reachable -- if your cluster of 5 nodes loses 3, the remaining 2 can't make decisions.
FortrOS rejects this model. An org that splits in half should have both halves continue operating. When they reconnect, state merges automatically. This requires a different approach: Gossip Protocols for discovery and CRDTs for state.
The Building Blocks
Gossip: How Nodes Discover Each Other
Gossip Protocols are how nodes learn about each other without a central directory. The idea is borrowed from epidemiology: information spreads like a virus. Each node periodically picks a random peer and exchanges state. After a few rounds, every node knows about every other node.
FortrOS uses the SWIM protocol (Scalable Weakly-consistent Infection-style Membership) via the foca Rust crate. SWIM adds failure detection: nodes periodically probe each other (ping), and if a node doesn't respond, it is suspected and eventually declared dead.
Gossip provides:
- Membership: Who's in the org right now
- Failure detection: Who's unreachable
- Dissemination: Piggybacking small messages on protocol traffic
Gossip does NOT provide agreement on complex state -- it just ensures everyone eventually learns the same membership information.
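The probe/suspect/dead escalation can be sketched as a toy state machine (illustrative Python, not the foca API; real SWIM also tries indirect probes through other peers before suspecting a node):

```python
ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"

class Member:
    def __init__(self, name):
        self.name = name
        self.state = ALIVE
        self.missed = 0

def probe(member, reachable, suspect_after=1, dead_after=3):
    """One SWIM-style probe: a reachable peer is (re)marked alive;
    an unreachable one is suspected, then eventually declared dead.
    `reachable` is a stand-in for the network."""
    if member.name in reachable:
        member.state, member.missed = ALIVE, 0
    else:
        member.missed += 1
        if member.missed >= dead_after:
            member.state = DEAD
        elif member.missed >= suspect_after:
            member.state = SUSPECT

vps, laptop = Member("vps"), Member("laptop")
reachable = {"vps"}                # the laptop just dropped off Wi-Fi
probe(laptop, reachable)
assert laptop.state == SUSPECT     # first missed probe: suspected
for _ in range(3):
    probe(laptop, reachable)
assert laptop.state == DEAD        # repeated misses: declared dead
probe(vps, reachable)
assert vps.state == ALIVE
```

The suspect stage matters: it gives a briefly unreachable node time to refute the suspicion before the mesh declares it dead.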
CRDTs: How Nodes Agree Without a Leader
CRDTs (Conflict-free Replicated Data Types) are data structures where conflicts are machine-resolvable. Two nodes can make different changes simultaneously, and when those changes meet, deterministic rules resolve the disagreement without human intervention. No leader, no voting, no quorum. Each node modifies its local copy, and when copies merge, the result is mathematically guaranteed to converge to the same state on all nodes.
Conflicts still happen (two nodes changed the same thing during a partition), but the CRDTs carry enough context (causality tracking via logical counters, not wall-clock time) that pre-defined rules always determine the outcome. FortrOS deliberately avoids depending on synchronized clocks -- physical time is fragile in distributed systems (clock drift, NTP failures, adversarial manipulation). CRDTs track which operations have seen which other operations (causality), not when they happened (time). Human intervention is a last resort for genuinely ambiguous org-wide policy conflicts.
FortrOS uses several CRDT types (from the crdts Rust crate):
- Orswot (Observed-Remove Set Without Tombstones): for org membership
- MVReg (Multi-Value Register): for per-node metadata that may be concurrently modified
- GSet (Grow-only Set): for revocation lists (once revoked, always revoked)
- Map: for structured key-value state
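The convergence guarantee is easiest to see with the simplest of these types. Below is a toy grow-only set (the real implementation is the crdts crate's GSet): merge is set union, which is commutative and idempotent, so two replicas that diverged during a partition reach the same state no matter which direction the merge runs.

```python
class GSet:
    """Toy grow-only set CRDT: state is a set, merge is set union.
    Elements can only be added, never removed, so concurrent updates
    never conflict -- this mirrors how revocation lists are modeled."""
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        merged = GSet()
        merged.items = self.items | other.items
        return merged

# Two nodes revoke different credentials while partitioned.
a, b = GSet(), GSet()
a.add("key-1")
b.add("key-2")

# Merge is commutative: both orders converge to the same state.
ab = a.merge(b)
ba = b.merge(a)
assert ab.items == ba.items == {"key-1", "key-2"}
```

Orswot and MVReg are more elaborate (they track causal context so removals and concurrent overwrites resolve deterministically), but the shape is the same: local writes always succeed, and merge is a pure function of the two states.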
Merkle Trees: Detecting Divergence
Merkle Trees are hash trees where each leaf is a hash of a data block, and each non-leaf is a hash of its children. The root hash is a fingerprint of the entire dataset. If any leaf changes, the root hash changes.
FortrOS uses Merkle trees to detect state divergence efficiently: two nodes compare root hashes (32 bytes). If they match, the nodes are in sync. If they differ, they walk the tree to find exactly which leaves differ, exchanging only the changed data.
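The walk can be sketched as a recursive comparison that descends only into subtrees whose roots mismatch (a simplified sketch: it recomputes subtree roots on each call, whereas a real implementation caches interior hashes):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def root(leaves):
    """Merkle root of a list of byte-string leaves (odd node promoted)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(*level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

def diff(a, b, lo=0, hi=None):
    """Return indices of leaves that differ, descending only into
    index ranges whose subtree roots mismatch."""
    hi = len(a) if hi is None else hi
    if root(a[lo:hi]) == root(b[lo:hi]):
        return []                      # identical subtree: skip entirely
    if hi - lo == 1:
        return [lo]                    # single differing leaf found
    mid = (lo + hi) // 2
    return diff(a, b, lo, mid) + diff(a, b, mid, hi)

a = [b"alpha", b"beta", b"gamma", b"delta"]
b = [b"alpha", b"beta", b"GAMMA", b"delta"]
assert root(a) != root(b)
assert diff(a, b) == [2]   # only leaf 2 changed; equal subtrees skipped
```

For n leaves with k changes, the walk touches O(k log n) hashes instead of transferring all n leaves, which is what makes the 32-byte root comparison cheap to follow up on.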
TreeSync: Efficient State Transfer
TreeSync is FortrOS's protocol for synchronizing state between nodes. When gossip broadcasts indicate a hash mismatch (one node's state has changed), TreeSync handles the actual data transfer:
- Client sends its top-level Merkle tree hashes (~4KB)
- Server walks divergent subtrees
- Server returns only the changed leaves (~1KB for 1 changed leaf, even in a 10,000-node org)
- Client merges the received leaves into its local CRDTs
This is far more efficient than sending the full state on every change.
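The request/response shape of the pull can be sketched as follows (hypothetical data layout: state is bucketed, with one top-level hash per bucket; names like `serve_pull` are illustrative, not the wire protocol):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def top_hashes(state: dict) -> dict:
    """One hash per bucket -- the 'top-level Merkle hashes' the client sends."""
    return {bucket: h(repr(sorted(entries.items())).encode())
            for bucket, entries in state.items()}

def serve_pull(server_state, client_hashes):
    """Server side: return only the buckets whose hash differs
    from what the client sent."""
    mine = top_hashes(server_state)
    return {bucket: server_state[bucket]
            for bucket in mine
            if client_hashes.get(bucket) != mine[bucket]}

server = {"nodes": {"vps": "healthy", "laptop": "suspect"},
          "config": {"log_level": "info"}}
client = {"nodes": {"vps": "healthy", "laptop": "healthy"},
          "config": {"log_level": "info"}}

delta = serve_pull(server, top_hashes(client))
assert list(delta) == ["nodes"]         # only the divergent bucket is sent
client["nodes"].update(delta["nodes"])  # stand-in for the CRDT merge
```

In the real protocol the final step is a CRDT merge under the tree's resolution schema, not a blind overwrite; the sketch only shows why the transfer stays small.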
How Others Do It
Kubernetes (etcd): Strong Consensus
Kubernetes stores all cluster state in etcd, which uses the Raft consensus algorithm. Raft requires a majority of etcd nodes to be reachable for any write to succeed. A 3-node etcd cluster tolerates 1 failure; a 5-node cluster tolerates 2.
Strength: Strong consistency -- all reads return the latest written value. Weakness: Cannot operate during a network partition if the majority is unreachable. The cluster is "correct but unavailable."
Consul (Serf + Raft): Two Layers
Consul uses two coordination mechanisms: Serf (gossip, SWIM-based) for membership and failure detection, and Raft for the key-value store and service catalog. Serf handles "who's in the cluster" (eventually consistent). Raft handles "what's the configuration" (strongly consistent).
Strength: Best-of-both: fast membership via gossip, strong config via Raft. Weakness: Raft still requires a majority of server nodes. Consul agents (non-servers) can't participate in Raft.
Riak: CRDTs for Real
Riak is a distributed database that uses CRDTs for conflict resolution. Multiple clients can write to the same key simultaneously, and Riak merges the values using CRDT semantics. No leader election, no quorum for writes (configurable consistency levels).
Strength: Available during partitions. Writes always succeed locally. Weakness: Eventually consistent -- reads may return stale data briefly. CRDT merge semantics can be surprising (concurrent counter increments work, but concurrent "set value to X" creates a multi-value register).
The Tradeoffs
| Approach | Partition tolerance | Consistency | Write availability | Complexity |
|---|---|---|---|---|
| Raft (etcd, Consul servers) | Majority required | Strong | Blocks without majority | Low (single leader) |
| Gossip + CRDTs (FortrOS, Riak) | Both halves operate | Eventual | Always (local writes) | Higher (merge semantics) |
| Gossip only (Serf) | Both halves operate | Eventual (membership) | Always | Low (membership only) |
FortrOS chose gossip + CRDTs because the org must survive arbitrary network partitions. A homelab where one machine is on a laptop (Wi-Fi, intermittent) and another is a VPS (always connected) shouldn't require both to be reachable for either to operate.
How FortrOS Does It
State Trees: A Reusable Primitive
FortrOS's replicated state system is built around state trees -- a reusable primitive that any service can use. A state tree wraps a CRDT dataset with a Merkle tree for efficient sync. Services register a state tree with the maintainer and immediately get: gossip-based change notification, Merkle-efficient sync, and conflict resolution via a chosen schema. No CRDT code in the service itself.
The current state trees in the org:
| Name | Contents | Resolution Schema |
|---|---|---|
| OrgOperational | Membership, per-node metadata, revocations | Self-authoritative / grow-only |
| OrgConfig | Org-wide declarative settings | Precondition-based |
| WorkloadDesired | What workloads should run (specs, placement) | Self-authoritative per owner |
| WorkloadObserved | What workloads are actually running (status) | Self-authoritative per reporter |
Tree registration should use name-based lookup (services register by name, the system assigns an ID) rather than hardcoded numeric IDs. Hardcoded IDs are a collision risk as third-party services and org-specific extensions register their own trees. The grouping of datasets into trees is driven by sync granularity -- data that changes together and is consumed together belongs in one tree.
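A minimal sketch of such name-based registration (hypothetical `TreeRegistry`; the real system would also need the name-to-ID mapping to be consistent org-wide, e.g. derived from the name or agreed via the config tree):

```python
class TreeRegistry:
    """Name-based tree registration: services register by name and the
    registry assigns the numeric wire ID, so third-party services
    cannot collide on hardcoded numbers."""
    def __init__(self):
        self._by_name = {}
        self._next_id = 0

    def register(self, name: str) -> int:
        if name in self._by_name:       # idempotent re-registration
            return self._by_name[name]
        tree_id = self._next_id
        self._next_id += 1
        self._by_name[name] = tree_id
        return tree_id

reg = TreeRegistry()
assert reg.register("OrgOperational") == 0
assert reg.register("WorkloadDesired") == 1
assert reg.register("OrgOperational") == 0   # same name, same ID
```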
The Sync Protocol
The protocol has two layers: gossip broadcasts are hints that something changed, and TreeSync over TCP is the reliable merge/resolution mechanism.
Gossip layer (hints):
- A node modifies its local CRDT. The state tree recalculates its Merkle root.
- The node broadcasts its recent Merkle root hashes (a configurable history ring) via foca gossip (~65 bytes per broadcast).
- Receiving nodes compare the broadcast hashes against their own history. If a peer is broadcasting a hash the receiver has already moved past, the receiver knows it has newer data. If the hash is unknown, a sync is needed.
- Only the originator of a change broadcasts. Nodes that receive state via TreeSync do not re-broadcast -- this prevents convergence storms where partial merges trigger cascading broadcasts. Every node still converges because gossip is probabilistic (each round picks random peers) and TreeSync pulls are also triggered periodically, not only on hash mismatch. A missed broadcast is caught by the next gossip round or periodic pull.
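The receiver's decision in step 3 can be sketched as a comparison against a local ring of recent roots (illustrative; `HashHistory` and the classification names are hypothetical):

```python
from collections import deque

class HashHistory:
    """Ring of this node's recent Merkle roots, newest last
    (the history depth is the configurable ring size)."""
    def __init__(self, size=8):
        self.ring = deque(maxlen=size)

    def record(self, root_hash):
        self.ring.append(root_hash)

    def classify(self, peer_root):
        if self.ring and peer_root == self.ring[-1]:
            return "in-sync"        # peer broadcasts our current root
        if peer_root in self.ring:
            return "peer-behind"    # we already moved past that root
        return "sync-needed"        # unknown root: pull from the peer

hist = HashHistory()
for r in ("r1", "r2", "r3"):
    hist.record(r)
assert hist.classify("r3") == "in-sync"
assert hist.classify("r1") == "peer-behind"
assert hist.classify("r9") == "sync-needed"
```

Broadcasting a short history rather than a single hash is what lets a receiver distinguish "the peer is behind me" from "the peer has something I don't", so only the latter triggers a TreeSync pull.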
TreeSync layer (reliable resolution):
- Pull: TCP connection over WireGuard. The client sends its top-level Merkle tree hashes (~4KB). The server walks divergent subtrees and returns only changed leaves.
- Merge: The client merges received leaves into its local CRDTs using the tree's conflict resolution schema.
- Push-notify: After pulling, the client sends its new hash to the server. If hashes still differ (bidirectional changes), the server triggers a reverse pull.
- Convergence: After 1-2 rounds of bidirectional TreeSync, all participating nodes have identical state.
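The pull / push-notify / reverse-pull sequence can be sketched end to end. Assume, for illustration only, that each node's state is a grow-only set, so a "pull" reduces to unioning in the peer's entries:

```python
class Node:
    """Toy node whose replicated state is a grow-only set
    (merge = union), standing in for the full CRDT dataset."""
    def __init__(self, state):
        self.state = set(state)

    def root(self):
        return hash(frozenset(self.state))  # stand-in for the Merkle root

    def pull_from(self, server):
        self.state |= server.state          # merge received leaves

a = Node({"x"})      # changed one way during the partition
b = Node({"y"})      # changed differently during the partition

a.pull_from(b)                    # round 1: client pulls from server
if a.root() != b.root():          # push-notify: hashes still differ,
    b.pull_from(a)                #   so the server runs a reverse pull
assert a.state == b.state == {"x", "y"}
```

With one-directional divergence the first pull already converges; bidirectional changes need the reverse pull, which is why convergence takes at most two rounds per pair.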
Enrollment Completion
The enrollment that started in 03 Trust and Identity completes here:
- Preboot authenticated, got TPM credentials (chapter 3)
- /persist unlocked, keys accessible (chapter 4)
- kexec'd into main OS (chapter 5)
- s6-rc started all services (chapter 6)
- WireGuard overlay connected (chapter 7)
- Now: Maintainer joins gossip mesh, presents enrollment nonce
- Provisioner validates nonce, promotes pending -> confirmed
- Confirmed enrollment gossips to all nodes via CRDT
- Node is a full org member
Conflict Resolution Schemas
FortrOS provides a set of resolution schemas that state trees choose from at registration. Services pick a schema and immediately benefit from automatic conflict resolution -- no conflict-handling code in the service itself.
| Schema | How It Works | Current Users |
|---|---|---|
| Self-authoritative | Each actor owns its own entries. Only the actor can modify its data. No conflict possible. | Node metadata, workload observed status |
| Precondition-based | Changes carry a precondition ("set X to 5, if current is 3"). After merge, unmet preconditions are rejected and the originator is informed. | Org config |
| Grow-only | Data can only be added, never removed. Merge is set union. No conflicts possible. | Revocation lists |
| CRDT-native | The CRDT's built-in merge semantics handle it (add-wins for Orswot, multi-value for MVReg). | Membership |
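The self-authoritative schema is worth a sketch, since it is the one most trees use. Each entry is keyed by its owning actor, and a merge keeps the local actor's own entries while taking the peer's entries verbatim (a simplification: the real types track per-entry causality, e.g. via MVReg, rather than trusting a whole pull):

```python
def merge_self_authoritative(local, remote, local_actor):
    """Each actor is the only writer of its own entries, so the merge
    takes the remote entries verbatim but never lets them overwrite
    entries owned by the local actor. Keys are (actor, field)."""
    merged = dict(remote)
    for key, value in local.items():
        if key[0] == local_actor:   # our own entries always win locally
            merged[key] = value
    return merged

local  = {("laptop", "cpu"): "42%", ("vps", "cpu"): "10%"}   # stale view of vps
remote = {("laptop", "cpu"): "old", ("vps", "cpu"): "55%"}   # stale view of laptop
merged = merge_self_authoritative(local, remote, "laptop")
assert merged[("laptop", "cpu")] == "42%"   # we own our entry
assert merged[("vps", "cpu")] == "55%"      # vps owns its entry
```

Because write authority is partitioned by actor, no two nodes can ever produce a genuine conflict on the same entry.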
The general principle behind all schemas is locality-wins, which ties directly to the three-state confirmation pattern:
- The local partition's change was confirmed -- it was enacted, the partition could observe the resource and verify the result.
- The remote partition's change was pending -- it was intended, but the partition couldn't reach the resource to verify.
- When the partitions reconnect and merge, the pending change's precondition no longer holds (the resource has already been changed by the confirmed side). The pending change is rejected, and the originator is informed so they can decide whether to re-apply against the current state.
This means conflict resolution isn't a special case -- it's the normal three-state lifecycle applied to concurrent changes.
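The lifecycle above can be sketched as a single resolution step (illustrative field names; the real change record also carries causal context):

```python
PENDING, CONFIRMED, REJECTED = "pending", "confirmed", "rejected"

def resolve(confirmed_value, change):
    """Apply a pending precondition-based change after a merge.
    The change reads 'set X to new, if current is expected'."""
    if change["expected"] == confirmed_value:
        change["state"] = CONFIRMED
        return change["new"]
    change["state"] = REJECTED      # originator is informed and may retry
    return confirmed_value

# During a partition: side A confirmed 3 -> 5 (it could observe the
# resource); side B's change to 4 stayed pending (it could not verify).
value = 5
side_b = {"expected": 3, "new": 4, "state": PENDING}
value = resolve(value, side_b)
assert value == 5                      # the confirmed change stands
assert side_b["state"] == REJECTED     # the pending change loses
```

Note that the rejected change is not silently dropped: surfacing it to the originator is what turns an automatic merge into a decision the operator can revisit.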
Stage Boundary
What This Stage Produces
After cluster formation:
- The node is a full org member (confirmed enrollment)
- Org state is replicated and converging via gossip + TreeSync
- The node can participate in org decisions (config changes, workload placement)
- Failure detection is active (SWIM protocol monitors all peers)
What Is Handed Off
The replicated state enables:
- 09 Running Workloads: WorkloadDesired tree tells reconcilers what to run
- 10 Sustaining the Org: OrgConfig tree propagates configuration changes
What This Stage Does NOT Do
- It does not run workloads (that's 09 Running Workloads)
- It does not handle upgrades (that's 10 Sustaining the Org)
- It does not provide strong consistency (CRDTs are eventually consistent)
- It does not handle large data replication (that's the storage layer)
Further Reading
Concepts:
- Gossip Protocols -- SWIM protocol and membership management
- CRDTs -- Machine-resolvable conflict-free data structures
- Merkle Trees -- Hash trees for divergence detection
- Three-State Pattern -- How conflicts are tracked and surfaced
- Org Bootstrap -- How the first nodes form an org from zero
- Topology Map -- Physical topology driving placement and alerting
Services:
- Maintainer -- The gossip participant and CRDT owner on each node
FortrOS implementation:
- 04-distributed-state.md -- CRDT design, conflict resolution, TreeSync protocol