Gossip Protocols
What It Is
A gossip protocol is a communication pattern where nodes in a network share information by periodically exchanging messages with randomly selected peers. Information spreads like a rumor: tell a few people, they each tell a few people, and soon everyone knows.
The key property: no central coordinator. No node has a special role. Any node can learn any information from any other node. If some nodes fail, the information still spreads through the remaining nodes.
Why It Matters
In a distributed system, you need to answer: "Who's in the cluster? Who's alive? Who just joined? Who left?" Without gossip, you'd need a central registry (single point of failure) or all-to-all heartbeats (O(N^2) traffic).
Gossip scales: each node talks to a constant number of peers per round, regardless of cluster size, so total traffic grows as O(N) per round. Information reaches all N nodes in O(log N) rounds with high probability. The protocol is self-healing -- if a node misses an update in one round, it picks it up in a later one.
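The scaling claim can be checked with a small simulation. The sketch below is illustrative only and is not FortrOS code: it models plain push gossip in which every informed node tells `fanout` uniformly random peers per round. The names `rounds_to_full_spread` and `messages_per_round` are hypothetical.

```python
import math
import random


def rounds_to_full_spread(n, fanout=1, seed=0):
    """Count push-gossip rounds until all n nodes know a rumor from node 0."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly_told = set()
        for _node in informed:
            for _ in range(fanout):
                # Uniform random peer; may be itself or an already-informed node.
                newly_told.add(rng.randrange(n))
        informed |= newly_told
        rounds += 1
    return rounds


def messages_per_round(n, fanout=1):
    """Messages sent cluster-wide in one round under each scheme."""
    return {"gossip": n * fanout, "all_to_all": n * (n - 1)}


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, rounds_to_full_spread(n), round(math.log2(n), 1),
              messages_per_round(n))
```

With a fanout of 1 the informed set can at most double each round, so the round count can never drop below log2(N). In practice it lands a small multiple above that, while all-to-all traffic grows quadratically.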
How It Works
SWIM (Scalable Weakly-consistent Infection-style Membership)
FortrOS uses SWIM, a specific gossip protocol designed for membership and failure detection. SWIM has two components:
Failure detection (ping/ping-req):
- Each round, a node picks a random peer and sends a ping
- If the peer responds, it's alive
- If there is no response within the timeout, the node picks K other random peers and sends each a ping-req, asking it to probe the suspect on the node's behalf
- If none of the K peers gets a response, the suspect is declared failed
This indirect probing reduces false positives caused by temporary network issues between two specific nodes.
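The probe steps above can be sketched as a single function. This is a minimal model under stated assumptions, not the foca API: the network is a hypothetical `can_reach(a, b)` predicate with symmetric links, timeouts are reduced to "reachable or not", and `make_network` and `probe` are illustrative names.

```python
import random


def make_network(down=(), broken_links=()):
    """Build a can_reach(a, b) predicate for a toy network with symmetric links."""
    down = set(down)
    broken = {frozenset(link) for link in broken_links}

    def can_reach(a, b):
        return a not in down and b not in down and frozenset((a, b)) not in broken

    return can_reach


def probe(prober, target, members, can_reach, k=3, rng=random):
    """Return 'alive' or 'failed' for target after one SWIM-style probe round."""
    # Step 1: direct ping.
    if can_reach(prober, target):
        return "alive"
    # Step 2: ping-req through up to k random helpers.
    helpers = [m for m in members if m not in (prober, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # The helper must reach both the prober and the target.
        if can_reach(prober, helper) and can_reach(helper, target):
            return "alive"
    # Step 3: nobody got an answer.
    return "failed"


if __name__ == "__main__":
    members = ["a", "b", "c", "d", "e"]
    flaky_link = make_network(broken_links=[("a", "b")])
    print(probe("a", "b", members, flaky_link))               # indirect path saves b
    print(probe("a", "b", members, make_network(down=["b"]))) # b really is down
```

A broken link between `a` and `b` alone does not mark `b` as failed, because any helper can still reach it. Only when `b` is unreachable from every chosen helper does the probe report a failure.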
Dissemination (piggybacking): SWIM piggybacks membership updates (joins, failures, metadata) on the ping and ping-req messages. There is no separate broadcast -- information spreads as part of the failure detection traffic. This is efficient: membership dissemination adds no extra messages beyond what failure detection already sends, only a few extra bytes per message.
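One way to picture piggybacking is a small buffer of pending updates that each outgoing ping draws from. The sketch below is an assumption-laden illustration, not how foca implements it: `PiggybackBuffer`, `retransmit_mult`, and the log-scaled retransmit limit are hypothetical choices, loosely following the idea of preferring updates that have been sent the fewest times and retiring them after a bounded number of sends.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PiggybackBuffer:
    cluster_size: int
    retransmit_mult: int = 3  # hypothetical tuning knob
    pending: dict = field(default_factory=dict)  # update -> times sent so far

    def add(self, update):
        """Queue a membership update for dissemination."""
        self.pending[update] = 0

    def limit(self):
        """Sends per update before it is retired (grows with log of cluster size)."""
        return self.retransmit_mult * math.ceil(math.log2(self.cluster_size + 1))

    def take(self, max_updates):
        """Pick updates to attach to the next ping, least-sent first."""
        chosen = sorted(self.pending, key=self.pending.get)[:max_updates]
        for update in chosen:
            self.pending[update] += 1
            if self.pending[update] >= self.limit():
                del self.pending[update]  # sent often enough; stop gossiping it
        return chosen


if __name__ == "__main__":
    buf = PiggybackBuffer(cluster_size=7)
    buf.add("node-47 failed")
    ping = {"type": "ping", "updates": buf.take(max_updates=2)}
    print(ping)
```

Capping `max_updates` keeps each ping small, and the retransmit limit stops old news from circulating forever.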
FortrOS Extensions
FortrOS extends SWIM (via the foca Rust crate) with:
- Hash digests: Each gossip round, nodes include the Merkle root hash of their state trees (~65 bytes). Receivers compare hashes to detect state divergence.
- Hash history: Each digest includes a configurable number of recent Merkle roots, held in a ring of recent hashes. If a peer broadcasts a hash that appears in your history, that peer is on a state you have already passed: you are ahead, so no sync is needed. The history depth is tunable: a larger history catches more stale peers without a TCP pull, at the cost of slightly larger gossip messages.
- Originator-only broadcasting: Only the node that made a change broadcasts its hash. Nodes that received state via TreeSync do not re-broadcast. Convergence still happens because gossip is probabilistic (random peer selection each round) and TreeSync pulls are also periodic.
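The digest comparison described in the list above can be sketched as a three-way decision. This is a minimal illustration, not FortrOS code: `RootHistory`, `classify`, and `fake_root` are hypothetical names, a plain SHA-256 of the state bytes stands in for a real Merkle root, and the history is modelled as a local fixed-depth ring.

```python
import hashlib
from collections import deque


def fake_root(state: bytes) -> str:
    """Stand-in for a Merkle root: a plain SHA-256 over the whole state."""
    return hashlib.sha256(state).hexdigest()


class RootHistory:
    def __init__(self, depth=8):
        self.current = None
        self.past = deque(maxlen=depth)  # ring: oldest root drops off when full

    def advance(self, new_root):
        """Record a local state change."""
        if self.current is not None:
            self.past.append(self.current)
        self.current = new_root

    def classify(self, peer_root):
        """Decide what a peer's advertised root means for this node."""
        if peer_root == self.current:
            return "in_sync"
        if peer_root in self.past:
            return "peer_behind"  # we are ahead; no pull needed
        return "pull"             # unknown root; trigger a sync


if __name__ == "__main__":
    history = RootHistory(depth=2)
    for version in (b"v1", b"v2", b"v3", b"v4"):
        history.advance(fake_root(version))
    print(history.classify(fake_root(b"v3")))  # recent root we already passed
    print(history.classify(fake_root(b"v1")))  # fell out of the ring
```

A root that has aged out of the ring is indistinguishable from a genuinely new one, which is why a deeper history avoids some unnecessary pulls.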
How FortrOS Uses It
The maintainer agent runs a SWIM gossip loop that:
- Detects peer failures (ping/ping-req every few seconds)
- Broadcasts state tree hash digests
- Triggers TreeSync pulls when hashes mismatch
- Reports membership changes (join/leave) to the rest of the system
Gossip runs over the WireGuard overlay, so all messages are encrypted in transit by the WireGuard tunnel.
Zone-level failure detection: SWIM detects individual node failures, but the Topology Map aggregates these into higher-level events. If all nodes in a topological zone become unreachable simultaneously, that's likely a network partition or infrastructure failure (switch died, power outage), not multiple independent node failures. The admin interface surfaces zone-level events distinctly from individual node failures -- "zone-3 unreachable" calls for a different response than "node-47 down."
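The aggregation rule can be sketched as follows. This is a hypothetical illustration rather than the Topology Map's actual logic: `aggregate_failures`, the event tuples, and the `zone_of` mapping are assumed names, and the "simultaneously" condition is simplified to a single snapshot of unreachable nodes.

```python
from collections import defaultdict


def aggregate_failures(zone_of, unreachable):
    """Collapse per-node failures into zone events where a whole zone is gone."""
    unreachable = set(unreachable)
    members = defaultdict(set)
    for node, zone in zone_of.items():
        members[zone].add(node)

    events = []
    for zone in sorted(members):
        down = members[zone] & unreachable
        if down and down == members[zone]:
            # Every node in the zone is unreachable: report one zone event.
            events.append(("zone_unreachable", zone))
        else:
            events.extend(("node_down", node) for node in sorted(down))
    return events


if __name__ == "__main__":
    zone_of = {
        "node-1": "zone-1", "node-2": "zone-1",
        "node-46": "zone-3", "node-47": "zone-3",
    }
    print(aggregate_failures(zone_of, {"node-1", "node-46", "node-47"}))
```

Note that a zone with a single node would always be reported at zone level under this rule, so a real implementation would likely need a minimum zone size or a time window.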
Alternatives
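Raft/Paxos consensus: Strong consistency, but requires a majority of nodes to be reachable. The minority side of a network partition cannot make progress, so consensus is a poor fit for membership tracking that must keep working during partitions.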
All-to-all heartbeats: Every node sends heartbeats to every other node. Simple but O(N^2) traffic. Doesn't scale.
Central registry (etcd, ZooKeeper): Strong consistency and ordered operations, but the registry is a critical dependency that can act as a single point of failure: even with replication, a majority of registry replicas must be up and reachable for it to accept updates.
Links
- SWIM Paper (PDF)
- foca -- Rust SWIM implementation
- Serf -- HashiCorp's gossip-based membership tool