09 Running Workloads
The Problem
The org is formed, state is converging (08 Cluster Formation). Now the cluster needs to do something useful: run applications. A database, a web server, the org's build system, user desktops.
But running applications on a cluster is harder than running them on a single machine. You need:
- Isolation: One application can't crash or compromise another
- Scheduling: Which machine runs which workload?
- Encryption: Each service's data should be encrypted with its own key
- Declaration: Describe WHAT should run, not HOW to start it
- Convergence: If reality doesn't match the declaration, fix it automatically
Process Isolation
Namespaces and Cgroups
Linux provides two mechanisms for isolating processes:
Namespaces give a process its own view of the system. A process in a PID namespace sees itself as PID 1 -- it can't see or signal processes outside its namespace. There are several namespace types:
| Namespace | Isolates |
|---|---|
| PID | Process IDs (can't see other processes) |
| Mount | Filesystem mounts (own root filesystem) |
| Network | Network interfaces, IPs, ports |
| UTS | Hostname |
| User | UIDs/GIDs (can be "root" inside but unprivileged outside) |
| IPC | Inter-process communication |
Cgroups (control groups) limit resource usage. A cgroup can restrict: CPU time, memory, disk I/O, number of processes. If a container's cgroup says "max 512MB RAM," the kernel kills processes that exceed it.
Together, namespaces + cgroups = containers. Docker, Podman, and LXC are tooling around these kernel primitives. FortrOS uses them directly: unshare + pivot_root + cgroup v2, about 200 lines of Rust via the nix crate.
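As a concrete (if simplified) illustration of the cgroup side, here is what a minimal runtime would write into the cgroup v2 interface files -- sketched in Python rather than the Rust the real runtime uses. The function name and limit values are illustrative; the file names (memory.max, cpu.max, pids.max) are the real cgroup v2 interface.

```python
from pathlib import PurePosixPath

CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup")

def cgroup_writes(name: str, mem_mb: int, cpu_pct: int, pids: int) -> dict[str, str]:
    """Map cgroup v2 interface files to the values a minimal runtime
    would write for a container named `name` (illustrative helper)."""
    base = CGROUP_ROOT / name
    period = 100_000  # cpu.max period in microseconds (kernel default)
    return {
        str(base / "memory.max"): str(mem_mb * 1024 * 1024),
        # cpu.max holds "quota period": "50000 100000" caps the group at 50% of one CPU
        str(base / "cpu.max"): f"{cpu_pct * period // 100} {period}",
        str(base / "pids.max"): str(pids),
    }

writes = cgroup_writes("monitoring", mem_mb=512, cpu_pct=50, pids=256)
for path, value in writes.items():
    print(path, "<-", value)
```

Writing these files requires root and a mounted cgroup2 filesystem; the sketch only composes the paths and values.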
Why not Docker, containerd, or LXC? This is the first question anyone experienced with containers will ask. The answer:
- FortrOS containers are simple and known. Org containers (tier 2) run trusted org infrastructure: monitoring, DNS, internal services. They're not arbitrary images pulled from Docker Hub. They're known binaries from org shard storage. Docker's image layer system, registry protocol, pull machinery, and image verification are designed for running untrusted code from the internet -- FortrOS doesn't need any of that.
- Docker's features duplicate FortrOS's. Networking? FortrOS has WireGuard. Storage drivers? FortrOS has dm-crypt and shard storage. Orchestration? FortrOS has the reconciler. Image distribution? FortrOS has erasure-coded org storage. Adding Docker would mean running a parallel infrastructure stack that duplicates what FortrOS already provides.
- Attack surface. The Docker daemon runs as root, with a codebase of over 200K lines and a history of CVEs. containerd is smaller but still substantial. FortrOS's container runtime is ~200 lines. For an OS where the base image IS the security boundary, every dependency in the trust anchor matters.
- No daemon dependency. Docker requires dockerd -- if the daemon crashes, all containers lose their parent. FortrOS's containers are supervised directly by s6, the same process supervisor that manages everything else.
LXC is the closest to what FortrOS does (raw namespace management, no required daemon). But LXC brings liblxc, configuration machinery, and assumptions about rootfs layout that don't match FortrOS's immutable image model. For the narrow use case of "run a few trusted org binaries in isolated namespaces," ~200 lines of purpose-built Rust is simpler than adapting a general-purpose framework.
For workloads that need real isolation from the host (org VMs, user VMs), FortrOS uses a proper hypervisor (cloud-hypervisor + KVM), not containers.
VMs: Stronger Isolation
Containers share the host kernel. A kernel exploit in a container can compromise the host. Virtual machines are stronger: each VM runs its own kernel on virtual hardware. The hypervisor (KVM + cloud-hypervisor) provides the isolation boundary.
Trust Boundaries: The Real Design Principle
FortrOS uses both containers and VMs, but the choice isn't about workload weight -- it's about trust relative to the host. The design principle: getting root on a node should NOT give you admin access to the org.
| Category | Trust relationship to host | Isolation | Example |
|---|---|---|---|
| Node services | IS the host (fully trusted) | Process only (s6-rc) | maintainer, key-service |
| Org containers | Isolated from each other, but host CAN access (shared kernel) | Namespaces + cgroups | monitoring, DNS |
| Org VMs | Isolated FROM the host (hypervisor boundary) | KVM (separate kernel) | build VM, database |
| User VMs | Isolated from host AND org (user's private space) | KVM + optional SEV-SNP | desktops, dev environments |
The critical boundary is between node services and everything else. A compromised maintainer (node-level root) should not be able to read the key service's secrets if the key service runs in a VM. With SEV-SNP (AMD hardware memory encryption), the host literally cannot read VM memory, even with root.
Even services on the same physical machine communicate through WireGuard loopback -- conn_auth applies between co-located services. The key service listens only on its WireGuard overlay address. A compromised process on the host can't bypass network authentication to reach it.
Per-Service Encryption
Every org service gets its own encrypted scratch volume via dm-crypt:
- Create a sparse file (only uses disk space as written)
- Set up a loop device
- Format with LUKS2
- Open the LUKS container and format it with ext4
- Mount as the service's data directory
The encryption key comes from the key service: HKDF(master_key, service_name). Each service's data is encrypted with a unique derived key. Compromise one service's key and the others are safe. If the node reboots, scratch volumes are recreated (they're ephemeral).
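The derivation can be sketched with nothing but the standard library. This assumes HKDF-SHA256 with an empty salt (per RFC 5869); FortrOS's actual hash and salt choices aren't specified here.

```python
import hashlib
import hmac

def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    # Extract: PRK = HMAC(salt, IKM); salt defaults to HashLen zero bytes
    prk = hmac.new(b"\x00" * 32, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || byte(i)), concatenated
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

master = b"\x01" * 32  # placeholder master key for illustration
k_db = hkdf_sha256(master, b"my-database")
k_dns = hkdf_sha256(master, b"dns")
assert k_db != k_dns and len(k_db) == 32
```

Because the derivation is deterministic, the key service never stores per-service keys; it re-derives them from the master key and the service name on demand.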
The Reconciler
The reconciler is the core of workload management. It's a level-triggered reconciliation loop -- the same pattern used by Kubernetes controllers:
```
loop:
    desired  = read WorkloadDesired CRDT (from maintainer via IPC)
    observed = read local workload status
    diff     = desired - observed
    for each workload in diff:
        if should_be_running and not_running:  start it
        if should_not_be_running and running:  stop it
        if running but wrong config:           restart with new config
    report observed state back to maintainer
    sleep 10 seconds
```
Level-triggered means: the reconciler doesn't care about events ("workload X was just created"). It cares about state ("workload X should be running but isn't"). Think of a thermostat: it doesn't track events (someone opened a window, the oven is on). It reads the current temperature. Too cold? Heat on. Too hot? Heat off. If the thermostat reboots and misses the "window opened" event, it still works -- it reads the temperature and acts. The reconciler is the same: read desired state, read actual state, fix the difference. Resilient to missed events, duplicated messages, and crashes.
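The diff step at the heart of the loop is a pure function of two maps. A minimal sketch -- the names and the shape of a workload spec are illustrative, not FortrOS's actual types:

```python
def reconcile(desired: dict[str, dict], observed: dict[str, dict]) -> list[tuple[str, str]]:
    """Level-triggered diff: compare desired vs observed state and emit
    the actions that would make them match. No events, no history."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("start", name))
        elif observed[name] != spec:
            actions.append(("restart", name))  # running, but with stale config
    for name in observed:
        if name not in desired:
            actions.append(("stop", name))
    return actions

desired = {"dns": {"image": "org:dns:2"}, "db": {"image": "org:postgres:16"}}
observed = {"dns": {"image": "org:dns:1"}, "old-svc": {"image": "org:x:1"}}
print(reconcile(desired, observed))
# → [('restart', 'dns'), ('start', 'db'), ('stop', 'old-svc')]
```

Because the function only looks at current state, running it twice in a row is harmless -- the second pass finds nothing to do. That idempotence is what makes the loop resilient to crashes and duplicated messages.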
The reconciler communicates with the maintainer via IPC (localhost:7208, postcard-encoded frames). The maintainer owns all replicated state (CRDTs). The reconciler is a pure client -- no CRDT code, no gossip, no networking.
How Others Do It
Kubernetes: Full Orchestration
Kubernetes has controllers for every resource type (Deployments, StatefulSets, DaemonSets), a scheduler for placement decisions, and kubelets on each node that manage pods. The architecture is mature but complex: etcd for state, API server for access, scheduler for placement, controller-manager for reconciliation, kubelet for execution.
Strength: Massive ecosystem, battle-tested, handles everything. Weakness: Heavy (the control plane is multiple processes), requires an etcd quorum, designed for hundreds or thousands of containers, not VMs.
Nomad: Simpler Orchestration
HashiCorp Nomad uses a simpler model: a single binary handles both scheduling and execution. Supports containers, VMs, and raw processes. Uses Raft for consensus but is lighter than Kubernetes.
Strength: Simpler to operate, multi-workload-type support. Weakness: Smaller ecosystem, still requires Raft majority.
Proxmox: VM Management
Proxmox is a VM management platform using KVM/QEMU. Cluster state is managed via corosync (quorum-based). Web UI for VM creation, migration, and monitoring.
Strength: Mature VM management, good UI. Weakness: Quorum-based (cluster of 3 required), not designed for containers, no declarative workload model.
The Tradeoffs
| Feature | Kubernetes | Nomad | Proxmox | FortrOS |
|---|---|---|---|---|
| Workload types | Containers | Containers, VMs, processes | VMs | Trust-based (node services, org containers, org VMs, user VMs) |
| Consensus | etcd (Raft) | Raft | corosync (quorum) | CRDTs (no quorum) |
| State model | API server + etcd | Nomad server + Raft | corosync | Maintainer + gossip |
| Declarative | Yes (YAML manifests) | Yes (HCL job files) | Partial (UI-driven) | Yes (TOML manifests) |
| Per-service encryption | No (application responsibility) | No | No | Yes (dm-crypt per service) |
How FortrOS Does It
Declarative Workloads
Workloads are defined as TOML manifests applied via CLI:
```toml
name = "my-database"
tier = "org-vm"
image = "org:postgres:16"
replicas = 1
cpus = 2
memory_mb = 2048
disk_mb = 10240
network = "overlay"
```
maintainer apply /path/to/manifest.toml writes the spec to the WorkloadDesired CRDT. Gossip propagates it. Reconcilers on eligible nodes pick it up and start the workload.
VMM: cloud-hypervisor
For tier 3/4 VMs, FortrOS uses cloud-hypervisor -- a Rust-native VMM (Virtual Machine Monitor) with a REST API, live migration, and SEV-SNP/TDX support. The reconciler:
- Assembles the VM's disk (base image from org storage + qcow2 COW overlay)
- Creates a TAP interface for the specified network profile
- Starts cloud-hypervisor with the VM configuration
- Monitors health via the REST API
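Starting the VM amounts to sending a JSON config to cloud-hypervisor's API socket. A sketch of building that payload from a manifest -- the field names follow cloud-hypervisor's vm.create schema as I understand it, but treat both them and the helper's signature as illustrative:

```python
import json

def vm_config(manifest: dict, overlay_disk: str, tap: str) -> str:
    """Build a cloud-hypervisor-style vm.create payload from a workload
    manifest. Field names are illustrative -- check the API spec."""
    cfg = {
        "cpus": {"boot_vcpus": manifest["cpus"], "max_vcpus": manifest["cpus"]},
        "memory": {"size": manifest["memory_mb"] * 1024 * 1024},
        "disks": [{"path": overlay_disk}],  # qcow2 COW overlay over the base image
        "net": [{"tap": tap}],              # TAP created for the network profile
    }
    return json.dumps(cfg)

payload = vm_config({"cpus": 2, "memory_mb": 2048}, "/var/lib/vms/db.qcow2", "tap-db0")
print(payload)
```

The actual call would go over cloud-hypervisor's Unix-socket REST API; the sketch stops at composing the request body.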
VM Networking
VMs declare a network profile; FortrOS handles the plumbing:
| Profile | What the VM sees | What FortrOS creates |
|---|---|---|
| overlay | WG mesh interface | TAP bridged to WireGuard overlay |
| direct | Host network bridge | TAP bridged to physical NIC |
| isolated | Loopback only | No network plumbing |
The VM never configures networking. Interfaces appear via DHCP or static config from the manifest.
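The host-side plumbing per profile can be sketched as the iproute2 commands a reconciler might issue. The bridge names (br-overlay, br-lan) and the TAP naming scheme are assumptions; the command syntax is standard iproute2:

```python
def plumbing(profile: str, vm: str) -> list[str]:
    """Return the host-side commands for a network profile.
    Nothing is executed here -- this only composes the commands."""
    tap = f"tap-{vm}"
    if profile == "overlay":
        return [f"ip tuntap add {tap} mode tap",
                f"ip link set {tap} master br-overlay",  # bridge onto the WG overlay
                f"ip link set {tap} up"]
    if profile == "direct":
        return [f"ip tuntap add {tap} mode tap",
                f"ip link set {tap} master br-lan",      # bridge onto the physical NIC
                f"ip link set {tap} up"]
    if profile == "isolated":
        return []                                        # loopback only, nothing to create
    raise ValueError(f"unknown network profile {profile!r}")

print(plumbing("overlay", "db"))
```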
Confidential VMs
On AMD EPYC processors with SEV-SNP (or Intel with TDX), VMs can run with hardware memory encryption. The host cannot read VM memory -- the CPU encrypts it transparently with per-VM keys managed by the AMD PSP. FortrOS detects SEV-SNP/TDX at runtime and enables it when:
- The hardware supports it
- The org policy requires or allows it
- The workload tier requests it
Stage Boundary
What This Stage Produces
After workload reconciliation:
- Containers and VMs are running per the desired state
- Each service has its own encrypted scratch volume
- Observed workload state is gossiped to the org (tree 0x04)
- VMs have network connectivity per their profile
What Is Handed Off
The running workloads ARE the org's purpose. The final chapter -- 10 Sustaining the Org -- covers how to keep everything running through upgrades, configuration changes, and failures.
Live Migration
VMs can be migrated between hosts while running. The reconciler on the source node coordinates with the reconciler on the destination via authenticated TCP over WireGuard. cloud-hypervisor's native send/receive-migration API handles the actual state transfer: memory pages are copied iteratively (dirty pages re-sent until convergence), then the VM is paused, final state transferred, and resumed on the destination. The workload's overlay network address follows it -- WireGuard peer entries update via gossip.
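The iterative pre-copy described above can be modeled as a simple convergence loop -- a conceptual simulation of the pattern, not cloud-hypervisor's actual transfer logic:

```python
def precopy_rounds(total_pages: int, dirty_rate: float, threshold: int,
                   max_rounds: int = 30) -> int:
    """Simulate iterative pre-copy: each round re-sends the pages the guest
    dirtied during the previous round; pause once the dirty set is small."""
    dirty = total_pages                  # round 0 copies all memory
    for rounds in range(1, max_rounds + 1):
        # while `dirty` pages were in flight, the guest re-dirtied a fraction
        dirty = int(dirty * dirty_rate)
        if dirty <= threshold:
            return rounds                # pause, send the final pages, resume on dest
    return max_rounds                    # never converged; would force stop-and-copy

# 1 GiB of 4 KiB pages, guest re-dirties 10% of what each round transfers
print(precopy_rounds(262144, 0.10, threshold=512))  # → 3
```

The model makes the tradeoff visible: a write-heavy guest (high dirty rate) never converges, which is why real migrations fall back to a brief stop-and-copy or throttle the guest.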
Migration is triggered by the scoring/placement system: if a node is being drained for upgrade, or if a better-scored host becomes available, the reconciler moves VMs to maintain the desired placement.
What This Stage Does NOT Do
- It does not handle user-facing profiles (COW/WAL sync for roaming desktops)
- It does not handle per-app streaming (Vulkan-to-WebRTC for heavy apps on thin clients)
Further Reading
Concepts:
- Namespaces and Cgroups -- Linux process isolation primitives
- dm-crypt -- Per-service encryption
- Erasure Coding -- How org storage protects data
- Content-Addressed Storage -- How org storage stores data
- Service Architecture -- Silo'd design and workload tiers
- Client Profiles and Roaming -- User VM roaming and COW/WAL sync
- App Streaming -- Per-window streaming for thin clients
Hardware:
- KVM -- The hypervisor for VMs
Services:
- Reconciler -- Level-triggered workload lifecycle
- Key Service -- Per-service key derivation