09 Running Workloads
The Problem
The org is formed, state is converging (08 Cluster Formation). Now the cluster needs to do something useful: run applications. A database, a web server, the org's build system, user desktops.
But running applications on a cluster is harder than running them on a single machine. You need:
- Isolation: One application can't crash or compromise another
- Scheduling: Which machine runs which workload?
- Encryption: Each service's data should be encrypted with its own key
- Declaration: Describe WHAT should run, not HOW to start it
- Convergence: If reality doesn't match the declaration, fix it automatically
Process Isolation
Namespaces and Cgroups
Linux provides two mechanisms for isolating processes:
Namespaces give a process its own view of the system. A process in a PID namespace sees itself as PID 1 -- it can't see or signal processes outside its namespace. There are several namespace types:
| Namespace | Isolates |
|---|---|
| PID | Process IDs (can't see other processes) |
| Mount | Filesystem mounts (own root filesystem) |
| Network | Network interfaces, IPs, ports |
| UTS | Hostname |
| User | UIDs/GIDs (can be "root" inside but unprivileged outside) |
| IPC | Inter-process communication |
Cgroups (control groups) limit resource usage. A cgroup can restrict: CPU time, memory, disk I/O, number of processes. If a container's cgroup says "max 512MB RAM," the kernel kills processes that exceed it.
Together, namespaces + cgroups = containers. Docker, Podman, and LXC are tooling around these kernel primitives. FortrOS uses them directly: unshare + pivot_root + cgroup v2, about 200 lines of Rust via the nix crate.
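As a concrete (if simplified) illustration of the cgroup side, here is what a minimal runtime would write into the cgroup v2 interface files -- sketched in Python rather than the Rust the real runtime uses. The function name and limit values are illustrative; the file names (memory.max, cpu.max, pids.max) are the real cgroup v2 interface.

```python
from pathlib import PurePosixPath

CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup")

def cgroup_writes(name: str, mem_mb: int, cpu_pct: int, pids: int) -> dict[str, str]:
    """Map cgroup v2 interface files to the values a minimal runtime
    would write for a container named `name` (illustrative helper)."""
    base = CGROUP_ROOT / name
    period = 100_000  # cpu.max period in microseconds (kernel default)
    return {
        str(base / "memory.max"): str(mem_mb * 1024 * 1024),
        # cpu.max holds "quota period": "50000 100000" caps the group at 50% of one CPU
        str(base / "cpu.max"): f"{cpu_pct * period // 100} {period}",
        str(base / "pids.max"): str(pids),
    }

writes = cgroup_writes("monitoring", mem_mb=512, cpu_pct=50, pids=256)
for path, value in writes.items():
    print(path, "<-", value)
```

Writing these files requires root and a mounted cgroup2 filesystem; the sketch only composes the paths and values.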
Why not Docker, containerd, or LXC? This is the first question anyone experienced with containers will ask. The answer:
- FortrOS containers are simple and known. Org containers (tier 2) run trusted org infrastructure: monitoring, DNS, internal services. They're not arbitrary images pulled from Docker Hub. They're known binaries from org shard storage. Docker's image layer system, registry protocol, pull machinery, and image verification are designed for running untrusted code from the internet -- FortrOS doesn't need any of that.
- Docker's features duplicate FortrOS's. Networking? FortrOS has WireGuard. Storage drivers? FortrOS has dm-crypt and shard storage. Orchestration? FortrOS has the reconciler. Image distribution? FortrOS has erasure-coded org storage. Adding Docker would mean running a parallel infrastructure stack that duplicates what FortrOS already provides.
- Attack surface. The Docker daemon runs as root, with a codebase of over 200K lines and a history of CVEs. containerd is smaller but still substantial. FortrOS's container runtime is ~200 lines. For an OS where the base image IS the security boundary, every dependency in the trust anchor matters.
- No daemon dependency. Docker requires dockerd -- if the daemon crashes, all containers lose their parent. FortrOS's containers are supervised directly by s6, the same process supervisor that manages everything else.
LXC is the closest to what FortrOS does (raw namespace management, no required daemon). But LXC brings liblxc, configuration machinery, and assumptions about rootfs layout that don't match FortrOS's immutable image model. For the narrow use case of "run a few trusted org binaries in isolated namespaces," ~200 lines of purpose-built Rust is simpler than adapting a general-purpose framework.
For workloads that need real isolation from the host (org VMs, user VMs), FortrOS uses a proper hypervisor (cloud-hypervisor + KVM), not containers.
VMs: Stronger Isolation
Containers share the host kernel. A kernel exploit in a container can compromise the host. Virtual machines are stronger: each VM runs its own kernel on virtual hardware. The hypervisor (KVM + cloud-hypervisor) provides the isolation boundary.
Trust Boundaries: The Real Design Principle
FortrOS uses both containers and VMs, but the choice isn't about workload weight -- it's about trust relative to the host. The design principle: getting root on a node should NOT give you admin access to the org.
| Category | Trust relationship to host | Isolation | Example |
|---|---|---|---|
| Node services | IS the host (fully trusted) | Process only (s6-rc) | maintainer, key-service |
| Org containers | Isolated from each other, but host CAN access (shared kernel) | Namespaces + cgroups | monitoring, DNS |
| Org VMs | Isolated FROM the host (hypervisor boundary) | KVM (separate kernel) | build VM, database |
| User VMs | Isolated from host AND org (user's private space) | KVM + optional SEV-SNP | desktops, dev environments |
The critical boundary is between node services and everything else. A compromised maintainer (node-level root) should not be able to read the key service's secrets if the key service runs in a VM. With SEV-SNP (AMD hardware memory encryption), the host literally cannot read VM memory, even with root.
Even services on the same physical machine communicate through WireGuard loopback -- conn_auth applies between co-located services. The key service listens only on its WireGuard overlay address. A compromised process on the host can't bypass network authentication to reach it.
Per-Service Encryption
Every org service gets its own encrypted scratch volume via dm-crypt:
- Create a sparse file (only uses disk space as written)
- Set up a loop device
- Format with LUKS2
- Open the LUKS container and format it with ext4
- Mount as the service's data directory
The encryption key comes from the key service: HKDF(master_key, service_name). Each service's data is encrypted with a unique derived key. Compromise one service's key and the others are safe. If the node reboots, scratch volumes are recreated (they're ephemeral).
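The derivation can be sketched with nothing but the standard library. This assumes HKDF-SHA256 with an empty salt (per RFC 5869); FortrOS's actual hash and salt choices aren't specified here.

```python
import hashlib
import hmac

def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    # Extract: PRK = HMAC(salt, IKM); salt defaults to HashLen zero bytes
    prk = hmac.new(b"\x00" * 32, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || byte(i)), concatenated
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]), hashlib.sha256).digest()
        okm += block
    return okm[:length]

master = b"\x01" * 32  # placeholder master key for illustration
k_db = hkdf_sha256(master, b"my-database")
k_dns = hkdf_sha256(master, b"dns")
assert k_db != k_dns and len(k_db) == 32
```

Because the derivation is deterministic, the key service never stores per-service keys; it re-derives them from the master key and the service name on demand.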
The Reconciler
The reconciler is the core of workload management. It's a level-triggered reconciliation loop -- the same pattern used by Kubernetes controllers:
```
loop:
    desired  = read WorkloadDesired CRDT (from maintainer via IPC)
    observed = read local workload status
    diff     = desired - observed
    for each workload in diff:
        if should_be_running and not_running:  start it
        if should_not_be_running and running:  stop it
        if running but wrong config:           restart with new config
    report observed state back to maintainer
    sleep 10 seconds
```
Level-triggered means: the reconciler doesn't care about events ("workload X was just created"). It cares about state ("workload X should be running but isn't"). Think of a thermostat: it doesn't track events (someone opened a window, the oven is on). It reads the current temperature. Too cold? Heat on. Too hot? Heat off. If the thermostat reboots and misses the "window opened" event, it still works -- it reads the temperature and acts. The reconciler is the same: read desired state, read actual state, fix the difference. Resilient to missed events, duplicated messages, and crashes.
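The diff step at the heart of the loop is a pure function of two maps. A minimal sketch -- the names and the shape of a workload spec are illustrative, not FortrOS's actual types:

```python
def reconcile(desired: dict[str, dict], observed: dict[str, dict]) -> list[tuple[str, str]]:
    """Level-triggered diff: compare desired vs observed state and emit
    the actions that would make them match. No events, no history."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("start", name))
        elif observed[name] != spec:
            actions.append(("restart", name))  # running, but with stale config
    for name in observed:
        if name not in desired:
            actions.append(("stop", name))
    return actions

desired = {"dns": {"image": "org:dns:2"}, "db": {"image": "org:postgres:16"}}
observed = {"dns": {"image": "org:dns:1"}, "old-svc": {"image": "org:x:1"}}
print(reconcile(desired, observed))
# → [('restart', 'dns'), ('start', 'db'), ('stop', 'old-svc')]
```

Because the function only looks at current state, running it twice in a row is harmless -- the second pass finds nothing to do. That idempotence is what makes the loop resilient to crashes and duplicated messages.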
The reconciler communicates with the maintainer via IPC (localhost:7208, postcard-encoded frames). The maintainer owns all replicated state (CRDTs). The reconciler is a pure client -- no CRDT code, no gossip, no networking.
How Others Do It
Kubernetes: Full Orchestration
Kubernetes has controllers for every resource type (Deployments, StatefulSets, DaemonSets), a scheduler for placement decisions, and kubelets on each node that manage pods. The architecture is mature but complex: etcd for state, API server for access, scheduler for placement, controller-manager for reconciliation, kubelet for execution.
Strength: Massive ecosystem, battle-tested, handles everything. Weakness: Heavy (the control plane is multiple processes), requires an etcd quorum, designed for hundreds or thousands of containers, not VMs.
Nomad: Simpler Orchestration
HashiCorp Nomad uses a simpler model: a single binary handles both scheduling and execution. Supports containers, VMs, and raw processes. Uses Raft for consensus but is lighter than Kubernetes.
Strength: Simpler to operate, multi-workload-type support. Weakness: Smaller ecosystem, still requires Raft majority.
Proxmox: VM Management
Proxmox is a VM management platform using KVM/QEMU. Cluster state is managed via corosync (quorum-based). Web UI for VM creation, migration, and monitoring.
Strength: Mature VM management, good UI. Weakness: Quorum-based (cluster of 3 required), not designed for containers, no declarative workload model.
The Tradeoffs
| Feature | Kubernetes | Nomad | Proxmox | FortrOS |
|---|---|---|---|---|
| Workload types | Containers | Containers, VMs, processes | VMs | Trust-based (node services, org containers, org VMs, user VMs) |
| Consensus | etcd (Raft) | Raft | corosync (quorum) | CRDTs (no quorum) |
| State model | API server + etcd | Nomad server + Raft | corosync | Maintainer + gossip |
| Declarative | Yes (YAML manifests) | Yes (HCL job files) | Partial (UI-driven) | Yes (TOML manifests) |
| Per-service encryption | No (application responsibility) | No | No | Yes (dm-crypt per service) |
How FortrOS Does It
Declarative Workloads
Workloads are defined as TOML manifests applied via CLI:
```toml
name = "my-database"
tier = "org-vm"
image = "org:postgres:16"
replicas = 1
cpus = 2
memory_mb = 2048
disk_mb = 10240
network = "overlay"
```
maintainer apply /path/to/manifest.toml writes the spec to the WorkloadDesired CRDT. Gossip propagates it. Reconcilers on eligible nodes pick it up and start the workload.
VMM: cloud-hypervisor
For tier 3/4 VMs, FortrOS uses cloud-hypervisor -- a Rust-native VMM (Virtual Machine Monitor) with a REST API, live migration, and SEV-SNP/TDX support. The reconciler:
- Assembles the VM's disk (base image from org storage + qcow2 COW overlay)
- Creates a TAP interface for the specified network profile
- Starts cloud-hypervisor with the VM configuration
- Monitors health via the REST API
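Starting the VM amounts to sending a JSON config to cloud-hypervisor's API socket. A sketch of building that payload from a manifest -- the field names follow cloud-hypervisor's vm.create schema as I understand it, but treat both them and the helper's signature as illustrative:

```python
import json

def vm_config(manifest: dict, overlay_disk: str, tap: str) -> str:
    """Build a cloud-hypervisor-style vm.create payload from a workload
    manifest. Field names are illustrative -- check the API spec."""
    cfg = {
        "cpus": {"boot_vcpus": manifest["cpus"], "max_vcpus": manifest["cpus"]},
        "memory": {"size": manifest["memory_mb"] * 1024 * 1024},
        "disks": [{"path": overlay_disk}],  # qcow2 COW overlay over the base image
        "net": [{"tap": tap}],              # TAP created for the network profile
    }
    return json.dumps(cfg)

payload = vm_config({"cpus": 2, "memory_mb": 2048}, "/var/lib/vms/db.qcow2", "tap-db0")
print(payload)
```

The actual call would go over cloud-hypervisor's Unix-socket REST API; the sketch stops at composing the request body.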
VM Networking
VMs declare a network profile; FortrOS handles the plumbing:
| Profile | What the VM sees | What FortrOS creates |
|---|---|---|
| overlay | WG mesh interface | TAP bridged to WireGuard overlay |
| direct | Host network bridge | TAP bridged to physical NIC |
| isolated | Loopback only | No network plumbing |
The VM never configures networking. Interfaces appear via DHCP or static config from the manifest.
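The host-side plumbing per profile can be sketched as the iproute2 commands a reconciler might issue. The bridge names (br-overlay, br-lan) and the TAP naming scheme are assumptions; the command syntax is standard iproute2:

```python
def plumbing(profile: str, vm: str) -> list[str]:
    """Return the host-side commands for a network profile.
    Nothing is executed here -- this only composes the commands."""
    tap = f"tap-{vm}"
    if profile == "overlay":
        return [f"ip tuntap add {tap} mode tap",
                f"ip link set {tap} master br-overlay",  # bridge onto the WG overlay
                f"ip link set {tap} up"]
    if profile == "direct":
        return [f"ip tuntap add {tap} mode tap",
                f"ip link set {tap} master br-lan",      # bridge onto the physical NIC
                f"ip link set {tap} up"]
    if profile == "isolated":
        return []                                        # loopback only, nothing to create
    raise ValueError(f"unknown network profile {profile!r}")

print(plumbing("overlay", "db"))
```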
Confidential VMs
On AMD EPYC processors with SEV-SNP (or Intel with TDX), VMs can run with hardware memory encryption. The host cannot read VM memory -- the CPU encrypts it transparently with per-VM keys managed by the AMD PSP. FortrOS detects SEV-SNP/TDX at runtime and enables it when:
- The hardware supports it
- The org policy requires or allows it
- The workload tier requests it
Stage Boundary
What This Stage Produces
After workload reconciliation:
- Containers and VMs are running per the desired state
- Each service has its own encrypted scratch volume
- Observed workload state is gossiped to the org (tree 0x04)
- VMs have network connectivity per their profile
What Is Handed Off
The running workloads ARE the org's purpose. The final chapter -- 10 Sustaining the Org -- covers how to keep everything running through upgrades, configuration changes, and failures.
Live Migration
VMs can be migrated between hosts while running. The reconciler on the source node coordinates with the reconciler on the destination via authenticated TCP over WireGuard. cloud-hypervisor's native send/receive-migration API handles the actual state transfer: memory pages are copied iteratively (dirty pages re-sent until convergence), then the VM is paused, final state transferred, and resumed on the destination. The workload's overlay network address follows it -- WireGuard peer entries update via gossip.
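The iterative pre-copy described above can be modeled as a simple convergence loop -- a conceptual simulation of the pattern, not cloud-hypervisor's actual transfer logic:

```python
def precopy_rounds(total_pages: int, dirty_rate: float, threshold: int,
                   max_rounds: int = 30) -> int:
    """Simulate iterative pre-copy: each round re-sends the pages the guest
    dirtied during the previous round; pause once the dirty set is small."""
    dirty = total_pages                  # round 0 copies all memory
    for rounds in range(1, max_rounds + 1):
        # while `dirty` pages were in flight, the guest re-dirtied a fraction
        dirty = int(dirty * dirty_rate)
        if dirty <= threshold:
            return rounds                # pause, send the final pages, resume on dest
    return max_rounds                    # never converged; would force stop-and-copy

# 1 GiB of 4 KiB pages, guest re-dirties 10% of what each round transfers
print(precopy_rounds(262144, 0.10, threshold=512))  # → 3
```

The model makes the tradeoff visible: a write-heavy guest (high dirty rate) never converges, which is why real migrations fall back to a brief stop-and-copy or throttle the guest.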
Migration is triggered by the scoring/placement system: if a node is being drained for upgrade, or if a better-scored host becomes available, the reconciler moves VMs to maintain the desired placement.
What This Stage Does NOT Do
- It does not handle user-facing profiles (COW/WAL sync for roaming desktops)
- It does not handle per-app streaming (Vulkan-to-WebRTC for heavy apps on thin clients)
Further Reading
Concepts:
- Namespaces and Cgroups -- Linux process isolation primitives
- dm-crypt -- Per-service encryption
- Erasure Coding -- How org storage protects data
- Content-Addressed Storage -- How org storage stores data
- Service Architecture -- Silo'd design and workload tiers
- Client Profiles and Roaming -- User VM roaming and COW/WAL sync
- App Streaming -- Per-window streaming for thin clients
Hardware:
- KVM -- The hypervisor for VMs
Services:
- Reconciler -- Level-triggered workload lifecycle
- Key Service -- Per-service key derivation