Luke Angel
A loose mesh of small nodes around two larger rust-orange hub nodes, connected by thin spokes, with a single dashed bridge link crossing through an intermediate ring node. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge.
Notebook · 8 parts · read in order
~72 min total

Building A Distributed Mesh in Rust

I wanted to feel where a self-organizing P2P mesh in Rust breaks before I trusted it with anything. iroh-gossip over QUIC, run small and run hard. The first 18 nodes pegged an 80-core box at 100% CPU; weeks of flamegraphs and soak tests later the same 18 idled at 5%. This notebook is the work as it happened — the wrong hypotheses, the flamegraphs, the leaks (some mine, some upstream — a churn leak I chased for two days turned into fixes across three crates), and then the harder build: a multi-mesh fabric you can actually see.

I wanted to know what it actually costs to run a self-organizing peer-to-peer mesh on commodity hardware — where it strains, where it leaks, and whether I could trust it under sustained churn. The build is Rust: an iroh-gossip layer over QUIC, nodes addressed by public key, NAT traversal handled by the fabric. I built it small — a handful of node types on one box — and ran it hard enough to break.

It broke. The first canary pegged an 80-core box at 100% CPU with 18 nodes sitting idle. Weeks of measuring, theorizing, and being wrong got it down to 5%. Most of the bugs were mine — a tokio::spawn whose task nobody owned, a DashMap that grew forever because nothing was told to prune it. But the one I chased hardest, a connection leak under sustained churn, turned out to live in the stack itself: a select! arm in the gossip layer that silently disabled itself, an address cache with no eviction, a QUIC handshake that leaked when abandoned. Two days of soak-and-profile, and the ending wasn't a workaround — it was a cluster of small fixes submitted upstream to three crates.

This notebook is the work in order, as it happened — the first canary, the flamegraph that found a self-inflicted handshake storm, four days of soak that surfaced five leaks at once, a chaos battery that replaced "tests pass," open-sourcing the result, the two-day churn-leak hunt that went upstream — and then the harder build: turning it into a multi-mesh fabric you can actually operate. Named nodes in named meshes, a gossiped topology cache, cross-mesh delivery with no bespoke bridge, a relay that carries traffic when there's no direct path, and traces honest enough to trust. Every claim in those later posts is backed by a screenshot of the live system.

A small mesh of nodes around a single rust-orange hub, spokes radiating outward, on a cream background with faint dot grid and a vertical rust-orange accent bar at the left edge. Part 01 of 08
Building A Distributed Mesh in Rust · part 01
Apr 17, 2026

Why I'm building a distributed mesh substrate in Rust

Before I build anything on top of a peer-to-peer substrate, I need to know whether the substrate itself is sound. The choice is iroh-gossip over QUIC. The first canary is 18 nodes on one box. Here's what I'm trying to learn and what I expect to break.

What I'm ultimately after is an event-streaming system whose nodes can sit on different networks — different cloud regions, different colos, a laptop on a coffee-shop Wi-Fi — and form a working cluster anyway. Direct-TCP-on-a-VPC isn't enough for that. The transport layer has to handle NAT traversal, identity, mesh formation, and reconnect for me. But the streaming layer is a problem for later; first I need the substrate underneath it to be sound, and this notebook is about the substrate.

I picked iroh for the substrate. QUIC under the hood, relay tier for cross-NAT, hole-punching where it can. The application layer on top is iroh-gossip — Plumtree for eager broadcast, HyParView for membership. The combination gives me a self-organizing peer set without writing any of it myself.

The first thing I'm going to do is run 18 of these on one machine and watch what happens.

Why iroh, not direct TCP

The conventional design does the simplest thing that works: direct TCP between nodes, a metadata store like ZooKeeper or etcd, every node knowing every other by hostname. That works because the nodes live in a single network where hostnames resolve and ports are open.

The deployments I want to support don't look like that. A node behind a residential NAT. A compute box in someone's homelab. A gateway in AWS. They can't reach each other on direct TCP. They CAN reach each other via QUIC + a relay, with hole-punching closing the hop where possible. That's exactly what iroh does, and I'd rather use a maintained library than build it.

What iroh buys me concretely:

  • Identity from a keypair, not a hostname. Every node has an Ed25519 keypair; the "address" is the public key. The same key from any IP works.
  • NAT traversal. STUN-style probing, hole-punching, relay fallback. I don't have to think about it.
  • QUIC transport. Multiplexed streams over one connection, 0-RTT reconnect, no head-of-line blocking between streams. Better defaults than tuning TCP.
  • Discovery primitives. mDNS for LAN, DHT for internet-scale. Optional and pluggable.

The cost: more bytes per packet than raw TCP (TLS 1.3 framing + QUIC headers), plus congestion-controller state held per connection. For a substrate that's going to broadcast small heartbeats, that's the right trade.

What iroh-gossip is for, and what it isn't

iroh-gossip runs Plumtree + HyParView on top of iroh's QUIC. Plumtree forms a spanning tree across subscribed peers for eager broadcast; HyParView keeps the per-node active connection set small and roughly constant (~5–7 peers regardless of cluster size). Bitcoin's peer-to-peer network runs a different protocol, but the same principle, a small bounded peer set per node, is how it scales to tens of thousands of nodes without every node connecting to every other.

The thing I want to be careful about is what I broadcast. iroh-gossip is a control-plane primitive — it's designed for "the cluster's membership just changed" or "a new topic appeared," not "here is my full state, every two seconds, forever." Bitcoin doesn't broadcast every node's status every 2 seconds. It broadcasts transactions when they arrive. Big difference.

I have a hunch this is where I'm going to get bit. I'm setting up the gossip emit to fire on a 500ms timer with a GossipDigest that includes peer counts, frame counts, CPU/RAM. That's a lot of state on a fast clock. We'll see.

The architecture, briefly

Three layers, with deliberate separation:

Layer | Crate | Responsibility
Transport | crates/mesh-transport | iroh Endpoint setup, ALPN, bind addr, mDNS toggle
Substrate | crates/mesh-node-base | Identity, peer registry, gossip emit loop, LoadSampler (self-reported CPU/RAM via sysinfo), staleness handling
Telemetry | crates/mesh-telemetry | OTLP/tracing init, every node's spans land in Jaeger

On top, five example node types — broker, gateway, compute, registry, bridge. From the substrate's perspective they're interchangeable; the type is just a string. Each one is a 10-line main.rs that calls NodeRuntime::new("type").run().await and supplies a .env.dev preset for its CPU/RAM budget.

There's also an admin-ui — a React + Vite dashboard that joins the mesh as a passive observer and renders the topology live. Hub-vs-leaf is visible in the layout. Each node card shows its CPU/RAM utilization against its declared budget. That's the surface I'll use to feel what the substrate is doing.

Telemetry is built in from boot. The admin-ui has a Boot Waterfall view that decomposes the spans every node fires on startup: endpoint creation, ALPN registration, gossip subscribe, accept loop. The shape of a healthy boot is four short spans nested under a mesh.node.ready root, sub-millisecond each on this machine. When something goes wrong at boot, you see which span stretched.

Admin-ui Boot Waterfall tab showing a single bridge node's boot timing. The root span mesh.node.ready spans the full timeline at 0.3 ms. Below it, four child spans appear in sequence — mesh.boot.endpoint_created, mesh.boot.alpn_registered, mesh.boot.gossip_started, mesh.boot.accept_loop_started — each rounded to 0.0 ms in the display, meaning each took under 50 microseconds. Dark theme, blue bars on a black background.

What I expect to break

I'm writing this down on purpose so I can be honest about the hypotheses going in:

  1. Gossip volume. At 18 nodes broadcasting every 500ms, that's 36 broadcasts/sec. Plumtree fans each one out across the spanning tree. I expect to find that 500ms is too aggressive for steady-state health.
  2. OTLP overhead. Every span we emit becomes a protobuf-encoded gRPC frame to Jaeger. If I'm not careful about what fires at what level, the telemetry will cost more than the work it's measuring.
  3. Connection accounting. Peers will reconnect over time. I expect there's a bookkeeping bug somewhere — registries that grow without pruning, connections that close without their tasks knowing. I haven't found it yet.

What I don't expect — and would be surprised by — is iroh itself being expensive in some fundamental way. The library is maintained by people who deal with this for a living.

What I'd tell someone starting

  • Pick the substrate first. The choice of transport (raw TCP, gRPC, iroh, libp2p, …) determines everything else. Don't pick the wire format before you've picked the network.
  • Don't broadcast state on a clock unless you've measured what it costs. Most "heartbeat" patterns assume small clusters, small payloads, slow cadences. Half a second with a 200-byte digest at 18 nodes is already 36 broadcasts/sec.
  • Build the dashboard first. Or at least the topology view. You're going to be looking at this thing constantly while you debug it, and a print! doesn't compose into a mesh layout.

What's next

Tomorrow I bootstrap 18 nodes on my workstation and see what they do at idle. The plan: spawn one of each type, then keep spawning until there are 18, let them gossip for a minute, look at the numbers. The next post in this notebook is what those numbers were and what I did about them.

A blood-red CPU bar pegged at 100% utilization on a dashboard tile, set on a cream background with a faint dot grid and a vertical rust-orange accent bar at the left edge. Part 02 of 08
Building A Distributed Mesh in Rust · part 02
Apr 24, 2026

When 18 nodes pegged my 80-core box at 100%

First bootstrap of 18 mesh nodes on an 80-logical-core workstation. Host CPU pegged at 100% the moment the bootstrap finished. Three obvious things to try first — mDNS off, gossip interval up, per-frame INFO spans down — got it from 100% to 35%. Still wrong. The real bug was somewhere else.

Bootstrapped the cluster. 18 nodes — two of each role per mesh, across two meshes, with two bridges. Hit Bootstrap at 14:51 local. Host CPU at 14:52 was 100% across all 80 logical cores. Five samples a second apart, all 100%.

The mesh worked. /api/topology came back with 19/19 live (the admin-ui plus 18 children). Each node's GossipDigest was arriving. The Topology view in the dashboard painted. Bytes were moving. Nothing was crashing.

It was just consuming the entire machine to do that.

What the first reading said

I'd expected ~10% host load. The two Xeon Gold 6148s in this box are not a small machine — 80 logical processors, hyperthreaded across two sockets. A handful of small Rust processes broadcasting 200-byte digests every 500ms should not be pegging the entire system. My mental model of "iroh-gossip at 18 nodes" was Bitcoin-territory: single-digit % per node.

The actual reading per-process via Task Manager:

Node | CPU % | What I expected
mesh-broker.exe × 4 | 0.66–1.52 each | ~0.05
mesh-gateway.exe × 4 | 0.72–1.32 each | ~0.05
mesh-compute.exe × 4 | 0.67–0.86 each | ~0.05
mesh-registry.exe × 4 | 0.63–1.80 each | ~0.05
mesh-bridge.exe × 2 | 0.82–0.83 each | ~0.05
mesh-admin-ui.exe × 1 | 3.44 | ~1.0

Add the column up: ~18% of the 80-logical box, before counting iroh-quinn's kernel-side work. Performance Monitor's \Processor(_Total)\% Processor Time showed 95–100% sustained. Something else was eating the headroom.

Windows Task Manager CPU performance graph from the bootstrap moment. The left two-thirds of the chart show baseline utilization between 5 and 15 percent on a 60-second window. A vertical cliff in the middle of the timeline marks the bootstrap event, after which utilization jumps to 91 percent and stays sustained. The right panel labels the machine as an Intel Xeon Gold 6148 CPU at 2.40 GHz, with 2 sockets, 40 cores, 80 logical processors, 727 processes, 18341 threads, and 91 percent current utilization at 2.67 GHz. Total system RAM 215 of 256 GB. The cliff is the exact moment 18 mesh-node child processes were spawned.

The shape of the cost — every process burning roughly the same amount regardless of how many peers it had — pointed at per-tick work rather than per-peer work. So I started with three suspects that fire on a clock.

Three obvious things to try

1. Turn off mDNS

iroh-mdns was discovering every node on the local network and adding it to the active peer set. On a single box, that means 17 mDNS-announced peers per node — each one getting a QUIC handshake, each one getting added to HyParView's active view (which should be ~5, not 17). I had a hunch that the cluster was forming a full mesh rather than a sparse spanning tree.

MESH_MDNS_ENABLE=false, with explicit seed nodes injected at spawn time (admin-ui as the universal seed, plus 1–2 already-spawned same-mesh peers). Each child boots, dials its seeds, and lets HyParView shape the rest from there.

2. Slow the gossip cadence

MESH_GOSSIP_INTERVAL_MS was 500. At 18 nodes that's 36 broadcasts/sec across the cluster. For a control-plane heartbeat, that's overkill: Kafka's KRaft brokers heartbeat every 2 seconds by default and are fenced only after a 9-second session timeout. There's no reason substrate health needs 2 Hz granularity.

Bumped to 2000ms. 9 broadcasts/sec cluster-wide. Plumtree's lazy-push repair (IHAVE announcements answered by GRAFT requests) handles any actual loss.

3. Demote per-frame INFO spans to TRACE

This one I caught by reading the stdout. Every received gossip digest fired a tracing::info_span!("mesh.gossip.received", ...). With Plumtree's eager-push fanout, each digest arrives at a node ~17 times (once per peer in the active view, before lazy IHAVE deduplicates). At 36 broadcasts/sec × 17 copies, every node was handling roughly 600 INFO-level events per second, on the order of 11,000 per second across 18 nodes.

Each event goes through tracing-subscriber's formatter (stdout write), then tracing-opentelemetry's layer (build a protobuf span, push to the OTLP batch queue, eventually send to Jaeger over gRPC). That's not free.

INFO is for state transitions — peer connected, peer disconnected, gossip topic subscribed. Per-frame events should be TRACE so they're filtered before any of that fires.

// before
tracing::info_span!("mesh.gossip.received", ...)
    .in_scope(|| info!(...));

// after
tracing::trace_span!("mesh.gossip.received", ...)
    .in_scope(|| tracing::trace!(...));

The reading after

Host CPU dropped from 100% to ~35%. Per-node CPU dropped from 0.5–1.5 cores down to roughly 0.25 cores. The log volume cratered — admin-ui's stdout went from 261,000 lines in 50 seconds to about 41,000.

That's a real improvement. It's also still wrong. A self-organizing P2P mesh of 18 idle nodes should not be using a quarter of a logical core per node. Bitcoin nodes idle at single-digit percent of one core, not 25% of one.

The hypothesis I started forming: "iroh-gossip itself is just expensive at this scale, the architecture is wrong, we should pivot to a centralized Controller." I spent two days seriously sketching the Controller architecture — a single coordinator, every node holds one connection to it, hub-and-spoke at the protocol layer. It would have worked. It would have been a much bigger change.

Before I committed to it, I decided to flamegraph the running cluster first.

That's the next post.

What I'd tell a team

  • Symptoms that look like "X is fundamentally too expensive" usually aren't. They're usually "you're doing X on a faster clock than you measured." Slow the clock before you blame the protocol.
  • Comment the cadence on every periodic loop. The 500ms gossip interval was a placeholder I never revisited. A code comment claiming "// every 500ms is fine, gossip is cheap" would have been a lie regardless of intent. Better: don't claim it's fine, claim what you measured.
  • INFO is for state transitions. TRACE is for per-frame events. DEBUG is for the boundary between those two — "useful when investigating, noisy in steady state." If you can't draw the line cleanly, your spans are doing too much.

What's next

The 35% reading was the trap. It looked like I was making progress, and it was real progress, but it convinced me the remaining cost was structural rather than a bug. The next post is the flamegraph that showed me how wrong I was, and the one-commit fix that took the cluster from 35% to 6%.

A wide flamegraph silhouette with one disproportionately tall column on the left highlighted in rust-orange, suggesting one dominant hot function. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge. Part 03 of 08
Building A Distributed Mesh in Rust · part 03
May 01, 2026

Flamegraphing your way out of "this can't possibly be right"

Two days into sketching a centralized-controller rewrite, I took a flamegraph instead. The hottest function in the mesh wasn't anything in the gossip protocol. It was an idempotent peer-join call I was making 10 times a second per peer — generating 3,240 QUIC handshakes per second across the cluster, doing exactly nothing useful.

I was two days into sketching a Controller architecture for the mesh. The reasoning went: 18 nodes at 35% host CPU after the obvious wins meant iroh-gossip itself was just expensive, the architecture was wrong, and the right move was a hub-and-spoke pattern with a coordinator that every node holds one connection to. A central sequencer instead of a peer-to-peer swarm. The model that's supposed to be the safe choice.

Halfway through the spike I noticed I hadn't actually profiled anything. The case for the Controller rewrite was built on a measurement gap, not a measurement. I closed the editor, added tracing-flame = "0.2.0" to the workspace, captured 15 seconds of flame data, and rendered it.

The graph was lopsided in a way that didn't fit my model.

What the flamegraph said

The dominant function in every node's CPU profile was iroh::endpoint::connect. Not Plumtree dissemination. Not packet decode. Not OTLP export. The thing every node was spending most of its time on was opening QUIC connections — at thousands of calls per second, on an 18-node cluster where nothing was disconnecting.

That should not be happening at all.

I went looking for who was calling endpoint.connect() in the hot path. The chain led to a sender.join_peers(peer_ids) call inside run_gossip's 100ms tick loop. There was a comment right next to it claiming:

// Feed mdns-discovered peers to gossip so the swarm forms.
// join_peers is idempotent — calling every tick with the current peer
// registry is cheap.

The comment had one true word and one fatal one. "Idempotent" was true: calling join_peers with the same set of peers leaves the gossip state unchanged. "Cheap" was a lie. Idempotent in outcome says nothing about cost: under the hood, iroh interprets each call as "establish a transport connection to each of these peers," which kicks off a fresh QUIC handshake, TLS 1.3 included. Every time. For every peer. On a 100ms timer.

At 18 nodes with ~5 peers each in the active view, that's 18 × 5 × 10/sec = 900 handshake attempts per second cluster-wide, or 50 per node. If every node's registry holds all 18 other participants instead, it's 18 × 18 × 10/sec = 3,240 cluster-wide. All landing on already-established connections that didn't need re-establishing. All doing exactly nothing useful for the gossip protocol.
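The storm arithmetic is easy to sanity-check. A minimal sketch: handshake_attempts_per_sec is a hypothetical helper, not part of the mesh code, and the 18-peer case assumes every node's registry lists all 18 other participants rather than the ~5 in HyParView's active view.

```rust
/// Cluster-wide handshake attempts per second when every node re-dials
/// `peers_per_node` peers on each tick of `tick_ms` milliseconds.
/// Hypothetical helper for illustration only.
fn handshake_attempts_per_sec(nodes: u64, peers_per_node: u64, tick_ms: u64) -> u64 {
    // A 100 ms tick gives 10 ticks per second. Integer division truncates
    // for ticks that do not divide 1000 evenly, and panics if tick_ms is 0.
    let ticks_per_sec = 1000 / tick_ms;
    nodes * peers_per_node * ticks_per_sec
}
```

With a 100 ms tick, (18, 5, 100) gives 900 and (18, 18, 100) gives 3,240. A single node with five active peers accounts for 50 attempts per second on its own.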

The fix

A HashSet<String> of peers I've already told gossip about. Only call join_peers for entries that aren't in the set.

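use std::{collections::HashSet, str::FromStr};

// Peer keys already handed to gossip; join_peers is only called for new ones.
let mut joined_peers: HashSet<String> = HashSet::new();
loop {
    tokio::select! {
        _ = tick.tick() => {
            let mut new_peers = Vec::new();
            for peer in registry.iter() {
                if !joined_peers.contains(peer.key()) {
                    // Skip keys that don't parse as an endpoint id.
                    if let Ok(id) = iroh::EndpointId::from_str(peer.key()) {
                        new_peers.push(id);
                        joined_peers.insert(peer.key().clone());
                    }
                }
            }
            // Only touch the transport when there is someone new to join.
            if !new_peers.is_empty() {
                let _ = sender.join_peers(new_peers).await;
            }
            // …rest of gossip emit loop
        }
    }
}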

One commit. Rebuild. Re-canary.

Host CPU dropped from 35% to 6%. Per-node steady-state dropped to 0.05–0.10 cores. The mesh kept gossiping, the topology view kept painting, 19/19 stayed live. The Controller rewrite I'd been sketching became unnecessary in the time it took the canary to settle.

Single-mesh topology view from the admin-ui dashboard, dark theme. About eleven nodes arranged in a loose circular layout — registries in purple, gateways in blue, computes in green, brokers in orange, a bridge node at the top in gold. Solid blue lines mark within-mesh peer connections (mesh-a). Dashed yellow lines mark cross-mesh bridge connections. The graph density is sparse and shaped — about 5 to 7 peers per node — not the full N-by-N tangle you'd see from a CPU-storm. This is what HyParView's active view looks like at steady state once join_peers stops re-handshaking every 100ms.

Four more bugs in the same pass

While I was in there, the flamegraph and the diff surfaced four more:

Ghost tasks under #[instrument] on infinite loops. Two background loops (run_ping_sender, watch_mdns) were decorated with #[instrument(skip_all)]. On a normal short-lived async function, that creates a span that closes when the function returns. On an infinite loop, the root span never closes, and every event inside it gets appended to the span's child-event queue forever. OpenTelemetry's batch processor walks that queue on every export tick — over time, the walk is the cost. Removed the #[instrument] macros from the loop functions. Their inner spans still exist.

A busy-wait on a closed channel. Inside another tokio::select!, when the gossip event channel closed, the arm did continue — which made the task immediately re-poll, which immediately yielded None again, on a tight loop with no yield point. Changed to break. The task exits cleanly on channel close instead of spinning at 100% of one core.
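The continue-versus-break difference can be reproduced without tokio. A minimal sketch, assuming std::sync::mpsc as a stand-in for the gossip event channel: here recv() returns Err once every sender is gone, where tokio's receiver would yield None. The names run_event_loop and closed_channel_demo are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Drains events until the channel closes, then returns how many were
/// handled. Writing `continue` in the `Err` arm would retry a dead
/// channel forever; `break` lets the loop end.
fn run_event_loop(rx: Receiver<u32>) -> usize {
    let mut handled = 0;
    loop {
        match rx.recv() {
            Ok(_seq) => handled += 1,
            // Every sender has been dropped: exit instead of spinning.
            Err(_) => break,
        }
    }
    handled
}

/// Sends three events from another thread, then closes the channel.
fn closed_channel_demo() -> usize {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        for seq in 0..3u32 {
            tx.send(seq).expect("receiver is alive");
        }
        // `tx` is dropped here, which closes the channel.
    });
    let handled = run_event_loop(rx);
    producer.join().expect("producer thread panicked");
    handled
}
```

The loop returns after the third event because the closed channel is treated as a terminal state, not as something to poll again.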

Ghost QUIC connections on peer reconnect. When a peer disconnects and reconnects with the same identity, reg.insert(peer_id, new_conn) overwrites the registry entry without closing the previous Connection. iroh-quinn allows multiple parallel connections from the same identity, so the old one just sits there alive. The run_frame_reader and run_bi_echo_reader tasks holding the old Connection keep parking on accept_uni() / accept_bi() forever. Over hours of churn, dozens of dead connections accumulate per node. Fix: close the old connection before overwriting.

if let Some((_, old_conn)) = reg.remove(&peer_id) {
    old_conn.close(0u32.into(), b"superseded by new connection");
}
reg.insert(peer_id.clone(), conn.clone());

Sets that never shrank. joined_peers only grew. mesh_id_registry only inserted, never removed on disconnect. Both were minor compared to the join_peers storm, but both are real leaks over a long-running cluster. Added joined_peers.retain(|p| registry.contains_key(p)) on each tick, and mesh_id_registry.remove(&peer_id_str) next to the existing registry.remove in the disconnect handler.

What I'd tell a team

  • Comments lie. Flamegraphs don't. Every CPU bug I found had a comment next to it claiming the code was cheap. The comment described intent, not cost. If you're inheriting a codebase, treat every "this is cheap" comment as a hypothesis to verify.
  • Don't blame the protocol before you've measured. I came within a day of forking the architecture to a Controller pattern. The reasoning was internally consistent: gossip-fanout costs grow with peer count, we have a peer-count problem, therefore replace gossip. The flamegraph showed it had nothing to do with peer count — it was a 100ms tick doing the same expensive thing repeatedly. The architecture wasn't the bug.
  • #[instrument] on an infinite loop is almost always wrong. Spans are scoped to function lifetime. An infinite loop's span lives forever. Use #[instrument] on the work inside the loop, not on the loop itself.
  • Idempotent ≠ cheap. "Calling this multiple times has the same effect" says nothing about the cost of the calls. Especially in network code — endpoint.connect() is idempotent from the application's perspective but does a full TLS handshake every time.

For a second-channel sanity check, the Jaeger service-architecture DAG agreed with the dashboard. Service-level call counts looked reasonable for the workload, not the four-figures-per-second I'd been seeing pre-fix:

Jaeger UI System Architecture DAG view, light theme. Six service nodes connected by directed edges: broker at the top, data-gateway and gateway in the middle row, compute and compute-gateway in the next, registry at the bottom. Call counts on the edges are modest two-digit numbers — 27 between broker and data-gateway, 27 between broker and gateway, 12 between data-gateway and compute, 15 between data-gateway and gateway, 17 between compute and compute-gateway, 6 between gateway and registry. Reasonable service-to-service traffic for an idle mesh, no longer the thousands-per-second handshake pattern.

What's next

The 6% reading held for an hour. I left it running overnight to soak. The next morning it was at 100% again.

Different bug. Slower. Worse, because it took an overnight run to surface. Next post.

A line graph trending upward and to the right in rust-orange, on a cream background with faint dot grid and a vertical rust-orange accent bar at the left edge. The line never reaches a plateau. Part 04 of 08
Building A Distributed Mesh in Rust · part 04
May 08, 2026

Four days into the soak, the RAM was still climbing

Left the cluster running. Per-node CPU at 5% was real. The leak was somewhere else. Over 4 days of soak, 18 nodes climbed from 1 GB total RAM to 12 GB — and the worst offenders were nodes with zero active peers, holding the most state. Five accumulating data structures, no pruner, the time-tested pattern of "we'll clean it up later" never quite getting cleaned up.

After the join_peers storm fix the canary settled. Host CPU 6%, per-node CPU 0.05–0.10 cores, mesh stable, 19/19 live. I left it running overnight, expecting to come back to roughly the same numbers in the morning.

It came back to 100%.

Four days later, with the soak still going on the same processes, the numbers looked like this:

Metric | t=0 fresh | t+32min | t+105h (now)
Total CPU (sysinfo sum) | 1.38 cores | 2.59 cores | 14.90 cores
Total RAM | 1.04 GB | 1.32 GB | 11.87 GB
Per-node avg CPU (cores) | 0.077 | 0.144 | 0.83

The previous post celebrated 0.05–0.10 cores per node. Four days later the average was 0.83. Eleven times worse. The cluster wasn't crashing; it was eroding.

The fingerprint that didn't fit

The shape of the cost was the giveaway. Looking at top consumers after four days:

Node | Peers | CPU | RAM
compute-8ac4eca1 | 1 | 2.57 cores | 2.35 GB
gateway-16bfa75a | 0 | 1.88 cores | 1.64 GB
compute-fbedd6ed | 0 | 1.39 cores | 1.11 GB
broker-d8329b3a | 0 | 0.94 cores | 0.64 GB

Three of the four nodes burning the most CPU had zero peers, and the fourth had one. That's a contradiction in a healthy mesh: a node with no peer connections should be idle. Either the work isn't happening on connections at all, or the connections aren't being counted right.

admin-ui Nodes tab after a 104.9-hour soak, dark theme. A grid of node cards across two meshes. The expanded card is compute-fbedd6ed in mesh-b, peers 0, age 104.9 hours, status live. Both the CPU bar and the RAM bar are pegged solid red at 100 percent — CPU reads 1.39 of 1.00 cores, RAM reads 1.11 of 0.50 gb. The detail panel notes "no peer connections reported." This is a node that lost all its peers but is still doing 1.4 cores of work — the fingerprint of dead-state accumulation that the rest of the post unpacks.

Both turned out to be true. The work was happening on connections the peer count didn't know about — closed-but-not-cleaned Connection handles still pinned by background tasks — and on global maps that had been growing for 105 hours without anyone telling them to shrink.

Five leaks, in order of impact

1. Ghost QUIC connections on peer reconnect

iroh-quinn allows multiple parallel connections from the same identity. When the same peer reconnects after a network blip, the application sees a new Connection; the old one just sits there alive until something explicitly closes it. The accept loop was doing this:

reg.insert(peer_id.clone(), conn.clone());      // overwrites registry
tokio::spawn(run_bi_echo_reader(conn_bi));      // holds clone of new conn
run_frame_reader(/* ... */, conn /* moved */);  // holds new conn

What it never did was close the previous Connection before overwriting the entry. The old run_frame_reader task was still parked on accept_uni() of the old conn, which would never error because the old conn was never closed. Same with run_bi_echo_reader and accept_bi(). They sat there forever, holding Connection clones and keeping iroh-quinn's per-connection state (congestion controller, TLS session, packet pacer) live.

Fix:

if let Some((_, old_conn)) = reg.remove(&peer_id) {
    old_conn.close(0u32.into(), b"superseded by new connection");
}
reg.insert(peer_id.clone(), conn.clone());

Connection::close causes both accept_uni and accept_bi on the old conn to return Err, so the old tasks break out cleanly and the iroh-quinn state is dropped.
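The close-before-overwrite rule can be modeled without a network. A minimal sketch, assuming a std HashMap in place of the DashMap registry and a hypothetical FakeConn in place of a QUIC connection. Clones of FakeConn share one closed flag, which is the property that lets a parked reader task notice the close.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a connection handle. Clones share state.
#[derive(Clone, Default)]
struct FakeConn {
    closed: Arc<AtomicBool>,
}

impl FakeConn {
    fn close(&self) {
        self.closed.store(true, Ordering::SeqCst);
    }
    fn is_closed(&self) -> bool {
        self.closed.load(Ordering::SeqCst)
    }
}

/// Registers `conn` for `peer_id`, closing any connection it replaces.
/// `HashMap::insert` returns the previous value, so the old handle can
/// be closed explicitly instead of lingering in reader tasks.
fn register(reg: &mut HashMap<String, FakeConn>, peer_id: &str, conn: FakeConn) {
    if let Some(old) = reg.insert(peer_id.to_owned(), conn) {
        old.close();
    }
}

/// Returns (old connection closed?, new connection closed?).
fn reconnect_demo() -> (bool, bool) {
    let mut reg = HashMap::new();
    let first = FakeConn::default();
    let held_by_reader = first.clone(); // what a parked reader task holds
    register(&mut reg, "peer-a", first);

    let second = FakeConn::default();
    register(&mut reg, "peer-a", second.clone()); // same peer reconnects

    (held_by_reader.is_closed(), second.is_closed())
}
```

After the reconnect, the clone held by the old reader reports closed while the new connection stays open. Dropping the old map value alone would not do that, because the reader's clone keeps the shared state alive.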

2. live_digests and topic_membership grow forever

Two process-global DashMaps. live_digests is keyed by node_id with the latest GossipDigest received from that node. topic_membership is keyed by topic-label with a set of node_ids ever seen on that topic. Both are populated on every received digest. Neither had a pruning mechanism.

Every cluster respawn (and there had been many over the week of debugging) added entries with new node_ids — admin-ui pre-mints a fresh keypair per spawn, so every restart creates a new identity. Old identities never broadcast again. Their entries stay forever.

Fix: a single background task on a 5-second timer that scans live_digests, drops entries whose wall_time_ms is older than MESH_STALENESS_MS (default 30 seconds), then removes those node_ids from every topic's membership set.

use std::time::{Duration, SystemTime, UNIX_EPOCH};

async fn run_staleness_pruner() {
    // Entries older than this are treated as dead (default 30 s).
    let staleness_ms: u64 = std::env::var("MESH_STALENESS_MS")
        .ok().and_then(|s| s.parse().ok()).unwrap_or(30_000);
    let mut tick = tokio::time::interval(Duration::from_millis(5_000));
    loop {
        tick.tick().await;
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH).map(|d| d.as_millis() as u64).unwrap_or(0);
        // Collect the stale keys first so no map guard is held while removing.
        let stale: Vec<String> = live_digests().iter()
            .filter(|e| now_ms.saturating_sub(e.value().wall_time_ms) > staleness_ms)
            .map(|e| e.key().clone()).collect();
        for node_id in &stale {
            live_digests().remove(node_id);
        }
        // Drop the same node ids from every topic's membership set.
        for mut topic_entry in topic_membership().iter_mut() {
            for node_id in &stale {
                topic_entry.value_mut().remove(node_id);
            }
        }
    }
}

The self-injection on each node's gossip emit refreshes its own wall_time_ms, so a live node's entry never goes stale.
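The pruning rule is easier to test when the clock is passed in. A minimal sketch under stated assumptions: std HashMap and HashSet stand in for the process-global DashMaps, the digest is reduced to its wall_time_ms, and prune_stale and prune_demo are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Removes digests older than `staleness_ms` and drops those node ids
/// from every topic. Returns the pruned ids. `now_ms` is a parameter so
/// the rule can be exercised without a real clock.
fn prune_stale(
    live: &mut HashMap<String, u64>,               // node_id -> wall_time_ms
    topics: &mut HashMap<String, HashSet<String>>, // topic -> node_ids
    now_ms: u64,
    staleness_ms: u64,
) -> Vec<String> {
    let mut stale: Vec<String> = live
        .iter()
        .filter(|(_, seen_ms)| now_ms.saturating_sub(**seen_ms) > staleness_ms)
        .map(|(id, _)| id.clone())
        .collect();
    stale.sort(); // deterministic order for callers and tests
    for id in &stale {
        live.remove(id);
        for members in topics.values_mut() {
            members.remove(id);
        }
    }
    stale
}

/// One dead identity and one node that refreshed itself just now.
fn prune_demo() -> (Vec<String>, usize, usize) {
    let mut live = HashMap::new();
    live.insert("old-node".to_string(), 1_000); // last seen long ago
    live.insert("self".to_string(), 100_000);   // refreshed on every emit
    let mut topics = HashMap::new();
    topics.insert(
        "mesh-a".to_string(),
        HashSet::from(["old-node".to_string(), "self".to_string()]),
    );
    let pruned = prune_stale(&mut live, &mut topics, 100_000, 30_000);
    (pruned, live.len(), topics["mesh-a"].len())
}
```

Only the dead identity is removed. The entry whose timestamp equals now_ms survives, which mirrors how a node's own refreshed entry stays live.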

3. mesh_id_registry never pruned

Parallel to the PeerRegistry, there's a mesh_id_registry: DashMap<String, String> that maps peer_id → peer_mesh_id, populated from Hello frames so bridges can know which mesh a peer belongs to. The disconnect handler removed the peer from PeerRegistry but left the entry in mesh_id_registry. Over the soak, that map grew with every peer ever seen.

One-line fix in the existing disconnect handler:

Err(_) => {
    registry.remove(&peer_id_str);
    mesh_id_registry.remove(&peer_id_str);  // added this
    // ...
}

4. joined_peers HashSet only ever inserted

The fix from the previous post — joined_peers to dedupe join_peers calls — was correct, but I never made it shrink. If a peer disconnects, its entry in PeerRegistry is gone, but its entry in joined_peers lingers. On reconnect, the new connection wouldn't get join_peers called for it, because the old entry was still there.

Fix: joined_peers.retain(|p| registry.contains_key(p)) on every tick, so the set trims itself down to the peers that are still registered.
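Fixes 3 and 4 are two halves of one rule: every structure keyed by peer id has to shrink when the peer goes away. Here is a small sketch of that rule, assuming a hypothetical PeerState that groups the three structures. The real code keeps them as separate maps, and the u64 connection handle here is only a stand-in.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical grouping of the three per-peer structures discussed above.
#[derive(Default)]
pub struct PeerState {
    pub registry: HashMap<String, u64>, // peer_id -> connection handle (stand-in)
    pub mesh_id_registry: HashMap<String, String>, // peer_id -> mesh_id
    pub joined_peers: HashSet<String>,
}

impl PeerState {
    /// Returns true when the caller should invoke join_peers for this peer.
    pub fn on_connect(&mut self, peer_id: &str, mesh_id: &str, conn: u64) -> bool {
        self.registry.insert(peer_id.to_string(), conn);
        self.mesh_id_registry
            .insert(peer_id.to_string(), mesh_id.to_string());
        // HashSet::insert is true only if the id was not already present.
        self.joined_peers.insert(peer_id.to_string())
    }

    /// Disconnect removes the peer from both registries, not just one.
    pub fn on_disconnect(&mut self, peer_id: &str) {
        self.registry.remove(peer_id);
        self.mesh_id_registry.remove(peer_id);
    }

    /// Periodic trim: forget "already joined" for peers that are gone.
    pub fn on_tick(&mut self) {
        let registry = &self.registry;
        self.joined_peers.retain(|p| registry.contains_key(p));
    }
}
```

With the trim in place, a peer that reconnects after a tick is treated as new again, so join_peers gets called for the fresh connection.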

5. dial_seeds one-shot, no retry

Not a leak per se but discovered during the same audit. dial_seeds did a single endpoint.connect() per seed at boot. If the seed wasn't up yet (race during cluster bootstrap), the child was permanently isolated — no retry, no fallback. Replaced with per-seed tokio tasks doing exponential-backoff retry: 1s, 2s, 4s, 8s, 16s, capped at 30s, max 10 attempts. Spans mesh.seed.retry and mesh.seed.giveup are emitted at INFO so seed-side issues are visible in Jaeger.
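The retry schedule above fits in one small function. This is a sketch, not the repo's code: retry_delay, MAX_ATTEMPTS, and CAP are names I'm assuming for illustration, and I'm assuming the 10-attempt limit counts retries from zero.

```rust
use std::time::Duration;

pub const MAX_ATTEMPTS: u32 = 10;
pub const CAP: Duration = Duration::from_secs(30);

/// Delay before retry number `attempt` (0-based), or None once the
/// attempt budget is spent.
pub fn retry_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_ATTEMPTS {
        return None; // the point where a give-up span would be emitted
    }
    let secs = 1u64 << attempt; // 1, 2, 4, 8, 16, 32, ...
    Some(Duration::from_secs(secs).min(CAP)) // 32 and above clamp to 30
}
```

Under those assumptions the delays run 1, 2, 4, 8, 16 seconds and then hold at 30 seconds for the remaining five attempts.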

What I'd tell a team

  • Soak the substrate before you trust it. A 30-second canary won't catch slow leaks. Five minutes won't either. The bugs in this post took 4 days to surface in a meaningful way. If your substrate matters, leave it running overnight before you ship anything on top of it.
  • Process-global state needs an owner. Every DashMap that lives for the process lifetime needs a clear answer to "what removes entries from this?" If the answer is "nothing, the process restarts and it's fine" — that's a hidden cluster-restart dependency. Add a pruner.
  • Closed channels and overwritten registry entries are not free. The tokio::select! continue on a closed channel that spins a core, the DashMap::insert over an existing key whose value held resources — both compile, both look fine in review, both cost you real CPU and memory. When you find yourself overwriting a key whose previous value held something with Drop semantics (a Connection, a JoinHandle, a File), explicitly handle the old value.
  • The fingerprint matters. Top CPU consumers having zero peers was the clue. It told me the problem couldn't be peer-count-dependent — it had to be time-dependent and stateful. That narrowed the search from "anywhere in the gossip protocol" to "what state grows when we lose connections."
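The "explicitly handle the old value" advice is mostly about not ignoring what insert returns. A minimal sketch using std's HashMap, whose insert returns the previous value as an Option; Conn and replace_conn are hypothetical stand-ins for a real connection type and registry helper.

```rust
use std::collections::HashMap;

/// Stand-in for a value that owns a resource (a connection, a task handle).
pub struct Conn {
    pub id: u32,
    pub closed: bool,
}

impl Conn {
    pub fn close(&mut self) {
        self.closed = true;
    }
}

/// Inserts `conn` under `peer`. If an entry was already there, closes it
/// explicitly and hands it back so the caller can log or inspect it.
pub fn replace_conn(
    map: &mut HashMap<String, Conn>,
    peer: &str,
    conn: Conn,
) -> Option<Conn> {
    // insert returns the previous value; discarding it silently is the trap.
    let mut old = map.insert(peer.to_string(), conn)?;
    old.close();
    Some(old)
}
```

The same shape applies to any map whose insert returns the displaced value: bind it, shut it down on purpose, and only then let it drop.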

After the five fixes landed I started another soak, this time with the new pruner emitting mesh.staleness.pruned spans every 5 seconds and a 30s MESH_STALENESS_MS window. The topology view stayed populated with real values rather than ghost entries:

Topology tab from the admin-ui dashboard after the five soak-fix commits landed. Two regions — mesh-a with 8 nodes and mesh-b with 8 nodes — plus two bridge nodes at the top connecting them. Each node tile is colored by type (purple registry, blue gateway, green compute, orange broker, gold bridge) and labeled with TX/RX frame counters and CPU/MEM utilization. CPU values range 0.07 to 0.4 cores against 1.0–4.0 budgets — solidly green, no nodes pegged. MEM values 0.07 to 0.10 GB against 0.50–2.00 budgets. Yellow dashed lines show cross-mesh bridge links; subtle within-mesh edges show HyParView active-view connections.

What's next

Steady-state works. The next question is what happens under load that isn't steady — when nodes die, when links flap, when clocks drift. The chaos battery I built for Sprint 02 has been sitting unused while the CPU work happened. Next week I unleash it on the substrate I just got working.

A stylized mesh of small node circles connected by spokes, with a jagged crack running through the center in rust-orange: the chaos cut. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge.
Building A Distributed Mesh in Rust · Part 05 of 08
May 15, 2026

Chaos-pass replaces tests-pass

Steady-state passing isn't good enough for a substrate. I built a chaos harness with 13 primitives (kill, restart, partition, wedge, flap, clock-skew, slow-link, lossy-link, the works) and ran it against the just-stabilized mesh. Twelve tests, seven chaos-class, all green. Here's what each primitive surfaces and why "tests pass" by itself doesn't mean the substrate is sound.

After the soak fixes settled, the cluster ran flat for a week. CPU bounded, RAM bounded, peer counts stable, no nodes silently eroding. That's a green light for "the substrate works at idle," not for "the substrate is shippable." A control-plane substrate that only behaves under steady-state is the kind of thing you discover is broken six hours into an incident, when a single broker has been flapping and the rest of the cluster can't decide if it's dead.

So this week I pointed the chaos battery at it.

The principle

Sprint 02 of this project locked Golden Principle #5: chaos-pass replaces tests-pass. The idea isn't novel — Netflix put Chaos Monkey in production a decade ago, Jepsen has been making distributed systems vendors look bad since 2013. The novel part for this substrate is that the chaos battery is the same harness regardless of which feature sprint is running. Every sprint's exit criterion has to be "the existing tests pass under chaos," not "the existing tests pass."

That gives you a forcing function. A test that's green at idle but flaky under PartitionPair is broken, not flaky — the substrate has a real failure mode you're now choosing to ignore. The chaos battery is what stops you from ignoring it.

The primitives

The chaos crate (crates/mesh-chaos) ships thirteen primitives. Each one is a Rust struct implementing ChaosPrimitive with three things: an apply() that does the damage, a revert() that puts things back, and a detect() that tells the test framework what evidence to look for in the OTLP spans to confirm the substrate noticed.

| Primitive | What it does | What it surfaces |
| --- | --- | --- |
| KillNode | SIGKILL one node | Connection-drop detection; staleness pruner |
| RestartNode | Kill + respawn same identity | Reconnect path; ghost-connection cleanup |
| BurstKill | Kill multiple nodes at once | Quorum/membership behavior under sudden loss |
| WedgeNode | Pause node's tokio runtime | Slow-vs-dead distinction; backpressure |
| DiskFull | Fill data dir | Identity file fsync failures; graceful degradation |
| PartitionPair | Break connectivity between two nodes | Spanning-tree heal; alt-path routing |
| PartitionSubset | Isolate a subset from the rest | Split-brain detection; bridge survival |
| FlapLink | Repeatedly up/down a peer link | Reconnect storms; idle-timeout interaction |
| FirewallInbound | Drop inbound traffic to a node | Asymmetric failure; outbound-still-works case |
| ClockSkew | Skew a node's wall clock | Staleness comparison robustness |
| NatShift | Simulate NAT remapping | iroh's hole-punch re-establishment |
| SlowLink | Add latency to a link | Backpressure; head-of-line blocking |
| LossyLink | Drop a % of packets on a link | Plumtree's IHAVE retransmit path |

Eight of those have corresponding test files (tests/chaos/*.rs). The other five are in the harness but not yet wired into named tests — they're available for ad-hoc cluster torture.
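For readers who want the apply / revert / detect shape in code, here is a rough sketch. Everything in it is an assumption for illustration: the real trait in crates/mesh-chaos may use different signatures, Evidence is a hypothetical type, and the toy KillNode flips an in-memory flag instead of sending SIGKILL. The 35-second window is my guess at a 30-second staleness window plus one 5-second pruner tick.

```rust
/// What a test should look for in the exported spans (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct Evidence {
    pub span_name: &'static str,
    pub within_ms: u64,
}

/// Hypothetical signatures; the real ChaosPrimitive trait may differ.
pub trait ChaosPrimitive {
    fn apply(&mut self) -> Result<(), String>;
    fn revert(&mut self) -> Result<(), String>;
    fn detect(&self) -> Evidence;
}

/// Toy KillNode acting on a flag rather than a real process.
pub struct KillNode {
    pub alive: bool,
}

impl ChaosPrimitive for KillNode {
    fn apply(&mut self) -> Result<(), String> {
        self.alive = false;
        Ok(())
    }
    fn revert(&mut self) -> Result<(), String> {
        self.alive = true;
        Ok(())
    }
    fn detect(&self) -> Evidence {
        // Assumed: the pruner's span is the sign that the mesh noticed.
        Evidence { span_name: "mesh.staleness.pruned", within_ms: 35_000 }
    }
}

/// Applies a primitive, lets the caller check for evidence, then reverts.
pub fn run_with<P, T, F>(primitive: &mut P, observe: F) -> Result<T, String>
where
    P: ChaosPrimitive,
    F: FnOnce(&Evidence) -> T,
{
    primitive.apply()?;
    let outcome = observe(&primitive.detect());
    primitive.revert()?;
    Ok(outcome)
}
```

Keeping detect() next to apply() means the primitive that does the damage also states what "the substrate noticed" should look like, so each test does not have to reinvent it.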

The run

I bootstrapped 18 nodes — the standard 2-mesh layout, with bridges — and pointed the test runner at the chaos battery. The admin-ui Tests tab is where I watched the results land.

Admin-ui Tests tab showing 12 tests in the registry across 34 reports. Card layout, each tile a test name with a passed/failed badge, run count, last duration, and outcome summary. Functional tests on top — framer-roundtrip 2 runs passed, framer-truncation 1 passed, traced-frame-roundtrip 2 passed, unknown-tag-rejected 1 passed, bi-stream-echo 1 passed at 12.3s, backpressure-stream-flood (chaos) 1 passed at 11.3s with 200 round-trips and 0 errors. Chaos tests on the bottom row — chaos-soak-9prim-1min passed in 63.9s with 7 events 7 passed, chaos-soak-9prim-5min passed in 307.7s with 28 events 28 passed, mesh-five-types-present passed in 8.1s, remove-resilience passed in 33.4s killing 3 of 6 spawned nodes and seeing all 3 survivors emit, gossip-swarm-forms passed in 33.9s with 200 received digests across 4 nodes, gossip-mesh-to-mesh passed in 56.9s with 100 cross.peer_connected spans across all services in mesh-A.

The numbers worth pulling out of that grid:

  • chaos-soak-9prim-5min ran 307.7 seconds, fired 28 chaos events, all 28 passed within the 15% flake budget. That's the headline test — 5 minutes of continuous random chaos, the substrate doesn't fall over.
  • remove-resilience killed 3 of 6 spawned nodes and verified all 3 survivors continued emitting heartbeats. The substrate notices, prunes the dead entries, keeps going.
  • gossip-swarm-forms asserts that after a clean boot 4 nodes exchange ≥200 gossip digests within 34 seconds. That's the basic "Plumtree is actually doing its job" test.
  • gossip-mesh-to-mesh verifies that 100 mesh.cross.peer_connected spans fire across all services within 57 seconds — proving the bridge architecture actually bridges.
  • backpressure-stream-flood fires 32 concurrent bi-streams of 1 KiB payloads for 10 seconds. 200+ round-trips, 0 errors. The data plane survives concurrent load.

What chaos catches that integration tests don't

The interesting one to me is the gap between the functional tests on top (5 of them — framer round-trip, frame truncation, traced-frame, unknown-tag rejection, bi-stream-echo) and the chaos tests on the bottom (7 of them, all chaos-tagged).

A functional test like framer-roundtrip answers "does the framer encode and decode correctly when nothing else is going on." That's necessary. It is not enough. The framer also has to be correct when the surrounding QUIC connection is being killed by KillNode, when the receiving node is being wedged by WedgeNode, when packets are being dropped by LossyLink. The chaos tests run the same framer code against those conditions.

backpressure-stream-flood is the cleanest example. The flood test by itself would catch "can the substrate do 200 round-trips in 10 seconds." It can. The chaos-tagged version of the same test catches "can it do 200 round-trips in 10 seconds while three random chaos primitives are firing in the background." That's a different question.

The timeline view is where chaos-period vs. steady-period events become legible. Every peer.connected event shows up with its source and target; chaos events get their own timeline rows; you can see them interleave in real time.

Admin-ui Timeline tab during a chaos-soak run. A reverse-chronological stream of events, each row with a timestamp, an event-type pill (peer.connected in green, node.ready in green, node.spawn in blue), the source node, and the target node. Dozens of peer.connected events firing within the same second cluster — bridge-40d6a647 to registry-18254465, registry-18254465 to bridge-40d6a647, bridge-40d6a647 to broker-09d7e15d, etc. — interleaved with node.ready events as nodes finish booting and node.spawn events as the admin-ui forks subprocesses. The density makes the substrate's reactivity legible: every chaos kill is followed by a wave of peer.connected re-establishments.

When the chaos harness fires KillNode, you see a node disappear from the timeline. Within a few seconds you see the surviving nodes re-establishing connections — peer.connected events fan out across the substrate as HyParView reshuffles the active view. The chaos worked; the substrate responded; both are visible.

What I'd tell a team

  • Steady-state is the baseline, not the target. If your test suite only runs at idle, the test suite isn't done. It says nothing about how the substrate behaves under the conditions you'll actually encounter in production. Idle is the easiest case; you need the hardest cases too.
  • Build chaos primitives once, reuse them everywhere. The 13 primitives in mesh-chaos are shared between unit tests, soak runs, and ad-hoc torture sessions in the admin-ui. The cost amortizes immediately. The alternative — every test author writing their own kill-the-broker helper — gives you 5 incompatible flaky helpers and no shared vocabulary about what "broken" means.
  • Make chaos visible. The admin-ui's Chaos and Timeline tabs are not just dashboards — they're the operator's view of what's failing and what's recovering. Without them, "the test passed" is a green light and nothing else. With them, you can see the substrate noticing chaos, choosing to react, and re-establishing. The visibility is the test.
  • Define a flake budget up front and stick to it. chaos-soak-9prim-5min runs with a 15% flake budget — 4 out of 28 events can be allowed to miss their detection window before the test fails. That budget is what separates "we have flaky tests" from "we have a known reliability envelope." If the substrate ever runs at 50% flake, that's an architecture problem, not a test problem.
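The flake-budget arithmetic is worth pinning down, because the rounding direction decides whether the fifth miss fails the run. A sketch assuming the budget rounds down, which matches the 4-of-28 figure above; allowed_misses and within_budget are hypothetical names.

```rust
/// Number of chaos events allowed to miss their detection window:
/// floor(events * budget_pct / 100), via integer division.
pub fn allowed_misses(events: u32, budget_pct: u32) -> u32 {
    events * budget_pct / 100
}

/// True while the run is still inside its flake budget.
pub fn within_budget(events: u32, missed: u32, budget_pct: u32) -> bool {
    missed <= allowed_misses(events, budget_pct)
}
```

For 28 events at 15%, the raw figure is 4.2, so four misses pass and a fifth fails the test.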

What's next

That closes the engineering arc. The next post wraps it up — the substrate is sound, the chaos battery is green, and the whole thing is going public. Open-sourcing it next week.

A small mesh of nodes around a single rust-orange hub with spokes, accompanied by a stylized MIT-license seal in the corner. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge.
Building A Distributed Mesh in Rust · Part 06 of 08
May 22, 2026

Open-sourcing the Rust Distributed Mesh

Five weeks of building, breaking, and fixing a P2P mesh substrate in Rust. Today I'm pushing the whole thing public — the iroh-based transport, the gossip emit loop, the staleness pruner, the example node types, the React dashboard. Not a thing to take and run in production. A thing to read while you're building your own.

Five weeks ago I started building a P2P mesh substrate in Rust on iroh. The point was never the mesh — it was learning what it actually costs to run one before betting a real product on it. The first canary pegged an 80-core box at 100% CPU with 18 nodes idling. The fifth week's canary holds the same 18 nodes at 5%, steady-state, through a 4-day soak.

Today I'm pushing the whole repo public.

github.com/drlukeangel/rust-distributed-mesh

Full multi-mesh topology view from the admin-ui dashboard, dark theme. Two large dotted-outline regions labeled mesh-a (10 nodes) and mesh-b (12 nodes). Each region contains a ring of colored circular nodes — purple registries, blue gateways, green computes, orange brokers — with per-node frames-per-minute counters in green. Four rounded rectangles in the center column represent bridge nodes connecting the two meshes. Dense dashed yellow lines fan out from the bridges to nodes in both meshes, showing cross-mesh routing. Within each mesh region, solid blue lines mark same-mesh peer links from the HyParView active view. The whole graph is busy but shaped — not random, not full-mesh.

It is not a library you should take and run in production. It's a kit you can read while you're building your own. The bugs I paid for are now diffs you can study. The dashboard I built so I'd believe my own numbers is in there too: plain React, plain Vite, no state framework. The flamegraph captures are in /profiles.

What's in the box

| Piece | What it does |
| --- | --- |
| crates/mesh-node-base | The substrate. Identity, gossip emit loop, peer registry, LoadSampler (self-reported CPU/RAM), staleness pruner. Built on iroh-gossip 0.98. |
| crates/mesh-transport | Thin layer over iroh's Endpoint. ALPN, mDNS toggle, bind addr, 30s idle timeout. |
| crates/mesh-telemetry | OTLP/tracing init. Every node's spans land in Jaeger; receive-time staleness is local, not sender-side. |
| admin-ui/ | React + Vite topology view. Live node grid, hub-and-leaf layout, CPU/RAM bars per node, kill button per card. The thing you stare at when you don't believe the numbers. |
| broker / gateway / compute / registry / bridge | Example node types. Each one is a 10-line main.rs that calls NodeRuntime::new("type").run().await. From the substrate's perspective they're interchangeable. |

Stack: rust (substrate) · iroh 0.98 + iroh-gossip (QUIC + NAT traversal + Plumtree/HyParView) · tokio · opentelemetry → Jaeger · react + vite (UI).

The five engineering posts in this notebook walk through the work in order:

  1. Why I'm building a distributed mesh substrate in Rust — the architecture, the iroh choice, what I expected to break.
  2. When 18 nodes pegged my 80-core box at 100% — the first round of obvious wins (mDNS off, gossip interval up, INFO spans down). 100% → 35%.
  3. Flamegraphing your way out of "this can't possibly be right" — the join_peers storm that I almost rewrote the architecture to escape. 35% → 5%.
  4. Four days into the soak, the RAM was still climbing — the slow leaks that only a long-running soak surfaces. Ghost connections, unbounded global maps, the staleness pruner that ties it together.
  5. Chaos-pass replaces tests-pass — 13 chaos primitives, 12 tests, 5-minute soak under continuous random chaos. All green.

What the kit is, and isn't

This is a learning artifact. It is not:

  • A finished product. The streaming layer that sits on top of this substrate — the actual event-streaming app — isn't open yet. What you're reading is what's underneath it.
  • A managed iroh integration. It's an opinionated set of patterns for using iroh-gossip in a long-running process. Different from a library — closer to a scaffold.
  • Production-ready. I've run it on one box. I haven't run it across NATs, across regions, or at scale. The substrate is correct under the workload I've tested; it's not proven outside it.

It is:

  • A documented diff trail. Every fix in the four engineering posts above corresponds to a commit in the repo. You can git log your way through the optimization arc.
  • A flamegraph dataset. The 2 GB .folded files that surfaced the join_peers storm are in /profiles. Run inferno-flamegraph on them and you'll see the bug.
  • A dashboard you can read. The admin-ui code is short — maybe 800 lines of TypeScript. It joins the mesh as a passive observer and renders the topology live. Plain React, plain Vite, no state framework. Lift it if it helps.

Telemetry is the second view of the same thing. Every span lands in Jaeger; the System Architecture DAG gives you the service-to-service call pattern without leaving the browser:

Jaeger UI System Architecture DAG showing six mesh service nodes — broker, data-gateway, compute, compute-gateway, gateway, registry — connected by directed edges with call counts in the 6–27 range. The graph is sparse and legible: broker sits at the top, gateway and registry at the bottom, the data and compute paths between them. The counts are reasonable steady-state values, not the four-figures-per-second pre-fix pattern.

Numbers, before and after

| Metric | First canary | After five weeks |
| --- | --- | --- |
| Host CPU (80-logical box, 18 nodes idle) | 100% | 5% |
| Per-node CPU avg | 0.83 cores | 0.05–0.10 cores |
| Per-node RAM avg | growing linearly | bounded at ~60 MB |
| Stable across 4-day soak | no, climbing on every axis | yes |
| Bugs I shipped in the first version | every single one in the four posts above | most fixed; one or two known sharp edges remain |

For comparison: a Bitcoin Core full node idles at 5–10% of one core. A Tor relay idles under 1%. The mesh substrate as it stands is competitive with both at this scale.

What I'd tell someone building one

I've said most of this in the four posts. Concentrated:

  • The protocol probably isn't the bug. Measure first. I almost rewrote to a centralized-Controller architecture before checking. The flamegraph took 15 seconds to capture and made the question moot.
  • Comments lie. Flamegraphs don't. Every CPU bug I found had a comment claiming the code was cheap. Trust the profile.
  • Soak the substrate. A 30-second canary will not catch the bugs that take 4 days to surface. If the substrate is load-bearing, leave it running overnight before you build anything on it.
  • Process-global state needs an owner. Every DashMap that lives for the process lifetime needs a clear answer to "what removes entries from this?" If the answer is "nothing, we just restart" — add a pruner before you ship.
  • #[instrument] on an infinite loop is almost always wrong. The span never closes. The event queue grows forever. Decorate the work inside the loop, not the loop itself.

What's next

The streaming layer is what comes next, built on the substrate I just spent five weeks debugging — event streaming, multi-tenant by topic, nodes across networks — riding on a foundation that now actually idles when it has nothing to do. But there turned out to be one more leak hiding in the substrate first, which is where this notebook goes from here.

If you're building on iroh, fork freely. If you find a bug I missed, open an issue. The repo will keep moving.

Many short task bars draining down to a flat baseline in ink, while one rust-orange stack keeps growing taller and never drains: the signature of spawned tasks that outlive the work that spawned them. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge.
Building A Distributed Mesh in Rust · Part 07 of 08
May 29, 2026

Hunting a connection leak the soak test wouldn't explain

I'd already open-sourced this thing. Then a longer, meaner soak — kill and respawn for hours, not minutes — showed RSS climbing and never coming back down, on Windows and Linux both. The bug wasn't where I'd looked before. It was a select! arm that quietly switched itself off, and two more leaks hiding behind it.

I'd already pushed this whole thing public. The mesh worked: nodes discovered each other, gossiped, survived the chaos battery. The four-day soak had already cost me five leaks and taught me to never trust a thirty-second canary. I thought I was done with this class of bug.

Then I ran a meaner soak — kill and respawn on a tight loop, the kind you leave going for hours — and watched RSS climb and never come back down. Same shape on Windows, same shape on Linux. Classic leak, and not the one I'd already fixed.

This is the story of finding it, and the three things that turned out to be wrong at once.

The trap: theorizing about the reap path

The mesh runs an iroh QUIC transport under a HyParView/Plumtree gossip layer. Under churn, peers join and leave the active view constantly, so connections are born and reaped all day long. Every reap path I traced should have worked. The connection loop returned. close() got called. The maps got pruned. I spent hours reading code that, on paper, released everything it touched.

This is the same hole I fell into before the flamegraph post — sitting there reasoning about what the code should do instead of measuring what it did. The lesson refuses to stick the first time, or the second: when a leak resists reasoning, stop reasoning and instrument it.

The breakthrough: count, don't infer

So I stopped reading and added two counters to the gossip actor — connection-loop tasks spawned versus tasks finished — and ran a short soak. The numbers ended the debate:

spawned = 174, finished = 52   →  122 connection-loop tasks stuck alive

Meanwhile the membership map sat bounded at ~15 peers, exactly as it should. So roughly 107 connection-loop tasks had no peer state behind them and yet had never exited. That contradiction — live tasks with nothing to serve — pinned the bug to one place instead of the whole gossip layer.

A spawned-versus-finished tally for connection-loop tasks. The spawned bar reads 174; the finished bar reads 52; the gap of 122 is highlighted in rust-orange and labelled "stuck alive." A separate small bar shows the membership map holding steady at about 15 peers. The point the diagram makes: tasks are accumulating far faster than they retire, while the thing they are supposed to track stays flat — so the leak is in task lifetime, not in peer state.
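The counters themselves are cheap to add. Here is a minimal sketch of a spawned-versus-finished tally using a guard whose Drop does the "finished" increment, so every exit path is counted. TaskTally and TaskGuard are hypothetical names, not what went into the gossip actor.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Default)]
pub struct TaskTally {
    spawned: AtomicUsize,
    finished: AtomicUsize,
}

impl TaskTally {
    /// Call when a task starts; move the guard into the task so it
    /// lives exactly as long as the task does.
    pub fn track(tally: &Arc<TaskTally>) -> TaskGuard {
        tally.spawned.fetch_add(1, Ordering::Relaxed);
        TaskGuard { tally: Arc::clone(tally) }
    }

    /// Tasks that have started but not yet ended.
    pub fn stuck(&self) -> usize {
        // Read `finished` first so a concurrent spawn-and-finish cannot
        // make the subtraction go negative.
        let finished = self.finished.load(Ordering::Relaxed);
        let spawned = self.spawned.load(Ordering::Relaxed);
        spawned.saturating_sub(finished)
    }
}

pub struct TaskGuard {
    tally: Arc<TaskTally>,
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        // Runs on normal return, early return, and unwinding alike.
        self.tally.finished.fetch_add(1, Ordering::Relaxed);
    }
}
```

A stuck() value that keeps rising while the thing the tasks serve stays flat is the same signal as the 174-versus-52 reading above.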

Root cause #1: a select! arm that switched itself off

Here's the send loop, simplified to the part that mattered:

tokio::select! {
    _ = &mut closed => break,
    Some(msg) = self.send_rx.recv() => self.write_message(&msg).await?,
    // ...
}

When a peer leaves the active view, its send_tx is dropped and recv() starts returning None. The trap is that the Some(msg) = … pattern doesn't deliver that None to me. When the pattern fails to match, select! disables that branch for the rest of that select! call and keeps waiting on the remaining branches. The branch goes dark.

The only other long-lived arm, _ = &mut closed, stays pending forever, because nothing has actually closed the connection — that was supposed to happen because the send loop noticed the peer was gone. So the loop parks on a future that can never resolve. The send task hangs, the connection loop that owns it never completes, and the QUIC Connection and its driver task are stranded. One leaked connection for every peer that leaves — and the nodes that rotate peers the most leak the fastest.

The fix is to stop pattern-matching the channel close away and handle the None myself:

msg = self.send_rx.recv() => match msg {
    Some(msg) => self.write_message(&msg).await?,
    None => break, // all senders dropped -> peer gone -> tear down
},

Stuck tasks went from 122 and climbing to a bounded ~15. This is a cousin of the reconnect leak from the soak post — both are a task outliving the thing it was serving — but the mechanism is different and nastier, because the code looks like it handles shutdown. The closed arm is right there. It just never gets a chance to fire.

Two states of the same select! loop. On the left, the broken version: an active peer feeds the send_rx channel and the Some(msg) arm runs normally, the closed arm waiting in reserve. On the right, after the peer leaves: send_tx is dropped, recv() yields None, the Some(msg) pattern fails to match so select! greys out that whole arm, and the only arm left — closed — stays pending forever because nothing closed the connection. The loop is parked on a future that can never complete, stranding the QUIC connection. The fix, shown beneath, replaces the Some(msg) pattern with a plain bind plus an explicit None => break.
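The principle behind the fix, that channel closure is an explicit exit condition and not something a pattern can filter out, can be shown without an async runtime. This is a std-library analogue only: it uses std::sync::mpsc and a thread, it does not reproduce tokio's select! behavior, and spawn_send_loop is a hypothetical name.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Spawns a send loop that handles messages until every Sender is dropped.
/// The handle yields how many messages were written before teardown.
pub fn spawn_send_loop() -> (Sender<String>, JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<String>();
    let handle = thread::spawn(move || {
        let mut written = 0usize;
        loop {
            match rx.recv() {
                Ok(_msg) => written += 1, // stand-in for write_message
                Err(_) => break,          // all senders dropped -> tear down
            }
        }
        written
    });
    (tx, handle)
}
```

Because the closed case is matched by hand, dropping the last sender is enough to let the loop end and the thread be joined.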

Root causes #2 and #3, because leaks travel in packs

With connections bounded, a heap profile still grew — slower, but up and to the right. Two more, both smaller, both mine:

  • Telemetry retention. The OpenTelemetry tracing layer was floored at DEBUG. Under churn the network stack emits a debug firehose, and tracing-opentelemetry appends every captured event to the currently-active span's buffer — which is only freed when that span closes. My long-lived actor spans never close. So their event buffers grew without bound. One single 2 MB allocation in the profile turned out to be one span's event vector. The fix was a one-liner: floor the export layer at INFO.
  • Process-table enumeration. The per-node load sampler built its system handle by enumerating every process on the box, every tick — tens of thousands of transient name strings on Windows, every couple of seconds. It only ever needed our own process. The fix: don't enumerate the world; sample only our own pid.

Neither of those is exotic. Both are the kind of thing that compiles, reads fine in review, and costs you megabytes an hour in production.
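The telemetry leak is easier to believe with a toy model. The sketch below is not tracing-opentelemetry's data structure; ToySpan is a hypothetical stand-in that buffers events until it is dropped, which is the one property that matters here.

```rust
/// Toy span: buffers recorded events until it is dropped (closed).
pub struct ToySpan {
    events: Vec<String>,
}

impl ToySpan {
    pub fn open() -> Self {
        ToySpan { events: Vec::new() }
    }
    pub fn record(&mut self, event: &str) {
        self.events.push(event.to_string());
    }
    pub fn buffered(&self) -> usize {
        self.events.len()
    }
}

/// One span around the whole loop: the buffer grows with every iteration.
/// Returns the buffer size after `iterations` events.
pub fn span_around_loop(iterations: usize) -> usize {
    let mut span = ToySpan::open();
    for i in 0..iterations {
        span.record(&format!("tick {}", i));
    }
    span.buffered()
}

/// One span per unit of work: each buffer is freed when its iteration ends.
/// Returns the largest buffer seen.
pub fn span_per_iteration(iterations: usize) -> usize {
    let mut peak = 0;
    for i in 0..iterations {
        let mut span = ToySpan::open();
        span.record(&format!("tick {}", i));
        peak = peak.max(span.buffered());
    } // span dropped here, releasing its events
    peak
}
```

In the model, the long-lived span holds one entry per event forever, while the per-iteration span never holds more than one. Raising the level floor, as I did, attacks the other side of the same product: fewer events per span instead of shorter spans.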

The payoff

Measured with dhat, before and after, under the same kill-and-respawn soak:

| bucket | before | after |
| --- | --- | --- |
| total retained heap | 42.9 MB | 8.1 MB (↓81%) |
| tracing / otel | 29.2 MB | 2.0 MB |
| sysinfo | 9.1 MB | 1.6 MB |
| quic connections | growing | bounded |

A before-and-after heap breakdown as paired horizontal bars. Before: total retained heap 42.9 MB, split into a large tracing/otel band at 29.2 MB, a sysinfo band at 9.1 MB, and a quic-connections band marked "growing." After: total retained heap 8.1 MB, with tracing/otel down to 2.0 MB, sysinfo down to 1.6 MB, and quic-connections marked "bounded." The after bar is roughly a fifth the length of the before bar; the reduction is labelled 81 percent.
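The percentage in the table is plain before-and-after arithmetic, and the same helper gives the per-bucket reductions. percent_reduction is a hypothetical name; the inputs are the dhat figures from the table.

```rust
/// Percentage by which `after` is lower than `before`.
pub fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

Applied to the table: total retained heap 42.9 MB → 8.1 MB is about 81%, tracing / otel 29.2 MB → 2.0 MB is about 93%, and sysinfo 9.1 MB → 1.6 MB is about 82%.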

And the thing that actually matters: RSS troughs under sustained chaos went from a monotonic climb to a flat plateau, on Windows and Linux both. Over a 35-minute soak, an observer node's RSS at 140 chaos events dropped from 0.228 GB to 0.115 GB, about 50% lower, and dhat put the QUIC connection bucket at 18.7 MB → 3.85 MB.

Where the bugs actually lived

Here's the part I had wrong going in. I assumed — the way you always do — that the bug was in my code, not the library. The two small ones were: the telemetry floor and the process-table sampler were my config, one-line fixes. But the connection leak itself, the dominant one, was in the stack, and tracing it produced a cluster of fixes I submitted upstream to three crates:

  • The gossip layer got the most. The select! SendLoop footgun above; making connection_loop exit (it ran send and receive under join!, which waits for both — but the receive half blocks forever on accept_uni() when a peer leaves locally, so I moved it to select!); and pruning the per-peer state that outlived removed peers — peer_topics, peer_data, the lazy_push_queue on NeighborDown. Plus regression tests so the leak can't creep back.
  • iroh itself had a per-remote address cache (AddrMap behind mapped_addrs) with no eviction path — it grew once per remote ever seen under churn. The clean-shutdown path now evicts the departing remote's cached addresses.
  • The QUIC layer underneath leaked a whole connection task, packet spaces, and channels whenever a Connecting was dropped before its handshake finished — and before the handshake there's no idle timeout to eventually reap it. The fix was an impl Drop for Connecting that drains and releases.
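The third fix follows a pattern that is easy to model: state registered when a handshake starts has to be released if the handshake object is dropped early. The sketch below is a toy, not the QUIC crate's code. ToyConnecting and the Pending table are hypothetical, and a HashSet of ids stands in for the connection task, packet spaces, and channels.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

/// Shared table of in-flight handshakes (stand-in for endpoint-side state).
pub type Pending = Rc<RefCell<HashSet<u64>>>;

/// Toy handshake-in-progress: registers itself on creation.
pub struct ToyConnecting {
    id: u64,
    pending: Pending,
    completed: bool,
}

impl ToyConnecting {
    pub fn start(id: u64, pending: &Pending) -> Self {
        pending.borrow_mut().insert(id);
        ToyConnecting { id, pending: Rc::clone(pending), completed: false }
    }

    /// Handshake finished: the state now belongs to the live connection.
    pub fn complete(mut self) -> u64 {
        self.completed = true;
        self.pending.borrow_mut().remove(&self.id);
        self.id
    }
}

impl Drop for ToyConnecting {
    fn drop(&mut self) {
        if !self.completed {
            // Abandoned before the handshake finished. Release the state
            // here, because nothing else will reap it later.
            self.pending.borrow_mut().remove(&self.id);
        }
    }
}
```

Delete the Drop impl and every abandoned ToyConnecting leaves its id in the table forever, which is the toy version of the leak.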

Two days of soak-and-profile to find them; the diffs themselves are tiny. That's the usual ratio for a leak that only shows up under sustained churn — the finding is the work, the fix is a few lines.

A thank-you to the people who built this

I want to stop and say this plainly, because it's easy to skip past: I got to find these at all only because the whole stack is open. I'm building on iroh and iroh-gossip — peer-to-peer QUIC, NAT traversal, hole-punching, a relay tier, Plumtree/HyParView gossip — none of which I could have written myself in a reasonable lifetime. It's built by the team at n0 (github.com/n0-computer), and the quality of it is the reason my "substrate" is a few hundred lines instead of a few hundred thousand.

And here's the part that still feels lucky every time: when I did hit real bugs deep in that stack, I could read the exact code, instrument it, prove the fault, and send a fix back — and there's a real, responsive community on the other end to receive it. That's not how it goes with a closed black box, where the best you can do is file a ticket into the void and build a workaround. The n0 folks have done years of genuinely hard systems work — the kind where a single select! arm or a missing Drop is the difference between flat and climbing memory — and they gave it away so the rest of us can stand on it. An enormous high-five to that whole team. We are extraordinarily fortunate to have makers like this, working in the open, on infrastructure this good. Thank you.

What I'd tell a team

  • Instrument before you theorize. A spawned-versus-finished counter found in minutes what hours of reading the reap path missed. If a resource leaks, count the thing being created and the thing being destroyed before you reason about why.
  • Some(x) = expr in a select! arm is a footgun whenever expr can legitimately yield None. The failed match disables the branch instead of surfacing the close. Bind the value plainly and match it yourself.
  • Leaks travel in packs. Fixing the dominant one just unmasks the next. Profile again after every fix — the flat line you were hoping for is usually one more leak away.
  • Telemetry is not free. Events recorded inside a never-closing span live exactly as long as the span. A long-lived actor span at DEBUG is an unbounded buffer wearing a tracing label.
  • Sometimes it is the library — and that's a contribution, not a complaint. I went in assuming the bug was mine, because it usually is. This time the dominant leak was in the stack itself, and the right ending wasn't a workaround in my code — it was a handful of small fixes submitted upstream so nobody else hits it. A leak found under your churn is worth fixing at the source.

What's next

The substrate is finally flat under churn — for real this time, measured, on both platforms. The fixes are upstreamed and the foundation holds. Which means I can stop poking at the substrate's memory behavior and turn back to the thing this notebook is actually about: making a multi-mesh fabric observable and correct, one sprint at a time.

A single node emitting a vertical chain of short telemetry spans as it boots, beside a second cluster of nodes, with a dashed cross-mesh link reaching the second cluster directly, with no separate bridge node in between. Cream background, faint dot grid, vertical rust-orange accent bar at the left edge.
Part 08 of 08
Jun 02, 2026

Watching a node boot, then a second mesh with no bridge

The substrate gets a second life as a multi-mesh fabric, and the rule from day one is that telemetry is the substrate, not a feature. A node that does work without leaving a trace is a bug. Here's a node naming itself, booting as a span chain you can read in Jaeger, and building a topology directory out of gossip. Then a second mesh is reached with no bridge node at all. Every claim is a screenshot of the live system.

The performance work earlier in this notebook left me with a substrate that idles when it has nothing to do and stays flat under churn. That was the foundation. This post starts the thing it was a foundation for: a small, observable multi-mesh fabric — named nodes in named meshes, a gossiped directory so any node can find any other, and not one line of bespoke infrastructure I could get from the library instead.

The rule I set on day one and never relaxed: telemetry is the substrate, not a feature. A node that does work without leaving a trace is a bug. So before there was any "product," there was a boot-span chain and a Jaeger instance to read it in.

A node names itself

When you spawn a broker into mesh1, it doesn't get a random hex id and it isn't named by some central authority. It loads (or mints) its own identity, then self-names from it:

node_name = <mesh>.<type>.<first-6-hex-of-node-id>     e.g.  mesh1.broker.239dc9

The mesh is required (an unlabelled node fails fast), the type is what it is, and the suffix is the first 6 hex characters of the node's public key, so the friendly name is eyeball-matchable to its unique id. No registry hands out names; identity is the name.
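The naming rule is small enough to sketch directly. This is an illustration only: node_name is a hypothetical helper, and it assumes the node id is available as the 32 raw bytes of the public key, rendered as lowercase hex.

```rust
/// Hypothetical helper: builds `<mesh>.<type>.<first-6-hex-of-node-id>`.
/// Assumes the node id is the 32-byte public key rendered as lowercase hex.
fn node_name(mesh: &str, node_type: &str, public_key: &[u8; 32]) -> Result<String, String> {
    // The mesh label is required: an unlabelled node fails fast.
    if mesh.is_empty() {
        return Err("mesh label is required".to_string());
    }
    // Three bytes of the key give the six hex characters of the suffix.
    let suffix: String = public_key[..3].iter().map(|b| format!("{b:02x}")).collect();
    Ok(format!("{mesh}.{node_type}.{suffix}"))
}

fn main() {
    // A made-up key whose first three bytes match the example name above.
    let mut key = [0u8; 32];
    key[..3].copy_from_slice(&[0x23, 0x9d, 0xc9]);

    println!("{}", node_name("mesh1", "broker", &key).unwrap()); // mesh1.broker.239dc9
    println!("{:?}", node_name("", "broker", &key)); // Err("mesh label is required")
}
```

Nothing in the function consults a registry: the inputs are the mesh label, the node type, and the key the node already holds.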

The boot is a span chain

Every boot is one trace rooted at node.ready, with the bring-up steps as children — endpoint_created → alpn_registered → gossip_started → accept_loop_started, each a few hundred microseconds, all under one root. If a node is misbehaving, the trace tells you exactly how far it got before it stalled.

A boot trace in Jaeger: a single trace named broker rafka.mesh.node.ready spanning about 1.8 ms, with four child spans nested under it in sequence — endpoint_created, alpn_registered, gossip_started, accept_loop_started — each a few hundred microseconds. The waterfall makes the bring-up order and timing legible at a glance.

This is the spine. Get the boot chain visible for one node and you can always answer "did it start, and how far did it get" without attaching a debugger.
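To make the "how far did it get" question concrete, here is a dependency-free stand-in for the boot chain. It is not the tracing code the node actually runs: BootTrace, step, and boot are hypothetical names, the steps do no real work, and only completed steps are recorded (a real span would also capture the failing step). The step names are the ones listed above.

```rust
use std::time::{Duration, Instant};

/// Bring-up steps in order, as named in the post.
const BOOT_STEPS: [&str; 4] = [
    "endpoint_created",
    "alpn_registered",
    "gossip_started",
    "accept_loop_started",
];

/// Hypothetical stand-in for one boot trace: a root plus timed child steps.
struct BootTrace {
    root: &'static str,
    started: Instant,
    children: Vec<(&'static str, Duration)>,
}

impl BootTrace {
    fn new(root: &'static str) -> Self {
        BootTrace { root, started: Instant::now(), children: Vec::new() }
    }

    /// Run one bring-up step; record it as a child only if it completes.
    fn step<E>(
        &mut self,
        name: &'static str,
        work: impl FnOnce() -> Result<(), E>,
    ) -> Result<(), E> {
        let began = Instant::now();
        work()?;
        self.children.push((name, began.elapsed()));
        Ok(())
    }

    /// "How far did it get": the last step that finished, if any.
    fn last_completed(&self) -> Option<&'static str> {
        self.children.last().map(|(name, _)| *name)
    }
}

/// Simulated boot; `fail_at` names the step that stalls, if any.
fn boot(fail_at: Option<&str>) -> BootTrace {
    let mut trace = BootTrace::new("node.ready");
    for name in BOOT_STEPS {
        let outcome = trace.step(name, || {
            if fail_at == Some(name) { Err("stalled") } else { Ok(()) }
        });
        if outcome.is_err() {
            break; // later steps never run, so they never appear in the trace
        }
    }
    trace
}

fn main() {
    let trace = boot(Some("gossip_started"));
    println!("{} after {:?}", trace.root, trace.started.elapsed());
    for (name, took) in &trace.children {
        println!("  {name} {took:?}");
    }
    println!("last completed: {:?}", trace.last_completed());
}
```

The useful property is the same one the Jaeger waterfall gives: a stalled boot shows up as a chain that simply stops short.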

A topology cache, built from gossip

Each node broadcasts a small digest on its mesh's gossip topic, including its reachable address. Every node accumulates those into a process-local directory — name → {mesh, type, location} — surfaced as a Cache view. This is the thing a node consults to answer "where do I send to reach X?" There's no second gossip system and no database: the cache is the gossip digests, surfaced.

The admin console's Cache tab: a gossiped topology directory listing each node as name, mesh, type, location, and node-id prefix. Two entries are shown for a single-mesh bring-up — the admin console itself and mesh1.broker1 at a real loopback address — with a note that it updated 0.0s ago, live.
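A minimal sketch of that directory, assuming a digest carries a name, mesh, type, and location as plain strings. Digest, Entry, and Directory are hypothetical shapes; the real digest format and the addresses below are not taken from the source.

```rust
use std::collections::HashMap;

/// Hypothetical digest shape: what one node broadcasts about itself.
struct Digest {
    name: String,
    mesh: String,
    node_type: String,
    location: String,
}

/// One row of the directory: name -> {mesh, type, location}.
struct Entry {
    mesh: String,
    node_type: String,
    location: String,
}

/// Process-local directory, filled only from received digests.
#[derive(Default)]
struct Directory {
    entries: HashMap<String, Entry>,
}

impl Directory {
    /// Fold one digest in; a later digest for the same name replaces the earlier row.
    fn observe(&mut self, digest: Digest) {
        self.entries.insert(
            digest.name,
            Entry {
                mesh: digest.mesh,
                node_type: digest.node_type,
                location: digest.location,
            },
        );
    }

    /// "Where do I send to reach X?"
    fn locate(&self, name: &str) -> Option<&str> {
        self.entries.get(name).map(|entry| entry.location.as_str())
    }
}

fn digest(name: &str, mesh: &str, node_type: &str, location: &str) -> Digest {
    Digest {
        name: name.to_string(),
        mesh: mesh.to_string(),
        node_type: node_type.to_string(),
        location: location.to_string(),
    }
}

fn main() {
    let mut directory = Directory::default();
    // Made-up loopback addresses, for illustration only.
    directory.observe(digest("mesh1.broker.239dc9", "mesh1", "broker", "127.0.0.1:4001"));
    directory.observe(digest("mesh1.broker.239dc9", "mesh1", "broker", "127.0.0.1:4101"));

    for (name, entry) in &directory.entries {
        println!("{name}: {} {} {}", entry.mesh, entry.node_type, entry.location);
    }
    println!("{:?}", directory.locate("mesh1.broker.239dc9")); // Some("127.0.0.1:4101")
    println!("{:?}", directory.locate("mesh2.broker.000000")); // None
}
```

Because the map is just the latest digest per name, there is no separate store to keep consistent: a node that moves simply overwrites its own row on the next tick.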

A second mesh — and deleting the bridge

The earlier design had a bridge: a special node that joined two meshes' gossip and shuttled awareness between them. It's the obvious first idea, and it's the wrong one: it's a bespoke piece of mesh infrastructure, and the substrate's whole premise is "don't build mesh infrastructure, use the library." So the bridge had to go: out went the bridge node type, its spawn button, its env knobs, the whole crate. Two meshes now run side by side with no node whose only job is to connect them.

The topology view with two meshes side by side — mesh1 and mesh2, each its own swim-lane of colored node cards (broker, gateway, registry, compute, console) — and edges drawn directly between the nodes that actually talk. There is nothing in the middle: no bridge node, just two meshes and direct cross-mesh links.

Per-mesh gossip lives on a per-mesh topic (blake3(mesh_id)), so a mesh1 node doesn't see mesh2 by default. The interim answer: the gateway — the node that writes across meshes — also subscribes to the other mesh's gossip, so its directory spans both. A simulated writer on mesh1.gateway resolves mesh2.broker from that directory and sends to it. The proof isn't a line in a log; it's the cross-mesh trace stitching end to end:

A cross-mesh produce trace in Jaeger: a broker boot-span chain rooted at rafka.mesh.node.ready with identity_loaded, endpoint_created, alpn_registered, gossip_started, and accept_loop_started children, captured on the second mesh — proof that the node on the far mesh booted and is emitting telemetry into the same trace backend.
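The subscription rule behind that cross-mesh send can be sketched without any networking. This is an assumption-laden model, not the fabric's code: Node, deliver, and topic_for are hypothetical, the address is made up, and std's DefaultHasher stands in for the blake3(mesh_id) topic derivation so the example needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TopicId(u64);

/// Stand-in for blake3(mesh_id); std's hasher keeps the sketch dependency-free.
fn topic_for(mesh_id: &str) -> TopicId {
    let mut hasher = DefaultHasher::new();
    mesh_id.hash(&mut hasher);
    TopicId(hasher.finish())
}

/// Hypothetical node: a set of subscribed topics and a name -> location directory.
struct Node {
    name: String,
    subscriptions: HashSet<TopicId>,
    directory: HashMap<String, String>,
}

impl Node {
    fn new(name: &str, meshes: &[&str]) -> Self {
        Node {
            name: name.to_string(),
            subscriptions: meshes.iter().map(|mesh| topic_for(mesh)).collect(),
            directory: HashMap::new(),
        }
    }

    /// A digest published on `topic` reaches only nodes subscribed to that topic.
    fn deliver(&mut self, topic: TopicId, name: &str, location: &str) {
        if self.subscriptions.contains(&topic) {
            self.directory.insert(name.to_string(), location.to_string());
        }
    }
}

fn main() {
    // The gateway subscribes to both meshes; the broker only to its own.
    let mut gateway = Node::new("mesh1.gateway", &["mesh1", "mesh2"]);
    let mut broker = Node::new("mesh1.broker", &["mesh1"]);

    let mesh2_topic = topic_for("mesh2");
    for node in [&mut gateway, &mut broker] {
        node.deliver(mesh2_topic, "mesh2.broker", "127.0.0.1:4002");
    }

    for node in [&gateway, &broker] {
        println!(
            "{} resolves mesh2.broker to {:?}",
            node.name,
            node.directory.get("mesh2.broker")
        );
    }
}
```

Only the gateway ends up with a row for mesh2.broker, which is why the simulated writer has to live there.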

The catch (named, not hidden)

"The gateway subscribes to the other mesh's entire gossip" works for two meshes on one host. It does not scale: with many meshes and hundreds of gateways, every gateway subscribes to every other mesh, so the load is an O(meshes²) firehose of every remote node's per-tick digest, and it quietly forces the console to special-case itself as an all-mesh subscriber. That's a real debt, and the next post pays it down with a proper control-plane backbone, after first fixing something more embarrassing: the traces themselves were lying.
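A back-of-envelope version of that growth, under assumptions that are mine rather than measured: every mesh is the same size, every node emits one digest per tick, and every gateway subscribes to every other mesh's full gossip. The function name and the figure of five nodes per mesh are illustrative.

```rust
/// Cross-mesh digests received per tick across the whole fabric.
/// Assumes equal-sized meshes, one digest per node per tick, and every
/// gateway subscribed to every other mesh's full gossip.
fn cross_mesh_digests_per_tick(meshes: u64, nodes_per_mesh: u64, gateways_per_mesh: u64) -> u64 {
    let gateways = meshes * gateways_per_mesh;
    // Each gateway hears every node in every mesh except its own.
    let remote_nodes = meshes.saturating_sub(1) * nodes_per_mesh;
    gateways * remote_nodes
}

fn main() {
    // Five nodes and one gateway per mesh is an illustrative assumption.
    for meshes in [2u64, 20, 200] {
        println!(
            "{meshes} meshes -> {} cross-mesh digests per tick",
            cross_mesh_digests_per_tick(meshes, 5, 1)
        );
    }
}
```

Under those assumptions, going from 2 to 20 meshes takes the count from 10 to 1,900, and 200 meshes reaches 199,000: ten times the meshes costs roughly a hundred times the traffic.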

What I'd tell a team

  • Make telemetry the floor, not the polish. If "does this node work" can only be answered by reading its logs by hand, you'll be reading logs by hand forever. A boot-span chain is cheap, and it's the thing you'll lean on in every debugging session afterward.
  • Let identity be the name. Self-naming from the public key means no allocator and no central authority to be down, and it gives you a friendly id you can still match to the real one by eye. Name collisions are unlikely rather than impossible: the 6-hex suffix carries 24 bits of the key.
  • Refuse the bespoke node. A "bridge" felt necessary and wasn't. Every special node type is infrastructure you now own and operate; reach for the library's primitive before you invent one.