Building A Distributed Mesh in Rust
I wanted to feel where a self-organizing P2P mesh in Rust breaks before I trusted it with anything. iroh-gossip over QUIC, run small and run hard. The first 18 nodes pegged an 80-core box at 100% CPU; weeks of flamegraphs and soak tests later the same 18 idled at 5%. This notebook is the work as it happened — the wrong hypotheses, the flamegraphs, the leaks (some mine, some upstream — a churn leak I chased for two days turned into fixes across three crates), and then the harder build: a multi-mesh fabric you can actually see.
I wanted to know what it actually costs to run a self-organizing peer-to-peer mesh on commodity hardware — where it strains, where it leaks, and whether I could trust it under sustained churn. The build is Rust: an iroh-gossip layer over QUIC, nodes addressed by public key, NAT traversal handled by the fabric. I built it small — a handful of node types on one box — and ran it hard enough to break.
It broke. The first canary pegged an 80-core box at 100% CPU with 18 nodes sitting idle. Weeks of measuring, theorizing, and being wrong got it down to 5%. Most of the bugs were mine — a tokio::spawn whose task nobody owned, a DashMap that grew forever because nothing was told to prune it. But the one I chased hardest, a connection leak under sustained churn, turned out to live in the stack itself: a select! arm in the gossip layer that silently disabled itself, an address cache with no eviction, a QUIC handshake that leaked when abandoned. Two days of soak-and-profile, and the ending wasn't a workaround — it was a cluster of small fixes submitted upstream to three crates.
This notebook is the work in order, as it happened — the first canary, the flamegraph that found a self-inflicted handshake storm, four days of soak that surfaced five leaks at once, a chaos battery that replaced "tests pass," open-sourcing the result, the two-day churn-leak hunt that went upstream — and then the harder build: turning it into a multi-mesh fabric you can actually operate. Named nodes in named meshes, a gossiped topology cache, cross-mesh delivery with no bespoke bridge, a relay that carries traffic when there's no direct path, and traces honest enough to trust. Every claim in those later posts is backed by a screenshot of the live system.
Why I'm building a distributed mesh substrate in Rust
Before I build anything on top of a peer-to-peer substrate, I need to know whether the substrate itself is sound. The choice is iroh-gossip over QUIC. The first canary is 18 nodes on one box. Here's what I'm trying to learn and what I expect to break.
What I'm ultimately after is an event-streaming system whose nodes can sit on different networks — different cloud regions, different colos, a laptop on a coffee-shop Wi-Fi — and form a working cluster anyway. Direct-TCP-on-a-VPC isn't enough for that. The transport layer has to handle NAT traversal, identity, mesh formation, and reconnect for me. But the streaming layer is a problem for later; first I need the substrate underneath it to be sound, and this notebook is about the substrate.
I picked iroh for the substrate. QUIC under the hood, relay tier for cross-NAT, hole-punching where it can. The application layer on top is iroh-gossip — Plumtree for eager broadcast, HyParView for membership. The combination gives me a self-organizing peer set without writing any of it myself.
The first thing I'm going to do is run 18 of these on one machine and watch what happens.
Why iroh, not direct TCP
The conventional design does the simplest thing that works: direct TCP between nodes, a metadata store like ZooKeeper or etcd, every node knowing every other by hostname. That works because the nodes live in a single network where hostnames resolve and ports are open.
The deployments I want to support don't look like that. A node behind a residential NAT. A compute box in someone's homelab. A gateway in AWS. They can't reach each other on direct TCP. They CAN reach each other via QUIC + a relay, with hole-punching closing the hop where possible. That's exactly what iroh does, and I'd rather use a maintained library than build it.
What iroh buys me concretely:
- Identity from a keypair, not a hostname. Every node has an Ed25519 keypair; the "address" is the public key. The same key from any IP works.
- NAT traversal. STUN-style probing, hole-punching, relay fallback. I don't have to think about it.
- QUIC transport. Multiplexed streams over one connection, 0-RTT reconnect, no head-of-line blocking. Better defaults than tuning TCP.
- Discovery primitives. mDNS for LAN, DHT for internet-scale. Optional and pluggable.
The cost: more bytes per packet than raw TCP (TLS 1.3 framing + QUIC headers + congestion-controller state per connection). For a substrate that's going to broadcast small heartbeats, that's the right trade.
What iroh-gossip is for, and what it isn't
iroh-gossip runs Plumtree + HyParView on top of iroh's QUIC. Plumtree forms a spanning tree across subscribed peers for eager broadcast; HyParView keeps the per-node active connection set small and roughly constant (~5–7 peers regardless of cluster size). That's how Bitcoin scales to 50,000 nodes without every node connecting to every other.
The thing I want to be careful about is what I broadcast. iroh-gossip is a control-plane primitive — it's designed for "the cluster's membership just changed" or "a new topic appeared," not "here is my full state, every two seconds, forever." Bitcoin doesn't broadcast every node's status every 2 seconds. It broadcasts transactions when they arrive. Big difference.
I have a hunch this is where I'm going to get bit. I'm setting up the gossip emit to fire on a 500ms timer with a GossipDigest that includes peer counts, frame counts, CPU/RAM. That's a lot of state on a fast clock. We'll see.
The architecture, briefly
Three layers, with deliberate separation:
| Layer | Crate | Responsibility |
|---|---|---|
| Transport | crates/mesh-transport | iroh Endpoint setup, ALPN, bind addr, mDNS toggle |
| Substrate | crates/mesh-node-base | Identity, peer registry, gossip emit loop, LoadSampler (self-reported CPU/RAM via sysinfo), staleness handling |
| Telemetry | crates/mesh-telemetry | OTLP/tracing init, every node's spans land in Jaeger |
On top, five example node types — broker, gateway, compute, registry, bridge. From the substrate's perspective they're interchangeable; the type is just a string. Each one is a 10-line main.rs that calls NodeRuntime::new("type").run().await and supplies a .env.dev preset for its CPU/RAM budget.
There's also an admin-ui — a React + Vite dashboard that joins the mesh as a passive observer and renders the topology live. Hub-vs-leaf is visible in the layout. Each node card shows its CPU/RAM utilization against its declared budget. That's the surface I'll use to feel what the substrate is doing.
Telemetry is built in from boot. The admin-ui has a Boot Waterfall view that decomposes the spans every node fires on startup — endpoint creation, ALPN registration, gossip subscribe, accept loop. The shape of a healthy boot is five short spans nested under a mesh.node.ready root, sub-millisecond each on this machine. When something goes wrong at boot, you see which span stretched.

What I expect to break
I'm writing this down on purpose so I can be honest about the hypotheses going in:
- Gossip volume. At 18 nodes broadcasting every 500ms, that's 36 broadcasts/sec. Plumtree fans each one out across the spanning tree. I expect to find that 500ms is too aggressive for steady-state health.
- OTLP overhead. Every span we emit becomes a protobuf-encoded gRPC frame to Jaeger. If I'm not careful about what fires at what level, the telemetry will cost more than the work it's measuring.
- Connection accounting. Peers will reconnect over time. I expect there's a bookkeeping bug somewhere — registries that grow without pruning, connections that close without their tasks knowing. I haven't found it yet.
What I don't expect — and would be surprised by — is iroh itself being expensive in some fundamental way. The library is maintained by people who deal with this for a living.
What I'd tell someone starting
- Pick the substrate first. The choice of transport (raw TCP, gRPC, iroh, libp2p, …) determines everything else. Don't pick the wire format before you've picked the network.
- Don't broadcast state on a clock unless you've measured what it costs. Most "heartbeat" patterns assume small clusters, small payloads, slow cadences. Two seconds with a 200-byte digest at 18 nodes is already 36 broadcasts/sec.
- Build the dashboard first. Or at least the topology view. You're going to be looking at this thing constantly while you debug it, and a
print!doesn't compose into a mesh layout.
What's next
Tomorrow I bootstrap 18 nodes on my workstation and see what they do at idle. The plan: spawn one of each type, then 17 more, let them gossip for a minute, look at the numbers. The next post in this notebook is what those numbers were and what I did about them.
When 18 nodes pegged my 80-core box at 100%
First bootstrap of 18 mesh nodes on an 80-logical-core workstation. Host CPU pegged at 100% the moment the bootstrap finished. Three obvious things to try first — mDNS off, gossip interval up, per-frame INFO spans down — got it from 100% to 35%. Still wrong. The real bug was somewhere else.
Bootstrapped the cluster. 18 nodes — two of each role per mesh, across two meshes, with two bridges. Hit Bootstrap at 14:51 local. Host CPU at 14:52 was 100% across all 80 logical cores. Five samples a second apart, all 100%.
The mesh worked. /api/topology came back with 19/19 live (the admin-ui plus 18 children). Each node's GossipDigest was arriving. The Topology view in the dashboard painted. Bytes were moving. Nothing was crashing.
It was just consuming the entire machine to do that.
What the first reading said
I'd expected ~10% host load. The two Xeon Gold 6148s in this box are not a small machine — 80 logical processors, hyperthreaded across two sockets. A handful of small Rust processes broadcasting 200-byte digests every 500ms should not be pegging the entire system. My mental model of "iroh-gossip at 18 nodes" was Bitcoin-territory: single-digit % per node.
The actual reading per-process via Task Manager:
| Node | CPU % | What I expected |
|---|---|---|
mesh-broker.exe × 4 | 0.66–1.52 each | ~0.05 |
mesh-gateway.exe × 4 | 0.72–1.32 each | ~0.05 |
mesh-compute.exe × 4 | 0.67–0.86 each | ~0.05 |
mesh-registry.exe × 4 | 0.63–1.80 each | ~0.05 |
mesh-bridge.exe × 2 | 0.82–0.83 each | ~0.05 |
mesh-admin-ui.exe × 1 | 3.44 | ~1.0 |
Add the column up: ~18% of the 80-logical box, before counting iroh-quinn's kernel-side work. Performance Monitor's \Processor(_Total)\% Processor Time showed 95–100% sustained. Something else was eating the headroom.

The shape of the cost — every process burning roughly the same amount regardless of how many peers it had — pointed at per-tick work rather than per-peer work. So I started with three suspects that fire on a clock.
Three obvious things to try
1. Turn off mDNS
iroh-mdns was discovering every node on the local network and adding it to the active peer set. On a single box, that means 17 mDNS-announced peers per node — each one getting a QUIC handshake, each one getting added to HyParView's active view (which should be ~5, not 17). I had a hunch that the cluster was forming a full mesh rather than a sparse spanning tree.
MESH_MDNS_ENABLE=false, with explicit seed nodes injected at spawn time (admin-ui as the universal seed, plus 1–2 already-spawned same-mesh peers). Each child boots, dials its seeds, and lets HyParView shape the rest from there.
2. Slow the gossip cadence
MESH_GOSSIP_INTERVAL_MS was 500. At 18 nodes that's 36 broadcasts/sec across the cluster. For a control-plane heartbeat, that's overkill — Kafka's KRaft heartbeats every 1–3 seconds and considers a broker dead after 9 missed cycles. There's no reason substrate health needs 2 Hz granularity.
Bumped to 2000ms. 9 broadcasts/sec cluster-wide. Plumtree's IHAVE retransmits handle any actual loss.
3. Demote per-frame INFO spans to TRACE
This one I caught by reading the stdout. Every received gossip digest fired a tracing::info_span!("mesh.gossip.received", ...). With Plumtree's eager-push fanout, each digest arrives at a node ~17 times (once per peer in the active view, before lazy IHAVE deduplicates). At 36 broadcasts/sec × 17 fanout × N nodes, the cluster was producing ~2,600 INFO-level events per second.
Each event goes through tracing-subscriber's formatter (stdout write), then tracing-opentelemetry's layer (build a protobuf span, push to the OTLP batch queue, eventually send to Jaeger over gRPC). That's not free.
INFO is for state transitions — peer connected, peer disconnected, gossip topic subscribed. Per-frame events should be TRACE so they're filtered before any of that fires.
// before
tracing::info_span!("mesh.gossip.received", ...)
.in_scope(|| info!(...));
// after
tracing::trace_span!("mesh.gossip.received", ...)
.in_scope(|| tracing::trace!(...));
The reading after
Host CPU dropped from 100% to ~35%. Per-node CPU dropped from 0.5–1.5 cores down to roughly 0.25 cores. The log volume cratered — admin-ui's stdout went from 261,000 lines in 50 seconds to about 41,000.
That's a real improvement. It's also still wrong. A self-organizing P2P mesh of 18 idle nodes should not be using a quarter of a logical core per node. Bitcoin nodes idle at single-digit percent of one core, not 25% of one.
The hypothesis I started forming: "iroh-gossip itself is just expensive at this scale, the architecture is wrong, we should pivot to a centralized Controller." I spent two days seriously sketching the Controller architecture — a single coordinator, every node holds one connection to it, hub-and-spoke at the protocol layer. It would have worked. It would have been a much bigger change.
Before I committed to it, I decided to flamegraph the running cluster first.
That's the next post.
What I'd tell a team
- Symptoms that look like "X is fundamentally too expensive" usually aren't. They're usually "you're doing X on a faster clock than you measured." Slow the clock before you blame the protocol.
- Comment the cadence on every periodic loop. The 500ms gossip interval was a placeholder I never revisited. A code comment claiming "// every 500ms is fine, gossip is cheap" would have been a lie regardless of intent. Better: don't claim it's fine, claim what you measured.
- INFO is for state transitions. TRACE is for per-frame events. DEBUG is for the boundary between those two — "useful when investigating, noisy in steady state." If you can't draw the line cleanly, your spans are doing too much.
What's next
The 35% reading was the trap. It looked like I was making progress, and it was real progress — but it convinced me the remaining cost was structural rather than a bug. The next post is the flamegraph that showed me how wrong I was, and the one-line fix that took the cluster from 35% to 5%.
Flamegraphing your way out of "this can't possibly be right"
Two days into sketching a centralized-controller rewrite, I took a flamegraph instead. The hottest function in the mesh wasn't anything in the gossip protocol. It was an idempotent peer-join call I was making 10 times a second per peer — generating 3,240 QUIC handshakes per second across the cluster, doing exactly nothing useful.
I was two days into sketching a Controller architecture for the mesh. The reasoning went: 18 nodes at 35% host CPU after the obvious wins meant iroh-gossip itself was just expensive, the architecture was wrong, and the right move was a hub-and-spoke pattern with a coordinator that every node holds one connection to. A central sequencer instead of a peer-to-peer swarm. The model that's supposed to be the safe choice.
Halfway through the spike I noticed I hadn't actually profiled anything. The case for the Controller rewrite was built on a measurement gap, not a measurement. I closed the editor, added tracing-flame = "0.2.0" to the workspace, captured 15 seconds of flame data, and rendered it.
The graph was lopsided in a way that didn't fit my model.
What the flamegraph said
The dominant function in every node's CPU profile was iroh::endpoint::connect. Not Plumtree dissemination. Not packet decode. Not OTLP export. The thing every node was spending most of its time on was opening QUIC connections — at thousands of calls per second, on an 18-node cluster where nothing was disconnecting.
That should not be happening at all.
I went looking for who was calling endpoint.connect() in the hot path. The chain led to a sender.join_peers(peer_ids) call inside run_gossip's 100ms tick loop. There was a comment right next to it claiming:
// Feed mdns-discovered peers to gossip so the swarm forms. // join_peers is idempotent — calling every tick with the current peer // registry is cheap.
The comment had two correct words and one fatal one. "Idempotent" was true — calling join_peers with the same set of peers leaves the gossip state unchanged. "Cheap" was a lie. Idempotent in outcome says nothing about cost: under the hood, iroh interprets each call as "establish a transport connection to each of these peers," which kicks off a fresh TLS 1.3 handshake on top of QUIC's 1-RTT setup. Every time. For every peer. On a 100ms timer.
At 18 nodes with ~5 peers each in the active view: 18 × 5 × 10/sec = 900 handshake attempts per second per node, 3,240 cluster-wide. All landing on already-established connections that didn't need re-establishing. All doing exactly nothing useful for the gossip protocol.
The fix
A HashSet<String> of peers I've already told gossip about. Only call join_peers for entries that aren't in the set.
let mut joined_peers = HashSet::new();
loop {
tokio::select! {
_ = tick.tick() => {
let mut new_peers = Vec::new();
for peer in registry.iter() {
if !joined_peers.contains(peer.key()) {
if let Ok(id) = iroh::EndpointId::from_str(peer.key()) {
new_peers.push(id);
joined_peers.insert(peer.key().clone());
}
}
}
if !new_peers.is_empty() {
let _ = sender.join_peers(new_peers).await;
}
// …rest of gossip emit loop
}
}
}
One commit. Rebuild. Re-canary.
Host CPU dropped from 35% to 6%. Per-node steady-state dropped to 0.05–0.10 cores. The mesh kept gossiping, the topology view kept painting, 19/19 stayed live. The Controller rewrite I'd been sketching became unnecessary in the time it took the canary to settle.

Four more bugs in the same pass
While I was in there, the flamegraph and the diff surfaced four more:
Ghost tasks under #[instrument] on infinite loops. Two background loops (run_ping_sender, watch_mdns) were decorated with #[instrument(skip_all)]. On a normal short-lived async function, that creates a span that closes when the function returns. On an infinite loop, the root span never closes, and every event inside it gets appended to the span's child-event queue forever. OpenTelemetry's batch processor walks that queue on every export tick — over time, the walk is the cost. Removed the #[instrument] macros from the loop functions. Their inner spans still exist.
A busy-wait on a closed channel. Inside another tokio::select!, when the gossip event channel closed, the arm did continue — which made the task immediately re-poll, which immediately yielded None again, on a tight loop with no yield point. Changed to break. The task exits cleanly on channel close instead of spinning at 100% of one core.
Ghost QUIC connections on peer reconnect. When a peer disconnects and reconnects with the same identity, reg.insert(peer_id, new_conn) overwrites the registry entry without closing the previous Connection. iroh-quinn allows multiple parallel connections from the same identity, so the old one just sits there alive. The run_frame_reader and run_bi_echo_reader tasks holding the old Connection keep parking on accept_uni() / accept_bi() forever. Over hours of churn, dozens of dead connections accumulate per node. Fix: close the old connection before overwriting.
if let Some((_, old_conn)) = reg.remove(&peer_id) {
old_conn.close(0u32.into(), b"superseded by new connection");
}
reg.insert(peer_id.clone(), conn.clone());
Stop-removing-from-sets. joined_peers never shrank. mesh_id_registry only inserted, never removed on disconnect. Both were minor compared to the join_peers storm, but both are real leaks over a long-running cluster. Added joined_peers.retain(|p| registry.contains_key(p)) on each tick, and mesh_id_registry.remove(&peer_id_str) next to the existing registry.remove in the disconnect handler.
What I'd tell a team
- Comments lie. Flamegraphs don't. Every CPU bug I found had a comment next to it claiming the code was cheap. The comment described intent, not cost. If you're inheriting a codebase, treat every "this is cheap" comment as a hypothesis to verify.
- Don't blame the protocol before you've measured. I came within a day of forking the architecture to a Controller pattern. The reasoning was internally consistent: gossip-fanout costs grow with peer count, we have a peer-count problem, therefore replace gossip. The flamegraph showed it had nothing to do with peer count — it was a 100ms tick doing the same expensive thing repeatedly. The architecture wasn't the bug.
#[instrument]on an infinite loop is almost always wrong. Spans are scoped to function lifetime. An infinite loop's span lives forever. Use#[instrument]on the work inside the loop, not on the loop itself.- Idempotent ≠ cheap. "Calling this multiple times has the same effect" says nothing about the cost of the calls. Especially in network code —
endpoint.connect()is idempotent from the application's perspective but does a full TLS handshake every time.
For a second-channel sanity check, the Jaeger service-architecture DAG agreed with the dashboard. Service-level call counts looked reasonable for the workload, not the four-figures-per-second I'd been seeing pre-fix:

What's next
The 6% reading held for an hour. I left it running overnight to soak. The next morning it was at 100% again.
Different bug. Slower. Worse, because it took an overnight run to surface. Next post.
Four days into the soak, the RAM was still climbing
Left the cluster running. Per-node CPU at 5% was real. The leak was somewhere else. Over 4 days of soak, 18 nodes climbed from 1 GB total RAM to 12 GB — and the worst offenders were nodes with zero active peers, holding the most state. Five accumulating data structures, no pruner, the time-tested pattern of "we'll clean it up later" never quite getting cleaned up.
After the join_peers storm fix the canary settled. Host CPU 6%, per-node CPU 0.05–0.10 cores, mesh stable, 19/19 live. I left it running overnight, expecting to come back to roughly the same numbers in the morning.
It came back to 100%.
Five days later, with the soak still going on the same processes, the numbers looked like this:
| Metric | t=0 fresh | t+32min | t+105h (now) |
|---|---|---|---|
| Total CPU (sysinfo sum) | 1.38 cores | 2.59 cores | 14.90 cores |
| Total RAM | 1.04 gb | 1.32 gb | 11.87 gb |
| Per-node avg CPU | 0.077 | 0.144 | 0.83 |
The previous post celebrated 0.05–0.10 cores per node. Four days later the average was 0.83. Eleven times worse. The cluster wasn't crashing; it was eroding.
The fingerprint that didn't fit
The shape of the cost was the giveaway. Looking at top consumers after four days:
| Node | Peers | CPU | RAM |
|---|---|---|---|
compute-8ac4eca1 | 1 | 2.57 cores | 2.35 gb |
gateway-16bfa75a | 0 | 1.88 cores | 1.64 gb |
compute-fbedd6ed | 0 | 1.39 cores | 1.11 gb |
broker-d8329b3a | 0 | 0.94 cores | 0.64 gb |
The nodes burning the most CPU had zero peers. That's a contradiction in a healthy mesh — a node with no peer connections should be idle. Either the work isn't happening on connections at all, or the connections aren't being counted right.

Both turned out to be true. The work was happening on connections the peer count didn't know about — closed-but-not-cleaned Connection handles still pinned by background tasks — and on global maps that had been growing for 105 hours without anyone telling them to shrink.
Five leaks, in order of impact
1. Ghost QUIC connections on peer reconnect
iroh-quinn allows multiple parallel connections from the same identity. When the same peer reconnects after a network blip, the application sees a new Connection; the old one just sits there alive until something explicitly closes it. The accept loop was doing this:
reg.insert(peer_id.clone(), conn.clone()); // overwrites registry tokio::spawn(run_bi_echo_reader(conn_bi)); // holds clone of new conn run_frame_reader(/* ... */, conn /* moved */); // holds new conn
What it never did was close the previous Connection before overwriting the entry. The old run_frame_reader task was still parked on accept_uni() of the old conn, which would never error because the old conn was never closed. Same with run_bi_echo_reader and accept_bi(). They sat there forever, holding Connection clones and keeping iroh-quinn's per-connection state (congestion controller, TLS session, packet pacer) live.
Fix:
if let Some((_, old_conn)) = reg.remove(&peer_id) {
old_conn.close(0u32.into(), b"superseded by new connection");
}
reg.insert(peer_id.clone(), conn.clone());
Connection::close causes both accept_uni and accept_bi on the old conn to return Err, the old tasks break out cleanly, the iroh-quinn state drops.
2. live_digests and topic_membership grow forever
Two process-global DashMaps. live_digests is keyed by node_id with the latest GossipDigest received from that node. topic_membership is keyed by topic-label with a set of node_ids ever seen on that topic. Both are populated on every received digest. Neither had a pruning mechanism.
Every cluster respawn (and there had been many over the week of debugging) added entries with new node_ids — admin-ui pre-mints a fresh keypair per spawn, so every restart creates a new identity. Old identities never broadcast again. Their entries stay forever.
Fix: a single background task on a 5-second timer that scans live_digests, drops entries whose wall_time_ms is older than MESH_STALENESS_MS (default 30 seconds), then removes those node_ids from every topic's membership set.
async fn run_staleness_pruner() {
let staleness_ms: u64 = std::env::var("MESH_STALENESS_MS")
.ok().and_then(|s| s.parse().ok()).unwrap_or(30_000);
let mut tick = tokio::time::interval(Duration::from_millis(5_000));
loop {
tick.tick().await;
let now_ms = SystemTime::now()
.duration_since(UNIX_EPOCH).map(|d| d.as_millis() as u64).unwrap_or(0);
let stale: Vec<String> = live_digests().iter()
.filter(|e| now_ms.saturating_sub(e.value().wall_time_ms) > staleness_ms)
.map(|e| e.key().clone()).collect();
for node_id in &stale {
live_digests().remove(node_id);
}
for mut topic_entry in topic_membership().iter_mut() {
for node_id in &stale {
topic_entry.value_mut().remove(node_id);
}
}
}
}
The self-injection on each node's gossip emit refreshes its own wall_time_ms, so a live node's entry never goes stale.
3. mesh_id_registry never pruned
Parallel to the peer PeerRegistry, there's a mesh_id_registry: DashMap<String, String> that maps peer_id → peer_mesh_id, populated from Hello frames so bridges can know which mesh a peer belongs to. The disconnect handler removed the peer from PeerRegistry but left the entry in mesh_id_registry. Over the soak, that map grew with every peer ever seen.
One-line fix in the existing disconnect handler:
Err(_) => {
registry.remove(&peer_id_str);
mesh_id_registry.remove(&peer_id_str); // added this
// ...
}
4. joined_peers HashSet only ever inserted
The fix from the previous post — joined_peers to dedupe join_peers calls — was correct, but I never made it shrink. If a peer disconnects, its entry in PeerRegistry is gone, but its entry in joined_peers lingers. On reconnect, the new connection wouldn't get join_peers called for it, because the old entry was still there.
joined_peers.retain(|p| registry.contains_key(p)) on every tick. Self-trimming.
5. dial_seeds one-shot, no retry
Not a leak per se but discovered during the same audit. dial_seeds did a single endpoint.connect() per seed at boot. If the seed wasn't up yet (race during cluster bootstrap), the child was permanently isolated — no retry, no fallback. Replaced with per-seed tokio tasks doing exponential-backoff retry: 1s, 2s, 4s, 8s, 16s, capped at 30s, max 10 attempts. Spans mesh.seed.retry and mesh.seed.giveup are emitted at INFO so seed-side issues are visible in Jaeger.
What I'd tell a team
- Soak the substrate before you trust it. A 30-second canary won't catch slow leaks. Five minutes won't either. The bugs in this post took 4 days to surface in a meaningful way. If your substrate matters, leave it running overnight before you ship anything on top of it.
- Process-global state needs an owner. Every
DashMapthat lives for the process lifetime needs a clear answer to "what removes entries from this?" If the answer is "nothing, the process restarts and it's fine" — that's a hidden cluster-restart dependency. Add a pruner. - Closed channels and overwritten registry entries are not free. The
tokio::select!continueon a closed channel that spins a core, theDashMap::insertover an existing key whose value held resources — both compile, both look fine in review, both cost you real CPU and memory. When you find yourself overwriting a key whose previous value held something withDropsemantics (aConnection, aJoinHandle, aFile), explicitly handle the old value. - The fingerprint matters. Top CPU consumers having zero peers was the clue. It told me the problem couldn't be peer-count-dependent — it had to be time-dependent and stateful. That narrowed the search from "anywhere in the gossip protocol" to "what state grows when we lose connections."
After the five fixes landed I started another soak, this time with the new pruner emitting mesh.staleness.pruned spans every 5 seconds and a 30s MESH_STALENESS_MS window. The topology view stayed populated with real values rather than ghost entries:

What's next
Steady-state works. The next question is what happens under load that isn't steady — when nodes die, when links flap, when clocks drift. The chaos battery I built for Sprint 02 has been sitting unused while the CPU work happened. Next week I unleash it on the substrate I just got working.
Chaos-pass replaces tests-pass
Steady-state passing isn't good enough for a substrate. I built a chaos harness with 13 primitives — kill, restart, partition, wedge, flap, clock-skew, slow-link, lossy-link, the works — and ran it against the just-stabilized mesh. Twelve tests, eight chaos-class, all green. Here's what each primitive surfaces and why "tests pass" by itself doesn't mean the substrate is sound.
After the soak fixes settled, the cluster ran flat for a week. CPU bounded, RAM bounded, peer counts stable, no nodes silently eroding. That's a green light for "the substrate works at idle," not for "the substrate is shippable." A control-plane substrate that only behaves under steady-state is the kind of thing you discover is broken six hours into an incident, when a single broker has been flapping and the rest of the cluster can't decide if it's dead.
So this week I pointed the chaos battery at it.
The principle
Sprint 02 of this project locked Golden Principle #5: chaos-pass replaces tests-pass. The idea isn't novel — Netflix put Chaos Monkey in production a decade ago, Jepsen has been making distributed systems vendors look bad since 2013. The novel part for this substrate is that the chaos battery is the same harness regardless of which feature sprint is running. Every sprint's exit criterion has to be "the existing tests pass under chaos," not "the existing tests pass."
That gives you a forcing function. A test that's green at idle but flaky under PartitionPair is broken, not flaky — the substrate has a real failure mode you're now choosing to ignore. The chaos battery is what stops you from ignoring it.
The primitives
The chaos crate (crates/mesh-chaos) ships thirteen primitives. Each one is a Rust struct implementing ChaosPrimitive with three things: an apply() that does the damage, a revert() that puts things back, and a detect() that tells the test framework what evidence to look for in the OTLP spans to confirm the substrate noticed.
| Primitive | What it does | What it surfaces |
|---|---|---|
KillNode | SIGKILL one node | Connection-drop detection; staleness pruner |
RestartNode | Kill + respawn same identity | Reconnect path; ghost-connection cleanup |
BurstKill | Kill multiple nodes at once | Quorum/membership behavior under sudden loss |
WedgeNode | Pause node's tokio runtime | Slow-vs-dead distinction; backpressure |
DiskFull | Fill data dir | Identity file fsync failures; graceful degradation |
PartitionPair | Break connectivity between two nodes | Spanning-tree heal; alt-path routing |
PartitionSubset | Isolate a subset from the rest | Split-brain detection; bridge survival |
FlapLink | Repeatedly up/down a peer link | Reconnect storms; idle-timeout interaction |
FirewallInbound | Drop inbound traffic to a node | Asymmetric failure; outbound-still-works case |
ClockSkew | Skew a node's wall clock | Staleness comparison robustness |
NatShift | Simulate NAT remapping | iroh's hole-punch re-establishment |
SlowLink | Add latency to a link | Backpressure; head-of-line blocking |
LossyLink | Drop a % of packets on a link | Plumtree's IHAVE retransmit path |
Eight of those have corresponding test files (tests/chaos/*.rs). The other five are in the harness but not yet wired into named tests — they're available for ad-hoc cluster torture.
The run
I bootstrapped 18 nodes — the standard 2-mesh layout, with bridges — and pointed the test runner at the chaos battery. The admin-ui Tests tab is where I watched the results land.

The numbers worth pulling out of that grid:
chaos-soak-9prim-5minran 307.7 seconds, fired 28 chaos events, all 28 passed within the 15% flake budget. That's the headline test — 5 minutes of continuous random chaos, the substrate doesn't fall over.remove-resiliencekilled 3 of 6 spawned nodes and verified all 3 survivors continued emitting heartbeats. The substrate notices, prunes the dead entries, keeps going.gossip-swarm-formsasserts that after a clean boot 4 nodes exchange ≥200 gossip digests within 34 seconds. That's the basic "Plumtree is actually doing its job" test.gossip-mesh-to-meshverifies that 100mesh.cross.peer_connectedspans fire across all services within 57 seconds — proving the bridge architecture actually bridges.backpressure-stream-floodfires 32 concurrent bi-streams of 1 KiB payloads for 10 seconds. 200+ round-trips, 0 errors. The data plane survives concurrent load.
What chaos catches that integration tests don't
The interesting one to me is the gap between the functional tests on top (5 of them — framer round-trip, frame truncation, traced-frame, unknown-tag rejection, bi-stream-echo) and the chaos tests on the bottom (7 of them, all chaos-tagged).
A functional test like framer-roundtrip answers "does the framer encode and decode correctly when nothing else is going on." That's necessary. It is not enough. The framer also has to be correct when the surrounding QUIC connection is being killed by KillNode, when the receiving node is being wedged by WedgeNode, when packets are being dropped by LossyLink. The chaos tests run the same framer code against those conditions.
backpressure-stream-flood is the cleanest example. The flood test by itself would catch "can the substrate do 200 round-trips in 10 seconds." It can. The chaos-tagged version of the same test catches "can it do 200 round-trips in 10 seconds while three random chaos primitives are firing in the background." That's a different question.
The timeline view is where chaos-period vs. steady-period events become legible. Every peer.connected event shows up with its source and target; chaos events get their own timeline rows; you can see them interleave in real time.

When the chaos harness fires KillNode, you see a node disappear from the timeline. Within a few seconds you see the surviving nodes re-establishing connections — peer.connected events fan out across the substrate as HyParView reshuffles the active view. The chaos worked; the substrate responded; both are visible.
What I'd tell a team
- Steady-state is the baseline, not the target. If your test suite only runs at idle, the test suite isn't done. It says nothing about how the substrate behaves under the conditions you'll actually encounter in production. Idle is the easiest case; you need the hardest cases too.
- Build chaos primitives once, reuse them everywhere. The 13 primitives in
mesh-chaosare shared between unit tests, soak runs, and ad-hoc torture sessions in the admin-ui. The cost amortizes immediately. The alternative — every test author writing their own kill-the-broker helper — gives you 5 incompatible flaky helpers and no shared vocabulary about what "broken" means. - Make chaos visible. The admin-ui's Chaos and Timeline tabs are not just dashboards — they're the operator's view of what's failing and what's recovering. Without them, "the test passed" is a green light and nothing else. With them, you can see the substrate noticing chaos, choosing to react, and re-establishing. The visibility is the test.
- Define a flake budget up front and stick to it.
chaos-soak-9prim-5minruns with a 15% flake budget — 4 out of 28 events can be allowed to miss their detection window before the test fails. That budget is what separates "we have flaky tests" from "we have a known reliability envelope." If the substrate ever runs at 50% flake, that's an architecture problem, not a test problem.
What's next
That closes the engineering arc. The next post wraps it up — the substrate is sound, the chaos battery is green, and the whole thing is going public. Open-sourcing it next week.
Open-sourcing the Rust Distributed Mesh
Five weeks of building, breaking, and fixing a P2P mesh substrate in Rust. Today I'm pushing the whole thing public — the iroh-based transport, the gossip emit loop, the staleness pruner, the example node types, the React dashboard. Not a thing to take and run in production. A thing to read while you're building your own.
Five weeks ago I started building a P2P mesh substrate in Rust on iroh. The point was never the mesh — it was learning what it actually costs to run one before betting a real product on it. The first canary pegged an 80-core box at 100% CPU with 18 nodes idling. The fifth week's canary holds the same 18 nodes at 5%, steady-state, through a 4-day soak.
Today I'm pushing the whole repo public.
→ github.com/drlukeangel/rust-distributed-mesh

It is not a library you should take and run in production. It's a kit you can read while you're building your own. The bugs I paid for are now diffs you can study. The dashboard I built so I'd believe my own numbers is in there too — React, plain Vite, no framework. The flamegraph captures are in /profiles.
What's in the box
| Piece | What it does |
|---|---|
crates/mesh-node-base | The substrate. Identity, gossip emit loop, peer registry, LoadSampler (self-reported CPU/RAM), staleness pruner. Built on iroh-gossip 0.98. |
crates/mesh-transport | Thin layer over iroh's Endpoint. ALPN, mDNS toggle, bind addr, 30s idle timeout. |
crates/mesh-telemetry | OTLP/tracing init. Every node's spans land in Jaeger; receive-time staleness is local, not sender-side. |
admin-ui/ | React + Vite topology view. Live node grid, hub-and-leaf layout, CPU/RAM bars per node, kill button per card. The thing you stare at when you don't believe the numbers. |
broker / gateway / compute / registry / bridge | Example node types. Each one is a 10-line main.rs that calls NodeRuntime::new("type").run().await. From the substrate's perspective they're interchangeable. |
Stack: rust (substrate) · iroh 0.98 + iroh-gossip (QUIC + NAT traversal + Plumtree/HyParView) · tokio · opentelemetry → Jaeger · react + vite (UI).
The five engineering posts in this notebook walk through the work in order:
- Why I'm building a distributed mesh substrate in Rust — the architecture, the iroh choice, what I expected to break.
- When 18 nodes pegged my 80-core box at 100% — the first round of obvious wins (mDNS off, gossip interval up, INFO spans down). 100% → 35%.
- Flamegraphing your way out of "this can't possibly be right" — the
join_peersstorm that I almost rewrote the architecture to escape. 35% → 5%. - Four days into the soak, the RAM was still climbing — the slow leaks that only a long-running soak surfaces. Ghost connections, unbounded global maps, the staleness pruner that ties it together.
- Chaos-pass replaces tests-pass — 13 chaos primitives, 12 tests, 5-minute soak under continuous random chaos. All green.
What the kit is, and isn't
This is a learning artifact. It is not:
- A finished product. The streaming layer that sits on top of this substrate — the actual event-streaming app — isn't open yet. What you're reading is what's underneath it.
- A managed iroh integration. It's an opinionated set of patterns for using iroh-gossip in a long-running process. Different from a library — closer to a scaffold.
- Production-ready. I've run it on one box. I haven't run it across NATs, across regions, or at scale. The substrate is correct under the workload I've tested; it's not proven outside it.
It is:
- A documented diff trail. Every fix in the four engineering posts above corresponds to a commit in the repo. You can
git logyour way through the optimization arc. - A flamegraph dataset. The 2 GB
.foldedfiles that surfaced thejoin_peersstorm are in/profiles. Runinferno-flamegraphon them and you'll see the bug. - A dashboard you can read. The admin-ui code is short — maybe 800 lines of TypeScript. It joins the mesh as a passive observer and renders the topology live. Plain React, plain Vite, no state framework. Lift it if it helps.
Telemetry is the second view of the same thing. Every span lands in Jaeger; the System Architecture DAG gives you the service-to-service call pattern without leaving the browser:

Numbers, before and after
| Metric | First canary | After five weeks |
|---|---|---|
| Host CPU (80-logical box, 18 nodes idle) | 100% | 5% |
| Per-node CPU avg | 0.83 cores | 0.05–0.10 cores |
| Per-node RAM avg | growing linearly | bounded at ~60 MB |
| Stable across 4-day soak | no — climbing on every axis | yes |
| Bugs I shipped in the first version | every single one in the four posts above | most fixed; one or two known sharp edges remain |
For comparison: a Bitcoin Core full node idles at 5–10% of one core. A Tor relay idles under 1%. The mesh substrate as it stands is competitive with both at this scale.
What I'd tell someone building one
I've said most of this in the four posts. Concentrated:
- The protocol probably isn't the bug. Measure first. I almost rewrote to a centralized-Controller architecture before checking. The flamegraph took 15 seconds to capture and made the question moot.
- Comments lie. Flamegraphs don't. Every CPU bug I found had a comment claiming the code was cheap. Trust the profile.
- Soak the substrate. A 30-second canary will not catch the bugs that take 4 days to surface. If the substrate is load-bearing, leave it running overnight before you build anything on it.
- Process-global state needs an owner. Every
DashMapthat lives for the process lifetime needs a clear answer to "what removes entries from this?" If the answer is "nothing, we just restart" — add a pruner before you ship. #[instrument]on an infinite loop is almost always wrong. The span never closes. The event queue grows forever. Decorate the work inside the loop, not the loop itself.
What's next
The streaming layer is what comes next, built on the substrate I just spent five weeks debugging — event streaming, multi-tenant by topic, nodes across networks — riding on a foundation that now actually idles when it has nothing to do. But there turned out to be one more leak hiding in the substrate first, which is where this notebook goes from here.
If you're building on iroh, fork freely. If you find a bug I missed, open an issue. The repo will keep moving.
Hunting a connection leak the soak test wouldn't explain
I'd already open-sourced this thing. Then a longer, meaner soak — kill and respawn for hours, not minutes — showed RSS climbing and never coming back down, on Windows and Linux both. The bug wasn't where I'd looked before. It was a select! arm that quietly switched itself off, and two more leaks hiding behind it.
I'd already pushed this whole thing public. The mesh worked: nodes discovered each other, gossiped, survived the chaos battery. The four-day soak had already cost me five leaks and taught me to never trust a thirty-second canary. I thought I was done with this class of bug.
Then I ran a meaner soak — kill and respawn on a tight loop, the kind you leave going for hours — and watched RSS climb and never come back down. Same shape on Windows, same shape on Linux. Classic leak, and not the one I'd already fixed.
This is the story of finding it, and the three things that turned out to be wrong at once.
The trap: theorizing about the reap path
The mesh runs an iroh QUIC transport under a HyParView/Plumtree gossip layer. Under churn, peers join and leave the active view constantly, so connections are born and reaped all day long. Every reap path I traced should have worked. The connection loop returned. close() got called. The maps got pruned. I spent hours reading code that, on paper, released everything it touched.
This is the same hole I fell into before the flamegraph post — sitting there reasoning about what the code should do instead of measuring what it did. The lesson refuses to stick the first time, or the second: when a leak resists reasoning, stop reasoning and instrument it.
The breakthrough: count, don't infer
So I stopped reading and added two counters to the gossip actor — connection-loop tasks spawned versus tasks finished — and ran a short soak. The numbers ended the debate:
spawned = 174, finished = 52 → 122 connection-loop tasks stuck alive
Meanwhile the membership map sat bounded at ~15 peers, exactly as it should. So roughly 107 connection-loop tasks had no peer state behind them and yet had never exited. That contradiction — live tasks with nothing to serve — pinned the bug to one place instead of the whole gossip layer.
Root cause #1: a select! arm that switched itself off
Here's the send loop, simplified to the part that mattered:
tokio::select! {
_ = &mut closed => break,
Some(msg) = self.send_rx.recv() => self.write_message(&msg).await?,
// ...
}
When a peer leaves the active view, its send_tx is dropped and recv() starts returning None. The trap is that the Some(msg) = … pattern doesn't deliver that None to me — when the pattern fails to match, select! disables that branch for the rest of the loop. The branch goes dark.
The only other long-lived arm, _ = &mut closed, stays pending forever, because nothing has actually closed the connection — that was supposed to happen because the send loop noticed the peer was gone. So the loop parks on a future that can never resolve. The send task hangs, the connection loop that owns it never completes, and the QUIC Connection and its driver task are stranded. One leaked connection for every peer that leaves — and the nodes that rotate peers the most leak the fastest.
The fix is to stop pattern-matching the channel closed away and handle the None myself:
msg = self.send_rx.recv() => match msg {
Some(msg) => self.write_message(&msg).await?,
None => break, // all senders dropped -> peer gone -> tear down
},
Stuck tasks went from 122 and climbing to a bounded ~15. This is a cousin of the reconnect leak from the soak post — both are a task outliving the thing it was serving — but the mechanism is different and nastier, because the code looks like it handles shutdown. The closed arm is right there. It just never gets a chance to fire.
Root causes #2 and #3, because leaks travel in packs
With connections bounded, a heap profile still grew — slower, but up and to the right. Two more, both smaller, both mine:
- Telemetry retention. The OpenTelemetry tracing layer was floored at
DEBUG. Under churn the network stack emits a debug firehose, andtracing-opentelemetryappends every captured event to the currently-active span's buffer — which is only freed when that span closes. My long-lived actor spans never close. So their event buffers grew without bound. One single 2 MB allocation in the profile turned out to be one span's event vector. The fix was a one-liner: floor the export layer atINFO. - Process-table enumeration. The per-node load sampler built its system handle by enumerating every process on the box, every tick — tens of thousands of transient name strings on Windows, every couple of seconds. It only ever needed our own process. The fix: don't enumerate the world; sample only our own pid.
Neither of those is exotic. Both are the kind of thing that compiles, reads fine in review, and costs you megabytes an hour in production.
The payoff
Measured with dhat, before and after, under the same kill-and-respawn soak:
| bucket | before | after |
|---|---|---|
| total retained heap | 42.9 MB | 8.1 MB (↓81%) |
| tracing / otel | 29.2 MB | 2.0 MB |
| sysinfo | 9.1 MB | 1.6 MB |
| quic connections | growing | bounded |
And the thing that actually matters — RSS troughs under sustained chaos went from a monotonic climb to a flat plateau, on Windows and Linux both. Over a 35-minute soak, an observer node's RSS at 140 chaos events dropped from 0.228 GB to 0.115 GB — about 66% lower — and dhat put the QUIC connection bucket at 18.7 MB → 3.85 MB.
Where the bugs actually lived
Here's the part I had wrong going in. I assumed — the way you always do — that the bug was in my code, not the library. The two small ones were: the telemetry floor and the process-table sampler were my config, one-line fixes. But the connection leak itself, the dominant one, was in the stack, and tracing it produced a cluster of fixes I submitted upstream to three crates:
- The gossip layer got the most. The
select!SendLoop footgun above; makingconnection_loopexit (it ran send and receive underjoin!, which waits for both — but the receive half blocks forever onaccept_uni()when a peer leaves locally, so I moved it toselect!); and pruning the per-peer state that outlived removed peers —peer_topics,peer_data, thelazy_push_queueonNeighborDown. Plus regression tests so the leak can't creep back. - iroh itself had a per-remote address cache (
AddrMapbehindmapped_addrs) with no eviction path — it grew once per remote ever seen under churn. The clean-shutdown path now evicts the departing remote's cached addresses. - The QUIC layer underneath leaked a whole connection task, packet spaces, and channels whenever a
Connectingwas dropped before its handshake finished — and before the handshake there's no idle timeout to eventually reap it. The fix was animpl Drop for Connectingthat drains and releases.
Two days of soak-and-profile to find them; the diffs themselves are tiny. That's the usual ratio for a leak that only shows up under sustained churn — the finding is the work, the fix is a few lines.
A thank-you to the people who built this
I want to stop and say this plainly, because it's easy to skip past: I got to find these at all only because the whole stack is open. I'm building on iroh and iroh-gossip — peer-to-peer QUIC, NAT traversal, hole-punching, a relay tier, Plumtree/HyParView gossip — none of which I could have written myself in a reasonable lifetime. It's built by the team at n0 (github.com/n0-computer), and the quality of it is the reason my "substrate" is a few hundred lines instead of a few hundred thousand.
And here's the part that still feels lucky every time: when I did hit real bugs deep in that stack, I could read the exact code, instrument it, prove the fault, and send a fix back — and there's a real, responsive community on the other end to receive it. That's not how it goes with a closed black box, where the best you can do is file a ticket into the void and build a workaround. The n0 folks have done years of genuinely hard systems work — the kind where a single select! arm or a missing Drop is the difference between flat and climbing memory — and they gave it away so the rest of us can stand on it. An enormous high-five to that whole team. We are extraordinarily fortunate to have makers like this, working in the open, on infrastructure this good. Thank you.
What I'd tell a team
- Instrument before you theorize. A spawned-versus-finished counter found in minutes what hours of reading the reap path missed. If a resource leaks, count the thing being created and the thing being destroyed before you reason about why.
Some(x) = exprin aselect!arm is a footgun wheneverexprcan legitimately yieldNone. The failed match disables the branch instead of surfacing the close. Bind the value plainly and match it yourself.- Leaks travel in packs. Fixing the dominant one just unmasks the next. Profile again after every fix — the flat line you were hoping for is usually one more leak away.
- Telemetry is not free. Events recorded inside a never-closing span live exactly as long as the span. A long-lived actor span at
DEBUGis an unbounded buffer wearing a tracing label. - Sometimes it is the library — and that's a contribution, not a complaint. I went in assuming the bug was mine, because it usually is. This time the dominant leak was in the stack itself, and the right ending wasn't a workaround in my code — it was a handful of small fixes submitted upstream so nobody else hits it. A leak found under your churn is worth fixing at the source.
What's next
The substrate is finally flat under churn — for real this time, measured, on both platforms. The fixes are upstreamed and the foundation holds. Which means I can stop poking at the substrate's memory behavior and turn back to the thing this notebook is actually about: making a multi-mesh fabric observable and correct, one sprint at a time.
Watching a node boot, then a second mesh with no bridge
The substrate gets a second life as a multi-mesh fabric — and the rule from day one is telemetry is the substrate, not a feature. A node that does work without leaving a trace is a bug. Here's a node naming itself, booting as a span chain you can read in Jaeger, building a topology directory out of gossip — and then a second mesh reached with no bridge node at all. Every claim is a screenshot of the live system.
The performance work earlier in this notebook left me with a substrate that idles when it has nothing to do and stays flat under churn. That was the foundation. This post starts the thing it was a foundation for: a small, observable multi-mesh fabric — named nodes in named meshes, a gossiped directory so any node can find any other, and not one line of bespoke infrastructure I could get from the library instead.
The rule I set on day one and never relaxed: telemetry is the substrate, not a feature. A node that does work without leaving a trace is a bug. So before there was any "product," there was a boot-span chain and a Jaeger instance to read it in.
A node names itself
When you spawn a broker into mesh1, it doesn't get a random hex id and it isn't named by some central authority. It loads (or mints) its own identity, then self-names from it:
node_name = <mesh>.<type>.<first-6-hex-of-node-id> e.g. mesh1.broker.239dc9
The mesh is required — an unlabelled node fails fast — the type is what it is, and the suffix is the first 6 hex of the node's public key, so the friendly name is eyeball-matchable to its unique id. No registry hands out names; identity is the name.
The boot is a span chain
Every boot is one trace rooted at node.ready, with the bring-up steps as children — endpoint_created → alpn_registered → gossip_started → accept_loop_started, each a few hundred microseconds, all under one root. If a node is misbehaving, the trace tells you exactly how far it got before it stalled.

This is the spine. Get the boot chain visible for one node and you can always answer "did it start, and how far did it get" without attaching a debugger.
A topology cache, built from gossip
Each node broadcasts a small digest on its mesh's gossip topic, including its reachable address. Every node accumulates those into a process-local directory — name → {mesh, type, location} — surfaced as a Cache view. This is the thing a node consults to answer "where do I send to reach X?" There's no second gossip system and no database: the cache is the gossip digests, surfaced.

A second mesh — and deleting the bridge
The earlier design had a bridge: a special node that joined two meshes' gossip and shuttled awareness between them. It's the obvious first idea, and it's the wrong one — it's a bespoke piece of mesh infrastructure, and the substrate's whole premise is don't build mesh infrastructure, use the library. So the bridge had to go: out went the bridge node type, its spawn button, its env knobs, the whole crate. Two meshes now run side by side with no node whose only job is to connect them.

Per-mesh gossip lives on a per-mesh topic (blake3(mesh_id)), so a mesh1 node doesn't see mesh2 by default. The interim answer: the gateway — the node that writes across meshes — also subscribes to the other mesh's gossip, so its directory spans both. A simulated writer on mesh1.gateway resolves mesh2.broker from that directory and sends to it. The proof isn't a line in a log; it's the cross-mesh trace stitching end to end:

The catch (named, not hidden)
"The gateway subscribes to the other mesh's entire gossip" works for two meshes on one host. It does not scale: with many meshes and hundreds of gateways it's an O(meshes²) firehose of every remote node's per-tick digest, and it quietly forces the console to special-case itself as an all-mesh subscriber. That's a real debt, and the next post pays it down with a proper control-plane backbone — after first fixing something more embarrassing: the traces themselves were lying.
What I'd tell a team
- Make telemetry the floor, not the polish. If "does this node work" can only be answered by reading its logs by hand, you'll be reading logs by hand forever. A boot-span chain is cheap and it's the thing you'll lean on every single debugging session after.
- Let identity be the name. Self-naming from the public key means no allocator, no name collisions, no central authority to be down — and a friendly id you can still match to the real one by eye.
- Refuse the bespoke node. A "bridge" felt necessary and wasn't. Every special node type is infrastructure you now own and operate; reach for the library's primitive before you invent one.