A two-year platform timeline drawn as growth rings on a tree cross-section: eight scattered device-API dots at the core converge into a single clean spine of rings, each ring a quarter of consolidation, auth, OTA, and migration work, widening outward into one solid platform.

Two years on medical IoT — the platform retrospective

#retrospective #leadership #api-platform #connected-products #medical-device

Two years ago this week I took a role I didn't fully understand the shape of: leading the API platform behind a connected-health portfolio. The job title said "platform." The org chart said "cloud team." What it turned out to be was the thing that holds a connected hardware product together for its entire life — the layer the toothbrush phones home to, the place every brushing session lands, the contract four other teams negotiate against. It was my first time owning that layer end to end, and I'm leaving it next week.

So this is the post I'd want to read if I were about to take the same job. Not a victory lap — a ledger. Two years, eight starting APIs, one platform, roughly a million devices on it by the time I'm walking out the door. What the architecture got right, what it got wrong, what I'd undo, and the things I was sure were mistakes that turned out fine. The one sentence I keep coming back to: the boundary between "platform engineering" and "IoT engineering" is mostly fictional, and I spent two years finding that out the long way.

The two-year arc

It's worth laying the whole thing out on one line first, because the shape of the arc is the argument. Thirteen of the twenty-four months were platform plumbing that shipped no new customer feature. That was the bet, and everything good downstream paid out of it.

Q4 2017 — diagnosis. Eight separate device APIs in production. Five had their own definition of "user." Two were formally deprecated and still serving live traffic. My first quarter was spent reading code I hadn't written and writing the memo that said, out loud, this is not a portfolio, it's eight products wearing a trench coat. The hardest part of the quarter wasn't the diagnosis; it was making the consolidation case to product leadership three separate times before it stuck.

Q1 2018 — privacy classification. Four hours in a conference room with the privacy office and a printout of the device payload, one field at a time, produced the three-tier data model. I went in thinking I'd get a yes/no per field and came out understanding that what regulates a datum is the claim and the join, not the bytes. That memo set the architecture for everything after it — every storage decision downstream inherited the rule "default to de-identified, promote only through a logged join."

Q2 2018 — the entity domain model. A six-week design exercise across every product line, ending in five entities: Account, Device, Consumable, Session, Event. Five entities that covered every device line we had and the ones the roadmap was adding. The adult-brush migration started the same quarter.

Q3 2018 — the auth model. Designed and shipped phone-as-gateway: per-device signing keys, BLE bonding for device-to-phone trust, OAuth for human-to-cloud, and a cloud that verifies what the phone can only carry. A BLE-only device can't authenticate to the cloud directly, so the whole scheme is about making a forgeable middleman's forgeries useless. This was the operational spine.

Q4 2018 — the first line on the new platform. The adult brush, biggest install base, migrated first. Fourteen weeks of strangler-fig work — dual-write to old and new, nightly reconciliation, thirty clean days, then cut reads over to v2.
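The cutover gate described above (dual-write, nightly reconciliation, thirty clean days) can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the function names and the dict-shaped stores are hypothetical stand-ins for the old and new APIs' data.

```python
_MISSING = object()  # distinguishes "absent" from a stored None


def reconcile(old_rows: dict, new_rows: dict) -> list:
    """Return the sorted keys whose records differ between the two stores."""
    keys = old_rows.keys() | new_rows.keys()
    return sorted(
        k for k in keys
        if old_rows.get(k, _MISSING) != new_rows.get(k, _MISSING)
    )


def clean_streak(nightly_mismatch_counts: list) -> int:
    """Count consecutive clean nights, ending at the most recent night."""
    streak = 0
    for count in reversed(nightly_mismatch_counts):
        if count != 0:
            break
        streak += 1
    return streak


def ready_to_cut_reads(nightly_mismatch_counts: list,
                       required_clean_days: int = 30) -> bool:
    """Reads move to the new store only after an unbroken clean streak."""
    return clean_streak(nightly_mismatch_counts) >= required_clean_days
```

The point of counting a streak rather than a total is that one dirty night resets the clock, which is what makes "thirty clean days" a meaningful bar.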

Q1 2019 — OTA goes live. Shipped the over-the-air firmware pipeline — signed images pushed through the phone, dual-bank flash, verify-then-commit. This is the quarter of the canary near-disaster, which I'll come back to, because it's the best argument in here for a thing I almost cut.

Q2 2019 — the rest of the fleet. Kids' brush and interdental migrated, both faster than the first because we'd learned. The dentist-portal feature started — a thing that would have been a six-month integration on the old architecture and was a one-week query against the new one.

Q3 2019 — maturity, and the door. The weekly device-platform sync had been running a year, and the cadence mismatch between an 18-month hardware clock and a two-week cloud clock had stopped being a chronic source of pain. Roughly a million devices on the new platform. A version of this architecture is still running. And I started planning my exit, which is the most honest signal I can give that the platform no longer needed me to hold it up.

What I got right

Consolidating the entity domain model before shipping a single feature. This was the bet the whole arc turned on, and it was deeply unpopular at the time. Thirteen months of platform work, no new customer-visible feature, while product leadership watched the roadmap sit still. I made the case three times before it took. The reframe that finally landed wasn't "this is good engineering hygiene" — nobody funds hygiene — it was "the dentist portal is a six-month integration today and a one-week query the day this is done." Put a feature they wanted on the other side of the bet and the bet sells itself. It paid for itself inside a year and compounded after. The teams that win on platform investment are the ones that take the unsexy bet early, and the only way to get the org to take it is to name the sexy thing it unlocks.

Forcing OTA to be production-grade before it scaled — the canary near-disaster. This is the one I'd point to first if someone asked what discipline bought us. In Q4 2018, planning the first big rollout, there was real pressure to ship OTA at half-quality — transfer the image, flash it, done — and bolt on the safety later. We didn't. We built the dual-bank verify-then-commit, the boot-counter rollback, and a post-update fleet-health dashboard that watched devices after they'd taken the update, not just whether the bytes arrived.
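For readers who haven't built one, the dual-bank verify-then-commit flow with a boot-counter rollback looks roughly like the simulation below. It is a sketch under stated assumptions: the class, the attempt threshold, and the use of a bare SHA-256 digest in place of real image-signature verification are all hypothetical simplifications, and real firmware would do this in a bootloader, not in Python.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

MAX_BOOT_ATTEMPTS = 3  # hypothetical threshold before rolling back


@dataclass
class OtaDevice:
    banks: dict = field(default_factory=lambda: {"A": b"firmware-v1", "B": b""})
    active: str = "A"               # last bank known to be good
    pending: Optional[str] = None   # bank under trial, if any
    boot_attempts: int = 0


def other(bank: str) -> str:
    return "B" if bank == "A" else "A"


def stage_update(dev: OtaDevice, image: bytes, expected_sha256: str) -> bool:
    """Write to the inactive bank, verify it, and only then mark it pending."""
    target = other(dev.active)
    dev.banks[target] = image
    if hashlib.sha256(dev.banks[target]).hexdigest() != expected_sha256:
        return False  # active bank untouched; device keeps running old firmware
    dev.pending = target
    dev.boot_attempts = 0
    return True


def boot(dev: OtaDevice, self_test_passes: bool) -> str:
    """Simulate one boot and return the bank the device ends up running."""
    if dev.pending is None:
        return dev.active
    dev.boot_attempts += 1
    if self_test_passes:
        dev.active, dev.pending = dev.pending, None  # commit the new bank
        dev.boot_attempts = 0
        return dev.active
    if dev.boot_attempts >= MAX_BOOT_ATTEMPTS:
        dev.pending = None  # give up on the trial image: roll back for good
    return dev.active
```

The property that matters is that the known-good bank is never overwritten until the new image has both verified and booted, so every failure path ends on firmware that already worked.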

Q1 2019, first real canary cohort, the dashboard lit up: devices were taking the update, reporting transfer success, and then going quiet over the next few hours. A firmware bug that only manifested under a specific post-boot condition the bench tests never hit. Because we were monitoring fleet health and not just transfer success, we caught it at a few hundred devices and halted the rollout. The version of this story where we shipped "transfer success = done" is the version where we push that image to the whole fleet over a weekend and brick twenty thousand toothbrushes in people's bathrooms. The safety work I almost cut is the only reason that sentence is hypothetical.
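The difference between "transfer success" and "fleet health" is easy to state in code. The sketch below is illustrative only: the record shape, the six-hour silence window, and the 98% threshold are assumptions made for the example, not the values the dashboard used.

```python
from dataclasses import dataclass


@dataclass
class CanaryDevice:
    device_id: str
    transfer_ok: bool
    hours_since_last_checkin: float  # measured after the update was applied


def transfer_success_rate(cohort: list) -> float:
    """The metric that looks fine right up until the fleet goes quiet."""
    return sum(d.transfer_ok for d in cohort) / len(cohort)


def post_update_health_rate(cohort: list, max_silence_hours: float = 6.0) -> float:
    """Share of updated devices that are still checking in afterwards."""
    updated = [d for d in cohort if d.transfer_ok]
    if not updated:
        return 1.0
    healthy = sum(d.hours_since_last_checkin <= max_silence_hours for d in updated)
    return healthy / len(updated)


def should_halt(cohort: list, min_health_rate: float = 0.98,
                max_silence_hours: float = 6.0) -> bool:
    """Halt the rollout when too many updated devices have gone silent."""
    return post_update_health_rate(cohort, max_silence_hours) < min_health_rate
```

In a cohort where every transfer succeeds but a tenth of the devices stop reporting, the first metric reads 100% and the second trips the halt.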

Bonding trust to a physical event. Re-pairing a device to a new account required a physical button-press on the device itself. The customer-experience team hated it — they wanted seamless, tap-to-transfer ownership — and they had a real point about the friction. I held the line anyway, because the alternative is a remote re-bond path, and a remote re-bond path is an account-takeover vector you ship to a million homes. No security audit in two years ever turned up a remote-rebond surface, because there wasn't one to find. Friction in the right place is a feature.

Treating the phone as a flaky courier, not a trusted client. Sign on the device, verify in the cloud, trust the phone for nothing load-bearing. The phone is a thing we shipped to an app store, running on hardware we don't control, that an attacker can decompile — so the whole auth model is built to make its forgeries useless rather than to trust it not to forge. This is the single principle I've carried, unchanged, onto every connected-product platform I've touched since. It travels because it's not about phones; it's about never putting trust in a hop you don't control.
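"Sign on the device, verify in the cloud" reduces to a small amount of code. The sketch below is a stand-in, not the scheme we shipped: it uses an HMAC over a shared per-device secret because that runs on the standard library alone, whereas a production design would more likely use an asymmetric signature with a private key that never leaves the device. The registry, key material, and envelope fields are hypothetical, and replay protection is left out.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical registry: device_id -> secret provisioned at manufacture.
DEVICE_KEYS = {"dev-001": b"per-device-secret-provisioned-at-factory"}


def sign_on_device(device_id: str, key: bytes, payload: dict) -> dict:
    """Runs on the device: sign the payload before handing it to the phone."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    message = device_id.encode() + b"." + body.encode()
    sig = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"device_id": device_id, "body": body, "sig": sig}


def verify_in_cloud(envelope: dict) -> Optional[dict]:
    """Runs in the cloud: return the payload only if the signature checks out."""
    device_id = envelope.get("device_id", "")
    body = envelope.get("body", "")
    key = DEVICE_KEYS.get(device_id)
    if key is None:
        return None  # unknown device
    message = device_id.encode() + b"." + body.encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope.get("sig", "")):
        return None  # altered in transit, or forged by the courier
    return json.loads(body)
```

The phone only ever sees the envelope. It can drop it, delay it, or resend it, but it holds no key, so anything it alters or invents fails verification.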

What I'd undo

There's a through-line in this section I didn't see until I wrote it down: every one of these is a version of "I wasn't in the room early enough." The mistakes weren't bad calls. They were calls I didn't get to make because I showed up after they were already made.

Letting hardware design the head-ID byte format with no API input. The byte format the chip used to identify a brush head was settled by the hardware team in mid-2017, before I started — frozen into silicon by the time I read it. When the platform went to model the Consumable entity in Q2 2018, the format fought us: no embedded version number in the head ID, no manufacturer field, no lot code. All things cloud-side analytics wanted and couldn't have, because the bytes were already shipping in the field and you can't change a contract a million units depend on. We wrote workarounds — a side table, an inference heuristic, two engineer-months of it. Two more bytes in that frame in 2017 would have erased all of it. The lesson is blunt and I've held to it since: be in the hardware spec meeting from week one, because the cheapest field in the world is the one you ask for before the board is laid out, and the most expensive is the one you wish you'd asked for after it ships.
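To make "two more bytes" concrete, here is a sketch of what they could have carried. Every layout detail is hypothetical: the real head-ID format isn't reproduced here, so the 4-byte serial stands in for the shipped frame, and the nibble-packed version and manufacturer fields plus a one-byte lot code are just one plausible way to spend the extra bytes.

```python
import struct
from typing import NamedTuple


class HeadIdV0(NamedTuple):
    """Stand-in for the shipped format: an opaque serial and nothing else."""
    serial: int


class HeadIdV1(NamedTuple):
    """The same serial plus two hypothetical extra bytes."""
    serial: int
    format_version: int  # high nibble of byte 4
    manufacturer: int    # low nibble of byte 4
    lot_code: int        # byte 5


def parse_v0(frame: bytes) -> HeadIdV0:
    (serial,) = struct.unpack("<I", frame)
    return HeadIdV0(serial)


def pack_v1(head: HeadIdV1) -> bytes:
    meta = ((head.format_version & 0x0F) << 4) | (head.manufacturer & 0x0F)
    return struct.pack("<IBB", head.serial, meta, head.lot_code)


def parse_v1(frame: bytes) -> HeadIdV1:
    serial, meta, lot_code = struct.unpack("<IBB", frame)
    return HeadIdV1(serial, meta >> 4, meta & 0x0F, lot_code)
```

With a version nibble in the frame, the cloud can branch on format instead of inferring it, which is the job the side table and the heuristic ended up doing.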

Treating the smaller device lines as "later" for longer than I should have. I let the kids'-brush and interdental migrations drift into Q2 2019 when I could have pulled them into Q1. The reasoning felt sound — smaller install base, the platform work could "wait," spend the capacity on the big line. What actually happened is that the longer a line sat on its own old API, the more its own little customizations accreted, and the more there was to reconcile when we finally migrated it. Small systems don't stay small and clean while they wait. They grow their own weight. Migrate the cheap ones while they're still cheap, before drift makes them expensive — the opposite of the instinct to do the big valuable one first and mop up later.

Building OTA failure telemetry only just in time instead of early. The fleet-health dashboard that caught the canary brick was built in Q4 2018, weeks before the first rollout needed it. It worked — but if I'd built it in Q2 2018 I'd have had two quarters of baseline behavior to compare the canary against, and the anomaly would have been even louder and earlier. It wouldn't have changed the outcome that time. But the principle generalizes and I underweighted it: the only telemetry that helps is the telemetry you were already collecting before the event you need to detect. You cannot instrument a fire while it's burning. The dashboard you stand up the week you need it is the dashboard with no normal to measure against.
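A small sketch of why the baseline matters: an anomaly score is only defined relative to history you already hold. The function name and the choice of a plain z-score are assumptions for illustration, not a description of the dashboard's actual math.

```python
from statistics import mean, stdev


def anomaly_score(baseline: list, observed: float) -> float:
    """Standard deviations between an observation and the baseline mean."""
    if len(baseline) < 2:
        # No history means no "normal" to measure against.
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return abs(observed - mu) / sigma
```

With a few quarters of daily check-in rates as the baseline, a canary cohort dropping to 90% scores many standard deviations out. With a dashboard stood up that week, the same call has nothing to compare against.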

What I thought were mistakes that turned out fine

These are the calls I second-guessed at the time, braced to regret, and didn't. Worth naming, because "the conventional wisdom said X and we did Y and Y was right for us" is its own kind of lesson — the one about knowing your actual workload instead of the workload in the conference talk.

Picking Postgres over DynamoDB. A connected-product platform with a million devices doing a few sessions a day each: the 2018 conference-circuit answer was DynamoDB, full stop, NoSQL-at-scale, relational-won't-keep-up. We put the platform on Postgres on RDS instead, and I expected to be writing the "why we migrated off Postgres" post within eighteen months. Two reasons we didn't: the team's operational depth in Postgres was real and DynamoDB depth was not, and — the bigger one — our domain model was relational to the bone. Account owns Device owns Consumable, Session joins back to all three. That's a graph of foreign keys, not a bag of denormalized items, and forcing it into single-table DynamoDB would have meant fighting the data's actual shape to satisfy a scaling story we hadn't yet hit.

It scaled fine. Our write volume and access patterns sat comfortably inside what one well-tuned RDS instance handles. The honest read in hindsight: for a pure telemetry firehose I'd reach for a different store — high-volume append-only time-series is exactly DynamoDB's wheelhouse, and that's a tradeoff I've written about since — but our workload was a relational entity model with a telemetry side, not a telemetry firehose with some metadata, and we picked for the workload we had rather than the one the talks were about.
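The "graph of foreign keys" shape is easiest to see as a schema and one join. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere; the table names, column names, and the portal-style query are hypothetical, not the production schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE account    (id TEXT PRIMARY KEY);
CREATE TABLE device     (id TEXT PRIMARY KEY,
                         account_id TEXT NOT NULL REFERENCES account(id));
CREATE TABLE consumable (id TEXT PRIMARY KEY,
                         device_id TEXT NOT NULL REFERENCES device(id));
CREATE TABLE session    (id TEXT PRIMARY KEY,
                         device_id TEXT NOT NULL REFERENCES device(id),
                         consumable_id TEXT REFERENCES consumable(id),
                         duration_s INTEGER NOT NULL);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one account, device, head, and two sessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO account VALUES ('acct-1')")
    conn.execute("INSERT INTO device VALUES ('dev-1', 'acct-1')")
    conn.execute("INSERT INTO consumable VALUES ('head-1', 'dev-1')")
    conn.executemany(
        "INSERT INTO session VALUES (?, 'dev-1', 'head-1', ?)",
        [("s-1", 120), ("s-2", 90)],
    )
    return conn


def sessions_for_account(conn: sqlite3.Connection, account_id: str) -> list:
    """Portal-style question: every session for an account, with the head used."""
    cursor = conn.execute(
        """
        SELECT s.id, s.duration_s, c.id
        FROM session s
        JOIN device d ON d.id = s.device_id
        LEFT JOIN consumable c ON c.id = s.consumable_id
        WHERE d.account_id = ?
        ORDER BY s.id
        """,
        (account_id,),
    )
    return cursor.fetchall()
```

A question like "all sessions for this account, with the head that was on the brush" is one join here. In a single-table key-value design the same question has to be anticipated in the key layout up front.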

Home-grown REST ingestion instead of a managed device-cloud. The other call I braced to regret. The managed IoT-cloud option on the table assumed devices speak MQTT directly to a broker, and ours couldn't. Our device's only radio was BLE; every byte reached the cloud inside an HTTPS request made by the customer's phone, not the device. A device-direct broker model has nowhere to put a phone-as-gateway topology. So we built our own ingestion: REST with idempotency keys for the late-and-duplicated reality of phone-relayed uploads, per-device signing so the cloud could verify what the phone merely carried, and append-only events. For a BLE-only fleet in 2018 it was simply the correct call: the managed option didn't fit the topology. It was never a case of "we preferred to build."
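The idempotency-key half of that design fits in a few lines. This is an in-memory sketch under assumptions: the class is hypothetical, and the key format shown in the usage note (device ID plus a device-side sequence number) is one plausible choice rather than the format we used.

```python
class EventLog:
    """Append-only event store with idempotency-key de-duplication."""

    def __init__(self):
        self._events = []  # append-only: entries are never updated or deleted
        self._seen = {}    # idempotency key -> position in _events

    def ingest(self, idempotency_key: str, event: dict) -> tuple:
        """Return (status, position); a replayed key returns the original position."""
        if idempotency_key in self._seen:
            return ("duplicate", self._seen[idempotency_key])
        self._events.append(dict(event))  # copy so callers can't mutate history
        position = len(self._events) - 1
        self._seen[idempotency_key] = position
        return ("created", position)

    def events(self) -> tuple:
        """Read-only view of everything ingested so far."""
        return tuple(self._events)
```

A phone that uploads the same session twice, say after losing connectivity mid-request, sends the same key (for example "dev-1:0001") both times. The second call reports a duplicate and the log still holds one event.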

The wrinkle that makes this a "turned out fine" rather than a "got right" is the part I couldn't have known: the home-grown stack was correct for its era and would be the wrong call in a later one. The day a connected device ships with its own WiFi, it can speak to a managed broker directly, the phone stops being load-bearing, and rolling your own ingestion goes from necessary to indulgent. I didn't make that newer call here — different team, different era, different radios — but the medical-IoT stack quietly taught me to date my architecture decisions. The right answer in 2018 and the right answer later aren't the same answer, and a decision that doesn't carry its own expiration date is a decision you'll defend past the point it's true.

The three things this arc taught me

Strip away the toothbrushes and the HIPAA memos and the byte-packed BLE frames, and two years left me with three convictions I haven't had to revise since.

One: an API platform and an IoT platform are the same thing wearing two name tags. I took this job thinking they were different disciplines and spent two years discovering the overlap is nearly total. Build an API platform right — versioned contracts you never mutate, append-only events, per-device attestation, a clean entity model underneath — and you have already built an IoT platform. The "IoT" framing adds a transport layer and a marketing budget. Everything load-bearing underneath is the same platform engineering it always was. I stopped treating "IoT" as a separate skill the day this clicked.

Two: the hardware constraint is the product constraint, not a limitation to apologize for. A device with no WiFi isn't a device missing a feature; it's a fundamentally different product with a different platform architecture. Our entire phone-as-gateway auth model, our whole ingestion design, the 20-byte BLE frame discipline — all of it falls directly out of "the radio is BLE and the battery is small." The teams that struggle treat that as a temporary annoyance to be engineered around. The teams that ship treat it as the first design input. The constraint isn't in the way of the architecture. The constraint is the architecture.

Three: the platform compounds; the hardware doesn't. This is the one I'd staple to the whole series. Every connected product you ship gets a fresh PCB, a fresh BOM, a fresh manufacturing line, a fresh certification. The hardware investment resets to zero with every device, and you pay the cost again, in full, every time. The platform doesn't reset. It's there when the next device shows up. Invest in it and each new product is cheaper than the last, because it inherits the entity model, the auth, the OTA pipeline, the telemetry. Skip the investment and each new product is more expensive than the last, because it's another bespoke integration onto a pile of bespoke integrations, which is exactly the eight-API mess I walked into. The dentist portal going from a six-month integration to a one-week query is this curve made visible.

What this set up for me

I didn't know it walking out the door next week, but everything in this notebook is the first draft of a playbook I'd run again. Some years on, I came back to connected products from a different seat — my own engineering team, owning the hardware and the platform this time, in an era where the device carried its own radio and a managed device-cloud was the obvious default rather than a topology mismatch. The second run is its own series: the same arc — entity model, OTA, fleet identity, operational telemetry — with a newer toolchain and a wider scope. The longer retrospective across both closes the loop on what actually compounded from one era to the next, and the honest answer is: the principles did, the tooling didn't.

The medical-IoT years were the first draft. Cutting my teeth, literally — on a product you keep in a cup by the sink.

So the one thing I'd tell anyone starting on the platform side of a connected hardware product: build the architecture that survives the third device line, not the one that ships the first. The first device makes the demo. The third device is where a platform either pays you back or sends you the bill. Build for the third one. The work compounds — that's the entire point, and it's the only reason any of these thirteen unglamorous months were worth it.
