Notebook · 07 parts Building Medical IoT Connected Products
Two years (2017–2019) building the API platform behind a BLE consumer-health portfolio. Phone-as-gateway, home-grown REST, and the OTA push that haunted my dreams.
Seven posts covering the platform side of building consumer-health connected devices in the era before AWS IoT Core was a default — drawn from two years (2017–2019) leading the API platform at Philips Connected Health. Written from the platform-engineering perspective: I was the API provider, not the hardware integrator. Read in order it's a two-year chronology; read out of order each one stands alone. This is the v1 — the work that taught me what "connected product" actually means before the modern IoT toolchain made it easy.
Designing a connected health device with BLE 4.2
Taking over the API platform behind a connected toothbrush line means first reckoning with what BLE 4.2 actually offers — and what it doesn't.
I'm six weeks into running the API platform behind a connected-health portfolio. The flagship is a connected-toothbrush line — an adult brush with brush-head-aware features, and a kids' brush with a companion app that turns brushing into a game. These first six weeks have been a crash course in what consumer-health BLE actually looks like in 2017, and most of what I assumed coming from web services turned out to be wrong.
Here's the one fact that reorganizes everything else: the brush has no radio but Bluetooth. The only device in BLE range is the customer's phone, so every connected feature — telemetry up, firmware down, the auth that proves any of it is real — pivots through that phone. You don't get to decide whether the phone is in your architecture. It already is, on every path that matters.
What BLE 4.2 is, in 2017
Bluetooth Low Energy 4.2 is the standard we build to. BLE 5.0 was adopted in late 2016, but consumer phones are still split — plenty of the Android handsets in our install base will never see a 5.0 stack, and Apple hasn't said much about what's in the next iPhone. Targeting 4.2 is the only call that covers the customers we actually have.
The numbers that shape every decision downstream:
- MTU: 23 bytes per ATT packet by default. Three of those bytes are header, so you get 20 bytes of payload. A larger ATT MTU can be negotiated, and with 4.2's LE Data Length Extension an MTU of 247 still fits in a single link-layer packet. Plenty of older phones won't negotiate either, so 20 bytes is the number you design against.
- Throughput: practical max around 10 kbps in 2017-era pairings once you account for connection-interval throttling and the BLE link layer. Faster on a flagship Android, slower on iOS.
- Connection interval: anywhere from 7.5 ms to 4 s. You request an interval; iOS overrides it to whatever it prefers, and the phone — not the device — owns that decision.
- Bonding: after the first pair, a long-term key (LTK) is cached on both the phone and the device. That's the foundation of trust on this platform; I'll come back to it.
- GATT: every feature is exposed as a service (a UUID) holding one or more characteristics (each its own UUID), and each characteristic supports some mix of read, write, notify, and indicate. This is the entire API surface the phone sees over the air.
- LE Secure Connections: also new in 4.2. Pairing is backed by ECDH key agreement (P-256) instead of the LE Legacy Pairing key exchange that 4.0/4.1 left us with, which a passive eavesdropper can break. It matters more here than on a fitness band, because the data is health data.
So the protocol-design problem reduces to one sentence: encode telemetry into GATT characteristics that fit through a 20-byte pipe over a transport you don't control. The cloud platform's job is to be sane about everything that pipe drips out the far end.
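To make the 20-byte pipe concrete, here is a minimal sketch of splitting an encoded session into notification-sized chunks. The 1-byte sequence prefix and the function names are assumptions for illustration, not the brush's actual wire format.

```python
from typing import List

ATT_MTU_DEFAULT = 23   # bytes per ATT packet before any MTU negotiation
ATT_HEADER = 3         # opcode (1 byte) + attribute handle (2 bytes)
SEQ_PREFIX = 1         # hypothetical 1-byte sequence number per chunk


def chunk_payload(payload: bytes, att_mtu: int = ATT_MTU_DEFAULT) -> List[bytes]:
    """Split a payload into chunks that each fit in one notification."""
    room = att_mtu - ATT_HEADER - SEQ_PREFIX   # 19 data bytes at the default MTU
    if room <= 0:
        raise ValueError("ATT MTU too small to carry data")
    chunks = []
    for seq, start in enumerate(range(0, len(payload), room)):
        # The sequence number wraps at 256; a receiver can use it to spot gaps.
        chunks.append(bytes([seq % 256]) + payload[start:start + room])
    return chunks


def reassemble(chunks: List[bytes]) -> bytes:
    """Strip the sequence prefix and concatenate, assuming in-order delivery."""
    return b"".join(chunk[1:] for chunk in chunks)
```

Under these assumptions, a 200-byte session record comes out as 11 notifications at the default MTU, which is why the payload encoding has to be stingy.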
How the brush exposes itself: a GATT profile, not an API
Before the architecture, the thing on the wire. The brush doesn't have an HTTP API; it has a GATT profile — a little tree the phone walks after it connects. Services group related characteristics; each characteristic is a typed value the phone can read, write, or subscribe to. For our brush it looks roughly like this: a device-information service (serial, firmware revision), a battery service, and a vendor brushing service whose characteristics carry session data out and accept configuration in.
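As a rough picture of that tree, the sketch below models a profile as plain data. The 16-bit UUIDs for the device-information and battery services are the Bluetooth SIG assigned numbers; the vendor service UUIDs, the characteristic names, and the property choices on the vendor service are placeholders I made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Characteristic:
    name: str
    uuid: str
    properties: Tuple[str, ...]   # subset of: read, write, notify, indicate


@dataclass(frozen=True)
class Service:
    name: str
    uuid: str
    characteristics: Tuple[Characteristic, ...]


PROFILE = (
    Service("device_information", "180a", (
        Characteristic("serial_number", "2a25", ("read",)),
        Characteristic("firmware_revision", "2a26", ("read",)),
    )),
    Service("battery", "180f", (
        # Notify on battery level is optional in the SIG profile; assumed here.
        Characteristic("battery_level", "2a19", ("read", "notify")),
    )),
    # Vendor service: every UUID below is a placeholder, not a real assignment.
    Service("brushing", "00000000-0000-4000-8000-0000000000b0", (
        Characteristic("session_data", "00000000-0000-4000-8000-0000000000b1", ("notify",)),
        Characteristic("control_point", "00000000-0000-4000-8000-0000000000b2", ("write", "indicate")),
        Characteristic("configuration", "00000000-0000-4000-8000-0000000000b3", ("read", "write")),
    )),
)


def characteristics_with(profile: Tuple[Service, ...], prop: str) -> List[str]:
    """List the names of characteristics that support a given operation."""
    return [c.name for s in profile for c in s.characteristics if prop in c.properties]
```

Calling `characteristics_with(PROFILE, "indicate")` on this placeholder profile returns only the control point, which matches the split between bulk data and state-changing messages.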
The operation each characteristic supports is the part that bites you. Notify lets the device push a value without an ack — cheap, but lossy if the phone misses it. Indicate is the acked version — reliable, but it costs a round trip per packet, and at a 20-byte MTU a single brushing session is a lot of packets. We stream bulk session data over notify and reserve indicate for the control point, where losing a "sync complete" would actually corrupt state. That choice — notify for volume, indicate for correctness — is the kind of thing no spec tells you; the radio budget tells you.
The constraint that shapes everything: no Wi-Fi on the device
The brush has no antenna for Wi-Fi. No cellular. No port you'd plug into a network. The only radio is BLE, and the only thing in BLE range is the user's phone. That single fact dictates the architecture:
- The phone is the gateway. Every byte of telemetry reaches the cloud through the phone's app — there is no other door.
- Connectivity is intermittent. Someone brushes for two minutes, twice a day. The brush is in range, with the app foregrounded, for maybe four minutes out of every twenty-four hours. The other 23 hours and 56 minutes, the cloud has no idea the device exists.
- Storage on the device is non-trivial. If the app is closed when they brush — the common case — the device has to store the session and replay it on next connect. That means on-device flash for telemetry, plus a sync protocol with backfill and a cursor.
- Auth has to anchor on the phone. The device can't reach the cloud to prove itself. Trust has to bridge two domains: phone-to-cloud (a normal OAuth dance) and device-to-phone (BLE bonding). Stitching those two together is its own problem, and I'll give it its own post.
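The store-and-replay requirement in the list above can be sketched as a device-side log plus a cursor that the phone advances only after a confirmed upload. Everything here is a simplified stand-in: `DeviceLog`, `drain`, and the batch size are hypothetical names and values, and a Python list plays the role of on-device flash.

```python
from typing import Callable, List, Tuple

Record = Tuple[int, bytes]   # (monotonic counter, encoded session)


class DeviceLog:
    """Stand-in for on-device flash: sessions tagged with a monotonic counter."""

    def __init__(self) -> None:
        self._next_counter = 1
        self._records: List[Record] = []

    def record(self, session: bytes) -> int:
        counter = self._next_counter
        self._records.append((counter, session))
        self._next_counter += 1
        return counter

    def read_after(self, cursor: int, limit: int) -> List[Record]:
        """Return up to `limit` records whose counter is greater than `cursor`."""
        return [r for r in self._records if r[0] > cursor][:limit]

    def trim_through(self, cursor: int) -> None:
        """Free storage only for records the phone has confirmed."""
        self._records = [r for r in self._records if r[0] > cursor]


def drain(log: DeviceLog, cursor: int,
          upload: Callable[[List[Record]], bool], batch: int = 10) -> int:
    """Backfill everything after `cursor` in batches and return the new cursor."""
    while True:
        records = log.read_after(cursor, batch)
        if not records:
            return cursor
        if not upload(records):
            return cursor   # upload failed: keep the cursor, retry on next connect
        cursor = records[-1][0]
        log.trim_through(cursor)
```

The design choice worth noticing is that the cursor moves only after the upload callback reports success, so a dropped connection costs a repeated batch rather than a lost one.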
Here I want to stay on what this connectivity model does to the API platform.
API platform implications
Coming from web services, my instinct was a synchronous REST model: phone POSTs a session, server responds, done. That instinct breaks immediately, in four separate ways:
- Sessions are recorded on the device, not the phone. The phone has to drain them over BLE before it can upload anything.
- Sessions can be days old when they arrive. Someone brushes all week at home with the app closed; the phone first comes into range and foregrounds the app at an airport gate on Friday. Five days of sessions land at once.
- Sessions can arrive out of order if the device's clock has drifted — small consumer devices don't always carry a battery-backed RTC, so "now" on the brush is a guess between syncs.
- Sessions can be duplicates if the phone-device sync logic loses its place and re-drains a range it already uploaded.
So the API can't be a request/response that trusts what the phone hands it. It has to be an append-only event ingestion endpoint built for replay. Every event — a brushing session, a head-attached, a battery-low — carries an idempotency key derived from (device-serial, monotonic-counter). The counter is the device's, not the phone's, and it only ever increases. Dedup happens server-side on that key, so a re-drained range is harmless. Ordering is reconstructed from the device's own timestamp, with server-side correction when the clock is obviously wrong (a session dated 1970, or three years in the future, gets clamped to its arrival window and flagged).
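A minimal in-memory sketch of that ingestion behavior follows. The class and field names are hypothetical, the plausibility floor (1 January 2017) and the one-day future tolerance are assumed values, and clamping to the arrival time is a simplification of clamping to the arrival window.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

EARLIEST_PLAUSIBLE = 1_483_228_800   # 2017-01-01T00:00:00Z, an assumed floor
FUTURE_TOLERANCE = 24 * 3600         # assumed slack for device clocks running ahead


@dataclass(frozen=True)
class Event:
    device_serial: str
    counter: int        # the device's monotonic counter, never the phone's
    device_ts: int      # seconds since the epoch, according to the device
    kind: str
    clock_suspect: bool = False


class EventStore:
    """Append-only store that deduplicates on (device serial, counter)."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, int], Event] = {}

    def ingest(self, event: Event, arrival_ts: int) -> bool:
        """Return True if the event was stored, False if it was a replay."""
        key = (event.device_serial, event.counter)
        if key in self._events:
            return False   # re-drained range: harmless no-op
        too_old = event.device_ts < EARLIEST_PLAUSIBLE
        too_new = event.device_ts > arrival_ts + FUTURE_TOLERANCE
        if too_old or too_new:
            # Obviously wrong clock: clamp to arrival and flag for later review.
            event = replace(event, device_ts=arrival_ts, clock_suspect=True)
        self._events[key] = event
        return True

    def ordered(self, device_serial: str) -> List[Event]:
        """Reconstruct order from device time, with the counter as tie-breaker."""
        mine = [e for e in self._events.values() if e.device_serial == device_serial]
        return sorted(mine, key=lambda e: (e.device_ts, e.counter))
```

Posting the same `(device_serial, counter)` pair twice stores one row, and a session dated 1970 is kept but flagged rather than silently trusted or dropped.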
If that shape sounds like a hand-rolled, lighter-weight version of MQTT-on-AWS-IoT, that's because it is. AWS IoT exists today — the service is being renamed AWS IoT Core at re:Invent later this year — and we evaluated it seriously. We're not adopting it, and the reason is topology, not religion: an IoT broker expects the device to authenticate and publish directly, and our device can't reach the broker at all. It would have to speak MQTT to the phone, which relays to the broker — at which point the phone is doing all the work and the broker is an HTTP endpoint with a flakier transport in front of it. The right move is to build the cloud side as if it were a broker — append-only, idempotent, per-device attested — so that a future product with Wi-Fi on the device could adopt a real broker without the backend changing shape. I'll write up that decision, and the auth model that anchors it on the phone, on their own.
The same principle holds on Azure or GCP: their device gateways (IoT Hub, the Cloud IoT Core that Google currently offers) also assume a directly-connected device. None of them fit a BLE-only product without the phone in the middle — so the gateway buys you little until the hardware grows its own radio.
The pairing handshake, and why bonding is the trust anchor
The reason the phone can be a gateway and not just a relay is bonding. On first pair under LE Secure Connections, the device and phone run an ECDH (P-256) exchange, derive a shared long-term key, and each store it. From then on the link is encrypted with that LTK, and either side can recognize the other on reconnect without re-pairing. That stored LTK is the device's only durable relationship with anything in the world.
It's worth being blunt about what that key is and isn't. Bonding establishes trust between two specific physical objects — this brush and this phone. It says nothing about which human is holding the phone, and it does not extend to the cloud. So the LTK secures the BLE link, but it can't be the thing the cloud trusts; the cloud has never seen it. That gap — link trust on one side, account trust on the other, and a device that touches neither directly — is exactly the bridge I'll have to build when I get to the auth model. For now it's enough to see that bonding is necessary and nowhere near sufficient.
iOS Core Bluetooth vs Android BluetoothGatt
I run the API platform; two senior engineers run the mobile side. The same two complaints surface from them every week, and both land back on my doorstep as ingestion behavior:
- iOS Core Bluetooth: cleaner abstractions, stricter cage. Background scanning is gated behind specific declared use cases — Apple does not love a consumer-health app scanning for BLE in the background — so we can't assume the app wakes itself to drain the brush. State preservation across backgrounding mostly works. Connection intervals are whatever iOS decides; our request is a suggestion.
- Android BluetoothGatt: messier abstractions, hidden cliffs. A missed onConnectionStateChange callback can leave the stack in a half-connected state the app thinks is healthy. Behavior diverges hard across Samsung, Huawei, and Xiaomi, and auto-reconnect is unreliable enough that we treat it as best-effort.
The throughline for the platform: the gateway is unreliable on every axis at once. Telemetry shows up late, out of order, duplicated, or not at all — and on mobile OS terms we don't control. That's the median case, not the tail. The cloud has to be built for it, because there is no fixing it from the phone.
What it cost me to learn this
I'll name the mistake, because it set us back a sprint. Coming in, I had the cloud team stand up the obvious thing first: a POST /sessions endpoint that took the phone's payload, trusted its timestamp, and wrote a row. It demoed perfectly on a desk, where the phone is always in range and the clock is always right.
It fell apart the first week real units were in real pockets. Sessions arrived in clumps days late, stamped with a drifted device clock, and — the one that actually hurt — duplicated, because our first sync-cursor logic re-drained a range after a dropped connection. Engagement dashboards showed people brushing twelve times a day. For a health product, inflated adherence numbers aren't a cosmetic bug; they're the kind of thing that, downstream, someone might quote to a clinician. We caught it, but only because the numbers were absurd enough to disbelieve. Had the duplication been 10% instead of 100%, it would have shipped.
The fix was to stop trusting the gateway and move idempotency and ordering server-side — the append-only model above. The lesson I'd hand the next team: the demo that works on your desk is lying to you, because your desk has none of the conditions the product lives in. Build for late, duplicated, and out-of-order on day one, or rebuild for it on day thirty.
What I want to carry forward from this
Three principles from this opening period I intend to hold to:
- Treat the phone as a flaky gateway, not a trusted client. Sign telemetry on the device, verify in the cloud, and don't take the phone's word for anything load-bearing — least of all timestamps and counts.
- Design the ingestion API for replay. Append-only events, idempotency keyed on the device's own monotonic counter, ordering reconstructed server-side. No durable state on the gateway, because the gateway forgets.
- Bake the radio constraint into the product spec. The hardware team's ~10 kbps practical throughput isn't a detail — it's a product constraint that sets feature scope. Two-way streaming feedback during a brushing session isn't viable through a 20-byte pipe; one direction, batched and replayed, is. Features that ignore the radio budget don't ship; they just find out later.
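For the first principle, one possible shape of "sign on the device, verify in the cloud" is a keyed MAC over the record. This is only a sketch under stated assumptions: it uses HMAC-SHA256 with a per-device secret that the cloud also holds, which is my choice for illustration and not something the posts specify. An asymmetric signature would be a reasonable alternative, and the record encoding below is hypothetical.

```python
import hashlib
import hmac
import struct


def encode_record(serial: str, counter: int, payload: bytes) -> bytes:
    """Length-prefix the serial so field boundaries are unambiguous."""
    serial_bytes = serial.encode("utf-8")
    return (struct.pack(">B", len(serial_bytes)) + serial_bytes
            + struct.pack(">I", counter) + payload)


def sign(device_key: bytes, serial: str, counter: int, payload: bytes) -> bytes:
    """What the device would compute before handing a record to the phone."""
    message = encode_record(serial, counter, payload)
    return hmac.new(device_key, message, hashlib.sha256).digest()


def verify(device_key: bytes, serial: str, counter: int,
           payload: bytes, tag: bytes) -> bool:
    """What the cloud would check, using a constant-time comparison."""
    expected = sign(device_key, serial, counter, payload)
    return hmac.compare_digest(expected, tag)
```

Because the counter is inside the signed message, a phone that rewrites a count or replays a payload under a different counter fails verification.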
What's next
Before any of this architecture earns its keep, there's a more basic question I skipped past: what kind of product is this, legally? A toothbrush that gamifies brushing and a device that records a physiological signal are governed very differently, and the answer decides which of these design choices are nice-to-have and which are non-negotiable. The next post takes that on — HIPAA, FDA Class I, and what actually counts as medical-device data.
HIPAA, FDA Class I, and what counts as medical-device data
Before the API platform can matter, I had to answer a narrower question than 'are we a medical device.' The one that decides the architecture is: which fields are regulated, and what does that classification force on every store they touch?
Four months into the platform job, I spent four hours in a conference room with the privacy office and a printout of the device-event payload. Every field, one line at a time. Is brushing duration protected health information? Is a brush-head-replacement timestamp? Is the device serial number? I came in expecting a yes/no per field and left understanding that I'd asked the wrong question. Almost no single field is regulated on its own. What regulates a field is what it's joined to — and that turns a compliance question into an architecture question, which is the only reason it landed on my desk.
This is the post the last one promised. I'd just spent six weeks establishing that the phone is a flaky gateway and the cloud has to be built for replay. Before any of that architecture earns its keep, there's a more basic question: what kind of product is this, legally — and which bytes flowing through the platform carry that weight?
Two regimes, and the one that's coming
Three things govern this product, and on today's date they're at very different distances.
FDA Class I is here now. The connected toothbrushes are Class I medical devices — the lowest-risk tier, general controls only, no premarket submission. "General controls" still means the device does what its labeling claims, manufacturing follows the Quality System Regulation (21 CFR 820), adverse events get reported under MDR, and the device is registered and listed. None of that is the API platform's problem directly. The platform's exposure to the FDA is narrower and sharper, and I'll get to it.
HIPAA is here now, but conditionally. HIPAA bites when there's protected health information — health data tied to an identifiable person — held by a covered entity or its business associate. Selling a toothbrush direct to a consumer makes the company neither. We're a manufacturer with an app, not a clinic. But we're actively chasing dental-practice partnerships, and the day a practice pulls our data into a patient's chart, the company becomes that practice's business associate and signs a BAA. HIPAA doesn't apply to the platform today; it applies the instant a specific deal closes. The architecture has to be ready to flip that switch per-partner without rewiring.
GDPR is not here yet — it starts enforcing 25 May 2018, four months out. I'm writing the classification now specifically so we're not retrofitting in April. It widens the aperture in a way US privacy law doesn't: under GDPR, the pseudonymous user IDs I'm about to describe are still personal data, and an EU user gets access and erasure rights over them. I'm designing to the coming rule, not the current one, because the install base already has European users and the regulation won't wait for us to be ready.
That's the whole regulatory weather. Everything below is how I turned it into storage decisions.
The line that actually defines "medical-device data"
Here's the distinction the four-hour meeting was really about, and it's the one most people get backwards. The thing that makes data medical-device data — the thing the FDA cares about — isn't the sensor. It's the claim.
The FDA's General Wellness: Policy for Low Risk Devices (final guidance, 2016) draws the bright line I lived against. A product that promotes a general healthy lifestyle — "brush twice a day, you'll have healthier habits" — is a general wellness product, and the FDA exercises enforcement discretion: it doesn't regulate the software. The moment the same product makes a claim about a specific disease or condition — "this detects early gingivitis," "this device treats your periodontitis" — it's no longer wellness. It's a medical claim, and the data feeding that claim, and the software making it, come into scope.
So whether our coverage estimate is "medical-device data" is not a property of the coverage estimate. It's a property of what marketing writes on the box. The same quadrant-coverage number is wellness data under "build better brushing habits" and regulated data under "detects areas you're missing that lead to gum disease." That's terrifying from an engineering seat, because it means a feature can change regulatory class without a single line of firmware changing — someone in another building edits a claim.
The defensive move the privacy office and I agreed on: the platform stores and serves the raw signal, and claims live in the application and marketing layer, never baked into the data contract. The event store knows "quadrant 3 brushed 14 seconds." It does not know, and must not encode, "user is under-brushing in a way that indicates disease risk." Keep the data dumb and the claims thin and movable, and a marketing decision can't silently drag the whole telemetry pipeline into 21 CFR scope.
What we actually store
To classify, you have to enumerate. Per session, the device emits:
- Session start timestamp (local + UTC offset)
- Duration brushed
- Pressure events (when the user presses too hard and the brush vibrates to back off)
- Coverage estimate (which quadrants, how long, derived from the device's motion sensors)
- Brush-head ID at the time of session
- Battery level at session start
- Firmware version
Per device:
- Device serial number (the manufacturing identifier, baked into firmware)
- Bluetooth address (the over-the-air identifier)
- Hardware revision
Per user, in the app:
- Email
- Name
- Date of birth (optional — used by the kids' brush for age-appropriate programs)
- Dentist (optional — for the dentist-portal feature in design)
- Linked device serials
Read that list and the trap is obvious in hindsight: nothing in the device block is health information, and nothing in the user block is, on its own. Email is just email. Coverage is just a number. The regulated thing only exists at the join — the row that says this person brushed this badly. Which means the architecture problem isn't protecting fields. It's controlling joins.
The three tiers we landed on
The privacy office and I settled on three tiers, defined not by sensitivity-in-the-abstract but by what identity is attached server-side.
Tier 1 — device telemetry, no identity. Anything keyed only to a device serial, with no user identity attached on the server. Duration, pressure, coverage, brush-head ID. Ordinary product analytics. Lands in the event store, flows to the data warehouse, no special handling. The overwhelming majority of events live here, and they're allowed to.
Tier 2 — pseudonymous, user-linked. The same telemetry joined to a stable user ID that is not derived from email or any directly identifying field — a random surrogate key, with the mapping held in a separate table. You can ask cohort questions of Tier 2 ("users who replace heads on schedule have 12% fewer pressure events") without ever resolving a row to a human. Under HIPAA's analysis this isn't PHI, because it isn't identifiable without the lookup. Under the GDPR that's coming in May, it is personal data — pseudonymous, but still in scope — which is exactly why I keep the mapping table as its own access-controlled thing rather than a column.
Tier 3 — identifiable. Identity (email, name, DOB) joined to brushing data. This is the only tier HIPAA can ever touch. Stored apart, stricter access, audit logging, encryption at rest under its own key, and a published deletion path.
The point of defining tiers by attached identity rather than by field name is that it survives new fields. When the next sensor or feature shows up, I don't relitigate its sensitivity in the abstract — I ask the only question that matters: does it arrive with identity, and can it be joined to it? The tier falls out of that.
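The tier rule can be written down as a function of attached identity alone. This is a sketch with hypothetical names; the point is that no field name appears anywhere in it.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    DEVICE_ONLY = 1    # keyed only to a device serial
    PSEUDONYMOUS = 2   # joined to a random surrogate user ID
    IDENTIFIABLE = 3   # joined to email, name, or date of birth


def classify(surrogate_user_id: Optional[str], has_direct_identity: bool) -> Tier:
    """Derive the tier from what identity is attached, not from the field."""
    if has_direct_identity:
        return Tier.IDENTIFIABLE
    if surrogate_user_id is not None:
        return Tier.PSEUDONYMOUS
    return Tier.DEVICE_ONLY
```

A new sensor reading passes through the same function unchanged: with no identity attached it is Tier 1, whatever it measures.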
What the classification forces on the platform
Three architectural consequences, and they're the reason this post exists before the domain-model post.
Default to Tier 1; promote only through a logged join. Device events land in the Tier 1 store, full stop. A record is promoted to Tier 2 or Tier 3 only by an explicit join service, and only with a logged reason and actor. Promotion is never a side effect of a write. This inverts the usual instinct — most pipelines collect everything identifiable and lock it down later. We collect de-identified by default and earn our way up, per record, on the record. The audit log of promotions is, in effect, the map of where our regulatory exposure actually is.
No identity in device-event payloads — ever. The phone attaches a user ID to an event before posting it to the cloud; the device firmware never knows who owns it. This is a security property dressed as a privacy rule. Recall from the BLE work that the device exposes its data as GATT characteristics over a link any bonded phone can read. If a firmware bug ever leaks data through an unauthorized characteristic read, the worst case is Tier 1 — anonymous device telemetry, no person attached. The identity lives one hop away, on the phone, behind the app's auth. I expect that to be the single most useful property we have the first time someone runs a security audit against the brush.
The dentist portal is a separate subsystem, not a feature flag. The dentist-portal work (in design now) lives behind its own authentication and audit boundary, physically separate from the consumer app. A practice that signs a BAA can reach the consented patients' data through that door — and the consumer API can never expose those joins, because it has no code path to them. A BAA-gated flow you can turn on per-partner is the switch I mentioned up top; building it as a distinct subsystem is what makes the switch real instead of aspirational.
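The "promote only through a logged join" rule described above could look roughly like the sketch below. `JoinService`, its method names, and the event-key format are hypothetical, and in-memory structures stand in for what would be a separate, access-controlled mapping table and a durable audit log.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Promotion:
    event_key: str
    to_tier: int
    actor: str
    reason: str


class JoinService:
    """The only code path allowed to attach a user to a device event."""

    def __init__(self) -> None:
        # account id -> random surrogate; kept apart from the event store
        self._surrogates: Dict[str, str] = {}
        self.audit_log: List[Promotion] = []

    def _surrogate_for(self, account_id: str) -> str:
        # Random, so the surrogate cannot be derived from email or account id.
        if account_id not in self._surrogates:
            self._surrogates[account_id] = uuid.uuid4().hex
        return self._surrogates[account_id]

    def promote_to_tier2(self, event_key: str, account_id: str,
                         actor: str, reason: str) -> str:
        """Join an event to a pseudonymous user and record who did it and why."""
        if not actor or not reason:
            raise ValueError("promotion requires an actor and a reason")
        surrogate = self._surrogate_for(account_id)
        self.audit_log.append(Promotion(event_key, 2, actor, reason))
        return surrogate
```

The refusal to promote without an actor and a reason is the part that matters: the audit log then doubles as the list of rows with regulatory exposure.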
Encryption and isolation — what 2018 actually gives me
End to end: the device-to-phone leg is BLE-encrypted under the long-term key from bonding; the phone-to-cloud leg is TLS 1.2 over Wi-Fi or LTE. Inside the cloud, everything is encrypted at rest with KMS, and Tier 3 gets its own KMS key under stricter IAM.
The part worth being concrete about — because it's where the era bites — is where Tier 3 can live. AWS will sign a BAA, but only a subset of services are HIPAA-eligible, and the list in early 2018 is shorter than people assume. The managed primitives I'd reach for reflexively aren't all on it yet. Our high-volume Tier 1 event store leans on DynamoDB, which is not HIPAA-eligible at this writing — fine, because Tier 1 carries no PHI. But that means Tier 3 can't just be "the same store with a flag." Identifiable data goes into HIPAA-eligible services: RDS and S3 with encryption, on EC2 capacity we're allowed to run PHI on, under the signed BAA. The tiering isn't only a privacy model; it's the thing that lets the bulk of our data sit on the convenient, cheap, not-yet-eligible service while the small regulated slice sits on the eligible one. If I'd designed a single identifiable store, I'd have had to put all of it on the eligible subset and pay for that everywhere.
For HIPAA-business-associate flows specifically, Tier 3 lives in a separate AWS account from the rest of the platform, reached by cross-account IAM roles for the few services that need it. The account boundary is the strongest isolation primitive AWS offers — stronger than IAM policy alone, because a misconfigured policy in the main account can't reach across an account line it was never granted. As the dentist-partnership product spins up, that boundary is what I'll point an auditor at.
Retention, because someone always forgets it
Consumer data is retained for the life of the account. Tier 1 telemetry is kept 18 months, then rolled into monthly aggregates — small enough to keep forever, identifiable of nothing. Tier 3 under a BAA follows the BAA's terms, which for clinical records tend to land around seven years, with deletion on patient request through a documented process. Writing the deletion process down now matters more than it looks: the GDPR erasure right arriving in May means "we can delete a person on request" stops being a nice-to-have and becomes a thing I have to be able to demonstrate, across Tier 2 and Tier 3 both. A deletion you can't prove you performed is, to a regulator, a deletion you didn't perform.
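The Tier 1 rollup can be sketched as a pass that keeps recent sessions raw and folds older ones into per-month totals. The month-granularity cutoff, the function names, and the choice of aggregate fields are assumptions; the posts only state the 18-month window and the monthly aggregates.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

RETENTION_MONTHS = 18

Session = Tuple[datetime, int]   # (start time, duration in seconds)


def month_index(ts: datetime) -> int:
    """Map a timestamp to a running month number for easy subtraction."""
    return ts.year * 12 + (ts.month - 1)


def roll_up(sessions: Iterable[Session],
            now: datetime) -> Tuple[List[Session], Dict[str, Dict[str, int]]]:
    """Return (sessions kept raw, monthly aggregates for everything older)."""
    cutoff = month_index(now) - RETENTION_MONTHS
    kept: List[Session] = []
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for start, duration in sessions:
        if month_index(start) >= cutoff:
            kept.append((start, duration))
        else:
            bucket = totals[start.strftime("%Y-%m")]
            bucket[0] += 1          # session count
            bucket[1] += duration   # total seconds brushed
    aggregates = {
        month: {"sessions": count, "total_seconds": seconds}
        for month, (count, seconds) in totals.items()
    }
    return kept, aggregates
```

The aggregates carry no device serial in this sketch, which is what lets them be kept indefinitely without identifying anything.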
What it cost me to get here
I'll name the mistake, because it shaped the whole tiering. My first instinct — straight from web-services habits — was to mint one durable user ID and stamp it on every event at ingestion, identity and all, then restrict reads later. Clean joins, simple pipeline, one ID to rule them all. The privacy office killed it in about ten minutes, and they were right: that design makes every event Tier 3 the moment it lands, drags the entire high-volume telemetry stream into the regulated, HIPAA-eligible, separate-account world, and means a single over-broad read grant exposes identifiable health data at fleet scale. I'd have been encrypting and isolating everything — paying the cost of PHI handling on millions of events that didn't need it — and still carrying more risk, not less, because the identifiable join was everywhere instead of nowhere.
The reframe that fixed it is the one principle I'd hand the next team: don't ask how sensitive a field is; ask what it's joined to, and default to joining it to nothing. Sensitivity is a property of relationships, not values. Build the pipeline so the default state of every datum is de-identified, make every promotion an explicit, logged act, and your regulatory surface shrinks to exactly the rows you chose to elevate — which is also exactly the set you can hand an auditor without a sweep.
What's next
All of this presumes one thing I haven't built yet: a clean model of which entities exist and which are allowed to touch which. You can't enforce "promote only through a logged join" if you have eight different definitions of "user" and "device" scattered across eight APIs — and that's exactly the portfolio I inherited. The next post takes on consolidating those eight device APIs into one entity domain model, because the classification I just spent four hours and twelve memo pages on is only as good as the domain it's enforced against.
From eight device APIs to one entity domain model
The connected-health portfolio I inherited has eight separate device APIs in production. Each has its own user model, its own session contract.
When I took the platform role in September 2017, the connected-health portfolio had eight separate device APIs in production. The adult toothbrush had one. The kids' brush had another. The interdental device had a third. There were three more for other body-care lines, plus two legacy services that were technically deprecated but still serving traffic. Each had been built by a different product team at a different time with a different stack.
The first quarter of my tenure was diagnostic. The second quarter — just finished — produced the consolidation plan. The next four quarters are the migration. This is what we're doing and how.
What "eight APIs" actually means
Each device line has:
- Its own definition of User (email-only in some, email+phone+DOB in others, OAuth-federated in two).
- Its own concept of Session (the toothbrushes have brushing sessions, the interdental has flossing sessions, body-care has usage events).
- Its own auth (some use OAuth 2.0 with the internal IDP, some use per-device tokens, two use custom HMAC headers).
- Its own analytics pipeline (some write to Redshift, some to a third-party warehouse, the legacy two write to flat files in S3).
- Its own SDK for the mobile app (each app team integrated a different SDK).
A single user with a brush and an interdental device has two accounts, two SDK integrations, and zero shared session history. From a product perspective, "see your full oral-care timeline in one place" is a feature request that doesn't fit without platform work first. From a billing-engineering perspective, deduplicating that user across systems for marketing is a quarterly fire drill.
The domain model we landed on
I led a six-week design exercise with senior engineers from each product line, plus product and privacy. We came out the other end (last month) with a shared entity model:
- Account: the human, with one identity (email + DOB as needed) and a privacy classification tier.
- Device: a physical object with a serial number, a hardware revision, a firmware version, and a product type. Devices belong to Accounts (via a join entity that supports transfer of ownership).
- Consumable: brush heads, in our context. A Consumable has a type, an attached-at timestamp, and a lifetime estimate. Consumables belong to Devices.
- Session: a discrete usage event. A Session has a Device, an optional Consumable, a start and duration, and a payload of measurements. Sessions belong to Accounts via the Device join.
- Event: a non-session occurrence. Battery low, firmware upgrade complete, device unbonded. Events are append-only and feed analytics.
Five entities. They cover every device line we have. They cover the ones the roadmap is adding through 2019.
The shape that matters is the spine: everything hangs off Account → Device, and Session and Event both anchor back to a real Account through the Device they came from. That's the property the old world didn't have — no event in any of the eight services could be reliably traced to a single human without a deduplication job. Here it's a foreign key.
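Written as types, the five entities and the ownership join might look like the sketch below. The field names and types are my guesses from the descriptions above, and `Ownership` and `owner_at` are hypothetical names for the join entity and the lookup it enables.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Account:
    account_id: str
    email: str
    privacy_tier: int
    date_of_birth: Optional[str] = None


@dataclass
class Device:
    serial: str
    hardware_revision: str
    firmware_version: str
    product_type: str


@dataclass
class Ownership:
    """Join entity between Account and Device; supports transfer of ownership."""
    account_id: str
    device_serial: str
    since: datetime
    until: Optional[datetime] = None


@dataclass
class Consumable:
    consumable_id: str
    device_serial: str
    kind: str
    attached_at: datetime
    lifetime_estimate_days: int


@dataclass
class Session:
    device_serial: str
    started_at: datetime
    duration_s: int
    measurements: Dict[str, float]
    consumable_id: Optional[str] = None


@dataclass
class Event:
    device_serial: str
    occurred_at: datetime
    kind: str


def owner_at(ownerships: List[Ownership], serial: str,
             when: datetime) -> Optional[str]:
    """Resolve which Account a Session or Event belongs to via the Device."""
    for o in ownerships:
        if o.device_serial == serial and o.since <= when and (
                o.until is None or when < o.until):
            return o.account_id
    return None
```

Resolving ownership by time rather than by a single current-owner column is what keeps a transferred brush from rewriting the previous owner's history.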
The conversation that made it work
Domain consolidation is 20% modeling and 80% getting product teams to give up sovereignty. The conversation that broke the logjam, almost verbatim, in a room with the interdental product manager last month:
Me: "I can rebuild your service to use the shared domain model in eight weeks. You'll have one engineer to integrate the new SDK in the app, three weeks of effort. After that, every feature in the platform shows up in your product for free — joined timelines, shared analytics, the new dentist portal."
PM: "What do I lose?"
Me: "Eight weeks of headcount you weren't going to spend on this. And the ability to ship a one-off auth scheme next time you have a new device type."
PM: "I have never wanted to ship a one-off auth scheme."
That kind of trade has been repeatable. I have a slide that says "what you gain / what you lose / what it costs you" for every product line, and the column where I name what they actually give up is always shorter than the gain column once we've talked through it.
The API surface over the model
Settling the entities is the hard 80%. The shape of the API on top of them is the easier 20%, but it's where I had to make a call I expect to get second-guessed, so I'll show my work.
The platform exposes a resource-oriented REST API — /v2/accounts/{id}, /v2/devices/{serial}, /v2/devices/{serial}/sessions. Five entities, predictable nesting, JSON over HTTPS. The runtime is AWS API Gateway in front of Lambda, with the gateway doing JSON-schema request validation on the way in so a malformed body never reaches a function. Nothing exotic — that's the point. Every mobile engineer on every one of those app teams already knows how to consume REST, and the institutional muscle for caching, pagination, and versioning a REST surface is decades deep. For a platform whose first job is to stop being eight things, the boring choice is the correct one.
I did seriously look at the alternative. The "see your full oral-care timeline" feature is exactly the multi-entity, nested read that REST is clumsy at — fetching an Account, its Devices, each Device's recent Sessions, and the attached Consumable is four or five round trips or a pile of bespoke ?include= query params. GraphQL solves precisely that: the client asks for the graph it wants in one query, and our five-entity model is, almost literally, a graph already. Facebook open-sourced it in 2015 and the tooling is maturing fast; the mobile leads have been reading about it.
I'm not building on it yet, and the reason is risk, not taste. It's young, the server-side libraries are still churning, and — the real blocker — I can't put a brand-new query layer on the critical path of a migration whose entire selling point to the product teams is low risk, low effort. The move I'm actually making is to model the domain so a GraphQL layer could be laid over the same entities later without reshaping anything underneath. The entities are the contract; REST is just the first projection of them. If GraphQL is the right read API in 2019, the model won't have to change to get there.
The strangler-fig migration
Each existing API stays live during the migration. The new platform exposes equivalent endpoints under /v2/. The mobile SDKs are being updated to dual-write — every event posted goes to both the old service and the new platform. Reconciliation runs nightly. After 30 days of clean reconciliation per device line, we cut reads over to v2. After 90 days, we shut down the old write path.
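A rough sketch of the reconciliation gate, under two assumptions of mine: that events carry a comparable id in both systems, and that "30 days of clean reconciliation" means 30 consecutive clean nights. The function names are illustrative only.

```python
from datetime import date, timedelta


def reconcile(old_events: set[str], new_events: set[str]) -> dict:
    """Compare one night's event ids between the old service and /v2/."""
    return {
        "missing_in_v2": old_events - new_events,
        "missing_in_old": new_events - old_events,
    }


def is_clean(report: dict) -> bool:
    """A night is clean when neither side is missing anything."""
    return not report["missing_in_v2"] and not report["missing_in_old"]


def ready_for_read_cutover(nightly_clean: dict[date, bool], today: date,
                           required_days: int = 30) -> bool:
    """True only if each of the last `required_days` nights reconciled cleanly."""
    for offset in range(1, required_days + 1):
        night = today - timedelta(days=offset)
        if not nightly_clean.get(night, False):
            return False       # a dirty or missing night resets the clock
    return True
```

Treating a missing night the same as a dirty one is the conservative reading: if reconciliation didn't run, the line hasn't earned its cutover.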
The adult toothbrush is migrating first — biggest user base, most engineering investment available, most to gain from the new dentist portal feature. We're estimating 14 weeks end to end. The kids' brush is second; we expect 8 weeks (we should have learned by then). The interdental device is third; 6 weeks. The two legacy services will be drained by attrition — write-only adapters, no further investment, deprecated in the apps after one more release cycle.
Projected total elapsed: roughly 13 months from the design exercise to the last device line fully on the new platform. That puts the finish line around March 2019.
What this enables
Three things, in roughly the order they should pay off.
One: shared OTA. Once every device line is on the same platform, the OTA pipeline becomes a single product instead of N copies. We'll be able to ship firmware updates to the secondary devices using the primary pipeline with no additional infrastructure. The cost-per-device-line for new firmware features should drop to near-zero.
Two: the dentist portal. A multi-device timeline for a single patient becomes a one-week query, not a six-week integration. The dentist portal would be impossible at any reasonable cost on the old fragmented architecture.
Three: the next-device experience. When the product team adds the next brush model later this year, the API integration should be three days. Not three weeks. Not three months. The hardware team designs a new firmware build, sends us a sample device, and we onboard it through the existing API.
The cost
I won't pretend this is free. We'll burn about 2.5 FTE-years on the consolidation, drawn from the engineering team. The legacy services keep paging two of those engineers during their migrations. I've already lost one product manager partly because the migration delayed a feature they cared about by a quarter. Two security audits will have to be redone because the model changed mid-cycle.
The honest assessment is that the consolidation is the highest-leverage thing I'm doing in this role. It should pay for itself within the first year and compound every year after.
The takeaway for any platform leader
If you're inheriting a portfolio of connected products built at different times by different teams, the entity domain model is the highest-leverage place to invest. The hardware will refresh on its own cadence. The mobile apps will rewrite themselves on the front-end framework du jour. The cloud infrastructure will change vendors twice. The domain model (what is an Account, what is a Device, what is a Session) is the thing that survives all of it.
If the model is wrong, every feature in the platform pays tax on it forever. If the model is right, the next ten years of product launches get cheaper.
The next post will be on the auth model we're designing to anchor every event in this domain to a real device and a real human — without the device ever talking to the cloud directly.
Phone-as-gateway — the auth model for BLE-only devices
A BLE-only health device can't authenticate to the cloud directly — the customer's phone has to carry its identity across. So how do you stop that phone from forging the device it's supposed to be speaking for?
The brush has no radio but Bluetooth. I wrote about what that does to the platform a while back — late, duplicated, out-of-order telemetry, all of it arriving through the customer's phone because there's no other door. That post ended on a promise I deferred: the device can't reach the cloud to prove who it is, so trust has to bridge two separate domains, and stitching them together is its own problem. This is that post.
Here's the problem stated plainly. Every byte the cloud ever sees about a device — every brushing session, every battery reading, eventually every firmware acknowledgement — arrives inside an HTTPS request made by the phone, not the device. The device signs nothing the cloud can check unless we make it. So the cloud is being asked to believe a claim of the form "a real Device #4471, bonded to me, recorded this session" — and the entity making that claim is a phone app we shipped to an app store, running on hardware we don't control, that a determined attacker can decompile, instrument, or replace outright. The phone is a forgeable middleman, and the whole auth model is about making its forgeries useless.
Two trust relationships that don't touch
Start by being precise about what trust we actually have, because there are two completely separate relationships here and the temptation is to treat them as one.
Relationship A: device ↔ phone, established by BLE bonding. The user pairs in the app; under LE Secure Connections the device and phone run a P-256 ECDH exchange and each cache a long-term key (LTK). After that the link is encrypted under the LTK, and each side recognizes the other on reconnect. The trust here is between two specific physical objects — this brush, this phone. It says nothing about which human is holding the phone, and the LTK never leaves either device, so the cloud has never seen it and can't use it.
Relationship B: phone ↔ cloud, established by OAuth 2.0. The user logs into their account in the app and gets back an access token — a bearer JWT in our case. The token authorizes API calls on that user's behalf. The trust here is between an authenticated human session and our backend. Standard mobile auth, and it says nothing about any particular device.
Look at where each relationship terminates and the gap jumps out. Bonding proves device-to-phone but dead-ends at the phone; the cloud isn't a party to it. The token proves human-to-cloud but says nothing about which device's data is riding along. And the device — the thing whose data we actually care about being authentic — is a party to exactly one of the two relationships and never once talks to the cloud. The phone is the only thing that sits in both domains. That's not a convenience. That's the attack surface.
The three questions the API has to answer
When an upload lands, the ingestion endpoint has to answer three questions, and — this is the whole point — it has to answer them with cryptography, not with policy or trust in the caller:
- Was this session actually recorded by a real device from our line? Not synthesized by an instrumented app, not crafted to inflate an engagement metric, not lifted from someone else's account and replayed.
- Was that device legitimately bonded to the account uploading it? Not a unit that was sold on, not one re-pointed at a stranger's account without going through a transfer.
- Is the human logged into this phone the one who owns the device? Not an ex-partner who knows the password, not a borrowed handset.
Each maps to one of the things we can actually establish. Q3 is the OAuth token — it's exactly what Relationship B proves. Q1 and Q2 are the hard ones, because the only entity in a position to assert them is the phone, and the phone is precisely what we can't trust. Answer those two without trusting the phone and the model holds.
Why bonding alone doesn't get you there
The seductive wrong answer — and the one I argued against in a design review, so I'll own having had to argue it — is "the phone is bonded to the device, the phone is authenticated to the cloud, therefore the cloud can trust what the phone says the device recorded." It chains the two relationships through the phone and calls it done.
It falls apart the moment you write down what an attacker controls. Bonding secures the Bluetooth link; it does not produce any artifact the cloud can verify. By the time session data is sitting in the phone's memory, it has already come out the far end of the encrypted BLE link in cleartext — the phone has to decrypt it to handle it. A modified app, or a script speaking our REST API directly with a valid token, can hand the cloud any session bytes it likes, stamped with any device serial it likes. The LTK doesn't help: it's a link key, not a signing key, the cloud doesn't have it, and even if it did, "this came over a bonded link" is a claim only the phone can make and the phone is the liar. Chaining the relationships through the phone just means the phone's word is load-bearing, which is the one thing we can't allow.
The lesson generalizes past Bluetooth: whatever sits between the device and the cloud is hostile by default — phone, hub, home gateway, doesn't matter. If the only thing vouching for the device's data is the box in the middle, you've authenticated the box, not the device.
What we built: sign on the device, verify in the cloud
The fix is to give the device a voice the phone can carry but can't fake. Every session is signed on the device, and the cloud verifies that signature against a chain that has nothing to do with the phone.
Each unit ships from the factory with its own keypair generated inside a hardware secure element — the same crypto co-processor the platform already relies on for the BLE pairing — and a per-device X.509 certificate signed by our manufacturing CA, with the device serial as the subject. The private key is generated on-chip and never leaves it; not in manufacturing test, not over BLE, not ever. (This is the same factory-PKI posture the connected-products line uses; I'm not reinventing it here, just pointing it at telemetry.)
When the device records a session, before it writes the record to flash it signs the payload — the session bytes plus the device's own monotonic counter — with that private key, using ECDSA on P-256. The signature and the device's certificate travel with the record. The phone drains the record over BLE exactly as before, and uploads it unmodified, wrapped in its own user-auth token. The phone can read the bytes. It cannot alter them without invalidating a signature it has no key to recompute.
On the cloud side, ingestion runs three checks against that one upload:
- The signature, against our CA. Does the certificate chain to our manufacturing root, and does the signature verify over the payload with the public key in that cert? Pass means a real device from our line produced these exact bytes — that's Q1, answered in math.
- The serial, against the account's bond set. The serial is baked into the signed certificate, so the phone can't lie about it. Is that serial in the set of devices currently bonded to this user's account? Pass means this device belongs to this user — that's Q2.
- The token, against the session. The OAuth token identifies the human. Pass means the account owner is the one uploading — that's Q3.
All three must pass or the event is rejected and logged for review. Notice what the phone's role has shrunk to: it's a pipe. It carries a signed blob it can't forge and a token that authenticates a human, and it gets no say in whether the cloud believes the device. That's exactly where you want a hostile middleman — load-bearing for delivery, irrelevant to trust.
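The decision logic of those three checks can be sketched in a few lines. This is a shape, not our ingestion code: `verify_device_signature` is a hypothetical callable standing in for the certificate-chain and ECDSA P-256 verification (not implemented here), `token_account` is assumed to come from an already-validated OAuth token, and the token is checked first only because the bond lookup needs an account to look in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Upload:
    payload: bytes                 # session bytes plus the device counter
    signature: bytes
    cert_serial: str               # serial from the device certificate subject
    cert: bytes                    # device certificate, carried by the phone
    token_account: Optional[str]   # account id from a validated token, or None


def check_upload(
    upload: Upload,
    verify_device_signature: Callable[[bytes, bytes, bytes], bool],
    bond_sets: dict[str, set[str]],
) -> tuple[bool, str]:
    """Return (accepted, reason). All three checks must pass."""
    # Q3: the token identifies an authenticated human.
    if upload.token_account is None:
        return False, "token"
    # Q1: cert chains to the manufacturing CA and the signature covers the payload.
    if not verify_device_signature(upload.cert, upload.payload, upload.signature):
        return False, "signature"
    # Q2: the serial from the signed cert is bonded to this account.
    if upload.cert_serial not in bond_sets.get(upload.token_account, set()):
        return False, "bond"
    return True, "ok"
```

Nothing the phone asserts about the device appears as an input: the serial comes from the signed certificate and the account comes from the token.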
Replay, while we're here
Signing the bytes stops forgery but not replay — a captured upload replays with a perfectly valid signature, because it is valid. Two things close that. The monotonic counter is inside the signed payload, so ingestion dedupes on (device-serial, counter) exactly as the replay-tolerant ingestion design already does for honest duplicates; a replayed session lands on a counter value the log has already accepted and is dropped. And the bond-set check means a session captured from one account can't be replayed into another — the serial won't be in the attacker's bond set. The work I'd already done to tolerate a flaky gateway turned out to be most of what I needed to tolerate a hostile one.
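The dedup half of that is small enough to sketch. This assumes the signature and bond checks have already passed; the `IngestLog` class and its in-memory set are illustrative stand-ins for whatever store actually backs the log.

```python
class IngestLog:
    """Append-only log that accepts each (serial, counter) pair once."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, int]] = set()
        self.records: list[tuple[str, int, bytes]] = []

    def accept(self, serial: str, counter: int, payload: bytes) -> bool:
        key = (serial, counter)
        if key in self._seen:
            return False       # honest duplicate or replay: dropped either way
        self._seen.add(key)
        self.records.append((serial, counter, payload))
        return True
```

The log never needs to decide whether a repeat is malicious. A flaky phone and a hostile one produce the same key, and both get the same answer.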
Why not just put the device on AWS IoT Core
We looked hard at AWS IoT Core — it's been GA since 2015 and it's the obvious place a question like "authenticate a device to a cloud" points you. The model it wants is clean: the device authenticates directly to the broker over mutual TLS with its device certificate and publishes MQTT to its own topic. For a device with its own internet radio, that's the right answer and I'd reach for it without hesitating.
Our device has no internet radio. It can't open a TLS socket to anything; the nearest IP-capable thing in its world is the phone. To use IoT Core we'd have the device speak MQTT to the phone, which relays to the broker — at which point the phone is doing all the work and the broker is an HTTPS endpoint with a flakier transport bolted in front. Worse, mTLS terminates at the phone, so the broker would authenticate the phone's TLS session, not the device's — which drops us right back into trusting the middleman, the exact thing we just spent a design eliminating.
So we kept ingestion as our own signed-event endpoint and borrowed the shape IoT Core would have given us: per-device certs and keys, an append-only log, idempotency keyed on a per-device counter, server-side dedup, every event attributable to a specific attested device. The wire protocol isn't MQTT and the front door isn't the broker, but the trust model is the one a broker would have enforced — pushed up to the application layer where, for a phone-gateway product, it actually belongs. If we ever ship a unit with Wi-Fi on board, the device can connect to a real broker and the cloud-side contract barely moves. (Azure IoT Hub and GCP's Cloud IoT Core have the same directly-connected-device assumption baked in; none of the managed brokers fit a BLE-only product until the hardware grows its own radio.)
The lost-phone problem
Bonding assumes the phone and device are a stable pair. They aren't — people replace a phone every couple of years, and the device long outlives any one handset. So there has to be a way to move a device's bond from an old phone to a new one without mailing the brush back to us. And that flow is a gift to an attacker if you build it wrong: if "re-point this device at my account" is a pure software operation, then anyone who phishes a user's credentials can remotely steal the device's data stream into their own account.
The flow we shipped puts a physical act in the middle of it:
1. The user signs into the app on the new phone and sees their registered devices.
2. They pick "re-pair this device" and the app prompts them to press and hold the button on the brush.
3. The brush, only on that physical button-hold, deletes the old LTK and accepts a new bond.
4. The phone tells the cloud the device is now bonded to this account.
5. The cloud records the bond change, revokes the old phone's authorization to upload for that serial, and logs the transfer for security review.
The button-hold, prompted in step 2 and enforced by the brush in step 3, is the load-bearing part. Without it, an attacker with stolen credentials re-pairs from their own phone and starts uploading, and because the cloud would now list their phone as the authorized uploader for that serial, nothing on the bond side would flag them. With it, the attacker also needs to be physically holding the brush and pressing its button. A button-hold is a low-tech, high-assurance signal that no amount of software-side compromise can spoof, and it's the cheapest strong control on the whole platform.
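As a sketch of where the control lives: the device-side rule is one conditional, and the cloud side is bookkeeping. The names here are hypothetical, and handing the brush a ready-made `new_ltk` is a simplification, since a real LTK comes out of the pairing exchange rather than being passed in.

```python
class Brush:
    """Device-side bond rule: a new bond is accepted only during a button-hold."""

    def __init__(self, ltk: bytes) -> None:
        self.ltk = ltk
        self.button_held = False

    def accept_new_bond(self, new_ltk: bytes) -> bool:
        if not self.button_held:
            return False       # software-only request: refused, old bond kept
        self.ltk = new_ltk     # old LTK is gone once the new bond is stored
        return True


def record_rebond(authorized_phone: dict[str, str], audit_log: list,
                  serial: str, new_phone: str) -> None:
    """Cloud side: swap the authorized uploader and log the transfer."""
    old_phone = authorized_phone.get(serial)
    authorized_phone[serial] = new_phone      # implicitly revokes the old phone
    audit_log.append((serial, old_phone, new_phone))
```

The refusal happens on the brush, which is the one place a remote attacker has no code running.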
What I'd tell a team
- Sign on the device, verify in the cloud, never trust the gateway. Whatever sits in the middle — phone, hub, edge box — is hostile by default. Give the device a cryptographic voice the gateway can carry but can't fake, and the gateway's trustworthiness stops mattering.
- Don't confuse a link key with an identity key. BLE bonding secures the radio between two objects; it proves nothing to a cloud that never saw the key. If the cloud needs to trust the device, the device needs a key the cloud can check.
- Bind authorization changes to a physical act. Re-pairing requires a button-hold on the hardware. The one control a remote attacker can't satisfy is the one that needs hands on the device.
- Build ingestion as if it were an IoT broker even when it can't be one. Per-device attestation, append-only, idempotent on the device's own counter. The shape is right regardless of whether the wire protocol ever becomes MQTT — and it ports cleanly the day the hardware gets a radio.
What's next
The same trust path I just described carries telemetry up. The harder direction is down: pushing a firmware image through that hostile phone and onto the device without ever letting the phone substitute its own. Everything here gets stress-tested when the payload stops being a session record and starts being executable code. That's the next post.
OTA firmware over Bluetooth — pushing the ROM through the phone
The hardest single problem on the connected platform is firmware updates. The device has no Wi-Fi, no internet, no way to download anything on its own, so a new ROM has to crawl through the customer's phone, 20 bytes at a time, without ever being able to brick the device or arrive unsigned.
The single hardest problem on our connected-health platform is over-the-air firmware updates — and it isn't close. Not because firmware is hard; it isn't, particularly. Because the topology is hostile on every axis at once, and a firmware push is where all of it converges:
- The device has no internet. It can't download anything on its own — the only radio it has is BLE, and the only thing in BLE range is the customer's phone.
- The phone has internet, but only sometimes. It's the user's phone, not a gateway we control, and it's foregrounded near the device for maybe four minutes a day.
- BLE 4.2 throughput is ~10 kbps practical in our install base. A 200 KB image is three to four minutes of continuous transfer — and we almost never get three to four uninterrupted minutes.
- The user walks out of range mid-transfer. Closes the app. Lets their battery die. Lets the device battery die. Every one of those is the common case, not the tail.
- And the phone in the middle is hostile by default. When the payload was a session record, a forged one inflated a metric. When the payload is executable code, a forged one runs on the device. The stakes just changed completely.
A failed update on a deployed unit is a support call, a return, and a one-star review. We've shipped about a million units. So the design assumption, stated up front: every transfer will be interrupted, and every image might be hostile. The system is built to be unsurprised by both.
Where this sits in the series
The two posts before this one set the table. The BLE-4.2 post established the physics (no Wi-Fi on the device, a 20-byte payload per packet, a flaky phone as the sole gateway) and the phone-as-gateway auth post established the trust model: sign on the device, verify in the cloud, never trust the thing in the middle. That post carried telemetry up through the hostile phone. This one is the harder direction, pushing a new ROM down through that same phone, and it inherits both problems. The 20-byte pipe makes the transfer slow and interruptible; the untrusted phone means the device cannot take the bytes it's handed on faith.
The flow, end to end
An OTA update moves through seven stages:
1. Cloud build and sign. Engineering builds an image and signs it with our firmware-signing key, the same private key whose public half is burned into every unit at the factory. The signed image lands in a versioned artifact store (S3, behind our platform API) tagged with target hardware revision and version.
2. Cohort selection. The platform decides which units are eligible: by hardware rev, current firmware version, region, and canary tier. Nobody gets an update because they asked; they get it because the cohort logic released it to them.
3. Phone fetch. On its next sync, the app learns the bonded device has an update waiting and downloads the signed image over HTTPS, even if the device isn't in range right then. The phone caches it. This decouples the slow internet fetch from the slow Bluetooth push.
4. Transfer. Next time the user opens the app with the device in range, the app offers the update. They tap install, and the phone streams the image into the device's staging flash bank, one chunk at a time.
5. Verify, then commit. The device receives the whole image into its second bank, checks a SHA-256 over it against the manifest, then verifies the signature against the on-chip key. Only if both pass does it set "boot the new bank next time" and reboot. Verify first; commit second. Never the other way around.
6. Boot and attest. The device boots the new bank and, within the first seconds, sends a "booted clean, version X.Y.Z" up the same signed-telemetry path the session records use. The phone relays it to the cloud.
7. Roll back if it doesn't check in. If that attestation never arrives within a few boot cycles, the bootloader concludes the new bank is bad and reverts to the old one, with no app, no phone, and no user involvement.
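Laid out across the three tiers it touches, the path runs cloud, then phone, then device: build, sign, and cohort selection happen in the cloud; fetch and transfer run through the phone; verify, commit, attest, and rollback happen on the device.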
Stages 5, 6, and 7 are the entire reason a botched push doesn't become a brick. The rest is plumbing; those three are the safety system.
We didn't invent the bootloader — Nordic did
Worth being honest about what we built versus what we bought. The brush runs a Nordic nRF52-series SoC, and Nordic's nRF5 SDK ships a secure bootloader with background DFU that already does the load-bearing work: a dual-bank flash layout, a bootloader region that an update never touches, and — critically — a bootloader that refuses to activate an image unless it's signed with the key we provisioned. We didn't reinvent that. We configured it, provisioned our signing key into it, and wrote the mobile and cloud halves around it.
The dual-bank layout is the whole game. Flash is divided so that Bank 0 holds the running application and Bank 1 receives the incoming image. The current firmware keeps running, untouched, the entire time the new image is crawling in over Bluetooth. Nothing about the live device degrades during a transfer that might take days of stop-and-start. Only after Bank 1 is complete, hashed, and signature-checked does the bootloader swap which bank is active. If anything goes wrong before that swap — and something usually does — Bank 0 was never disturbed, so the device just keeps running the old firmware.
And the bootloader is sacred. We never overwrite it from an OTA. Ever. It's the one piece of firmware programmed at the factory and never replaced in the field, because it's the fallback that gets us out of every other firmware bug. If an application bug leaves the firmware unbootable, the bootloader still runs, still verifies, still rolls back. If we ever had to update the bootloader itself, we'd do it as a service-mode operation at retail, but we've never had to, and I don't plan to design for it. An OTA that can rewrite its own safety net isn't a safety net.
Chunking a ROM through a 20-byte straw
Now the transfer itself. The image goes over BLE in chunks sized to the connection's negotiated MTU. With the LE Data Length Extension — a BLE 4.2 feature — a willing phone gives us up to ~240 useful payload bytes per packet. Plenty of the older Android handsets in our base won't negotiate DLE, and there we're stuck at the BLE default: a 23-byte ATT MTU, three bytes of which are header, leaving 20 bytes of payload. Twenty. For a 200 KB image, that is ten thousand packets in the worst case, and we design against the worst case.
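The arithmetic behind those numbers, as a back-of-envelope sketch. It reads "200 KB" as 200,000 bytes and "~10 kbps" as kilobits per second, both assumptions on my part. The third case is also an assumption: it shows what happens if the 4-byte index and 2-byte CRC described below have to fit inside the same 20 bytes rather than riding on top of them.

```python
import math

IMAGE_BYTES = 200 * 1000       # "200 KB", read here as 200,000 bytes
LINK_BITS_PER_S = 10_000       # "~10 kbps practical", read as kilobits per second


def packets_needed(image_bytes: int, payload_bytes: int) -> int:
    """Number of packets when each one carries `payload_bytes` of image data."""
    return math.ceil(image_bytes / payload_bytes)


def transfer_seconds(image_bytes: int, bits_per_s: int) -> float:
    """Raw payload time at a given link rate, ignoring acks and retransmits."""
    return image_bytes * 8 / bits_per_s


print(packets_needed(IMAGE_BYTES, 20))        # no DLE: 20-byte payload
print(packets_needed(IMAGE_BYTES, 240))       # DLE: roughly 240-byte payload
print(packets_needed(IMAGE_BYTES, 20 - 6))    # framing inside the 20 bytes
print(transfer_seconds(IMAGE_BYTES, LINK_BITS_PER_S) / 60)  # minutes
```

On these assumptions the raw payload time comes out under three minutes; per-chunk acks, framing, and retransmits are the plausible gap between that and the three-to-four-minute figure we see in practice.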
Each chunk carries three things:
- a 4-byte sequence index, so the device can tell exactly which chunk it's looking at and detect a gap;
- a payload sized to the negotiated MTU (20 to ~240 bytes);
- a CRC-16 over the chunk, so a corrupted packet is caught immediately rather than poisoning the image silently.
The device acks every chunk. A good chunk gets an ack-and-advance; a chunk whose CRC fails gets a nack, and the phone retransmits that one chunk. After three failed retransmits the phone gives up and declares the connection dead for now — it doesn't thrash forever on a link that's clearly gone. The CRC is the cheap, fast line of defense at the packet level; the SHA-256 at the end is the expensive, thorough one over the whole image. Two layers, two jobs.
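A minimal sketch of that framing and the retry rule. The byte order, the choice of CRC-16/CCITT (via Python's `binascii.crc_hqx`), and computing the CRC over index plus payload are my assumptions for illustration; the text above only commits to "a CRC-16 over the chunk". The `transmit` callable is a hypothetical stand-in for the BLE write plus the device's ack or nack.

```python
import binascii
import struct


def frame_chunk(index: int, payload: bytes) -> bytes:
    """4-byte little-endian index + payload + CRC-16 over index and payload."""
    body = struct.pack("<I", index) + payload
    crc = binascii.crc_hqx(body, 0xFFFF)
    return body + struct.pack("<H", crc)


def parse_chunk(frame: bytes):
    """Device side: return (index, payload) if the CRC checks out, else None."""
    if len(frame) < 6:
        return None                        # too short to hold index and CRC
    body = frame[:-2]
    (crc,) = struct.unpack("<H", frame[-2:])
    if binascii.crc_hqx(body, 0xFFFF) != crc:
        return None                        # corrupted: the device nacks
    (index,) = struct.unpack("<I", body[:4])
    return index, body[4:]


def send_chunk(frame: bytes, transmit, max_retransmits: int = 3) -> bool:
    """Phone side: send once, then retransmit up to `max_retransmits` times."""
    for _ in range(1 + max_retransmits):
        if transmit(frame):                # transmit returns True on ack
            return True
    return False                           # link declared dead for now
```

When `send_chunk` returns False the phone stops pushing; nothing on the device needs to change, because the partial image is inert where it sits.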
Resume, because nobody finishes in one sitting
A 20-byte pipe means a transfer can stretch across many sessions, and the user has no idea a transfer is even underway. So the device tracks a persisted write cursor: the highest contiguous chunk index it has safely committed to Bank 1. When the link drops — range, app close, dead phone — the partial image just sits in Bank 1, harmless, because Bank 1 isn't the running firmware. On the next connection the phone asks the device "what's your cursor?" and resumes from the next chunk rather than restarting from zero. On a clean link the whole thing might finish in one ~3-minute window; in the field it's far more often three or four windows spread over days. Without resume, a device that never gets one uninterrupted window would never update. Resume is what makes a slow, interruptible pipe converge.
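The cursor logic, sketched with an in-memory list standing in for Bank 1. `StagingBank`, `push`, and `link_budget` (how many chunks get through before the link drops) are illustrative names of mine; on a real device the cursor would be persisted to flash alongside the data.

```python
class StagingBank:
    """Bank 1 write cursor: highest contiguous chunk index committed so far."""

    def __init__(self) -> None:
        self.cursor = -1                   # nothing committed yet
        self.chunks: list[bytes] = []

    def write(self, index: int, payload: bytes) -> bool:
        if index != self.cursor + 1:
            return False                   # gap or repeat: refused, cursor unchanged
        self.chunks.append(payload)
        self.cursor = index                # persisted to flash on a real device
        return True


def push(bank: StagingBank, chunks: list[bytes], link_budget: int) -> None:
    """Ask for the cursor, then send from the next chunk until the link drops."""
    start = bank.cursor + 1
    for index in range(start, min(start + link_budget, len(chunks))):
        bank.write(index, chunks[index])


image_chunks = [bytes([i]) * 20 for i in range(10)]
bank = StagingBank()
for window in range(3):                    # three short windows, days apart
    push(bank, image_chunks, link_budget=4)
print(bank.cursor == len(image_chunks) - 1)
```

Each window only ever moves the cursor forward, so any number of interrupted sessions still converges on a complete image.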
Verify, then commit — the power-loss-safe core
Here's the discipline that keeps a half-written update from being a brick, stated as a rule: the device never makes a staged image bootable until it has proven the image is both complete and authentic, and the running firmware is never touched until that moment. Walk it step by step:
1. Chunks accumulate in Bank 1. Bank 0 keeps running. The active-bank pointer still says Bank 0.
2. The last chunk arrives. The device computes a SHA-256 over all of Bank 1 and compares it to the hash in the signed manifest. On a mismatch, even one bad bit the CRCs somehow let through, Bank 1 is discarded. Nothing else happens.
3. The hash matches, so the device verifies the image's signature against the public key burned into the chip. (More on why that step is non-negotiable in the next section.) Fail, and Bank 1 is discarded.
4. Both pass. Now, and only now, the device flips the active-bank pointer to Bank 1 and reboots. This pointer flip is the single atomic commit. Before it, the device boots old firmware; after it, new. There is no in-between state where the device boots a half-written image, because the pointer is never set until the image behind it is whole and trusted.
That ordering is the whole power-loss story. Lose power anywhere in steps 1 through 3 and Bank 0 is pristine: the device boots the old firmware and, on the next connection, either resumes the partial Bank 1 from its persisted cursor or discards it and starts over. Lose power during the reboot at step 4 and the bootloader, on its next run, sees the new pointer and a fully-verified Bank 1, and proceeds. The dangerous operation, making code bootable, is reduced to flipping one flag, the fastest, most atomic thing the device does, and it happens only after every check has passed.
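The same ordering as a sketch. This is not Nordic's bootloader code: `DualBankDevice` is a toy model, and `signature_ok` is a hypothetical callable standing in for the bootloader's signature check against the burned-in public key.

```python
import hashlib
from typing import Callable


class DualBankDevice:
    def __init__(self) -> None:
        self.active_bank = 0               # Bank 0 is the running firmware
        self.bank1 = b""                   # staged image

    def finalize(self, manifest_sha256: str,
                 signature_ok: Callable[[bytes], bool]) -> str:
        # Complete? Hash the whole staged image against the manifest.
        if hashlib.sha256(self.bank1).hexdigest() != manifest_sha256:
            self.bank1 = b""
            return "hash_mismatch"
        # Authentic? Only checked once the hash has matched.
        if not signature_ok(self.bank1):
            self.bank1 = b""
            return "signature_failure"
        # The single commit: nothing above this line touched the pointer.
        self.active_bank = 1
        return "committed"
```

Every early return leaves `active_bank` at 0, which is the whole property: there is no path through the function that makes an unverified image bootable.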
Signing — the phone is carrying executable code now
This is the step the phone-as-gateway post makes unavoidable. We already concluded that the phone is a forgeable middleman that can hand the cloud any session bytes it likes. The same phone is now handing the device a firmware image. If the device trusts what it's given, an attacker who controls the app — decompile it, re-sign it, point it at a malicious image, or just drive the BLE characteristics directly — can flash arbitrary code onto a million medical-adjacent devices. That is the worst outcome on the whole platform, by a wide margin, and it is precisely the outcome the phone's untrustworthiness invites.
The defense is the mirror image of what we did for telemetry. There, the device signs and the cloud verifies. Here, the cloud (our build infrastructure) signs and the device verifies:
- Engineering signs every image with the firmware-signing private key, which lives in an HSM in our build pipeline and is never on anyone's laptop. A leak of that key is a fleet-wide extinction event — it would let an attacker sign images every unit we ever shipped would trust — so it's guarded like the crown jewel it is.
- The matching public key is burned into every device at the factory, in the same secure element the BLE pairing already uses. Nordic's secure bootloader checks against it, in the bootloader region the OTA can never overwrite.
- The phone never touches either key. It carries a signed blob it cannot alter without invalidating a signature it has no key to recompute. It is, once again, demoted to a pipe — load-bearing for delivery, irrelevant to trust.
So the device's question isn't "did the phone give me this?" — the phone's word is worthless and we've stopped asking for it. The question is "is this image signed by us?" The signature, not the courier, is what the device trusts. A malicious image flashed by a compromised app fails the signature check and gets discarded at step 3 above, exactly like a corrupt one. The phone can refuse to deliver an update, or deliver a stale one — denial of service we can live with — but it cannot make the device run code we didn't sign.
Every interruption, and what the device does about it
Because the design assumes interruption, each failure mode has a defined, boring outcome — boring is the goal:
- User closes the app mid-transfer. Phone stops. Partial image sits in Bank 1, inert. Next launch, the transfer resumes from the persisted cursor. No damage.
- BLE drops out of range. Identical. The cursor survives; the next in-range window picks up where it left off.
- Phone battery dies. Identical again. The device doesn't even know the phone is gone until the next connection.
- Device battery dies mid-transfer. The device's RAM is lost, but Bank 0 is untouched, so on next charge it boots the old firmware normally. Bank 1 may hold a partial image; the device sees it's incomplete (cursor short of the manifest length) and either resumes or discards and restarts. Bank 0 was never at risk.
- SHA-256 mismatch after the last chunk. Bank 1 discarded; an event logged for the cloud to see. Device keeps running Bank 0.
- Signature check fails. Same as a hash mismatch, but it's the alarm bell, not the shrug — a signature failure on a complete image with a good hash is the fingerprint of a substitution attempt, and it's logged as a security event, not a transfer error.
- New firmware boots but hangs or crashes. The bootloader's boot-counter is the net here: the new firmware has a fixed number of boot cycles to send its "booted clean" attestation. If it doesn't check in — because it crashed, hung, or otherwise misbehaved — the bootloader swaps back to Bank 0 automatically. The user might see a flaky device for a few minutes; they never see a dead one.
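The transfer-side outcomes in that list reduce to one small decision function. Here is a minimal sketch in Python (the real thing lives in C in the bootloader), assuming a hypothetical manifest that carries a length, a SHA-256 digest, and a signature over that digest. The names `Manifest`, `Outcome`, and `check_staged_image` are mine for illustration, and `verify_signature` is a stand-in for whatever check the secure bootloader performs against the factory-burned public key.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RESUME = "resume"                  # partial image: keep the cursor, wait for more chunks
    DISCARD_CORRUPT = "discard"        # hash mismatch: a transfer error, drop Bank 1
    SECURITY_EVENT = "security_event"  # good hash, bad signature: possible substitution
    COMMIT = "commit"                  # verified: safe to flip the boot flag to Bank 1


@dataclass(frozen=True)
class Manifest:
    # Hypothetical manifest fields, for illustration only.
    length: int
    sha256: bytes
    signature: bytes


def check_staged_image(
    bank1: bytes,
    cursor: int,
    manifest: Manifest,
    verify_signature: Callable[[bytes, bytes], bool],
) -> Outcome:
    """Decide what to do with whatever sits in Bank 1. Bank 0 is never touched."""
    if cursor < manifest.length:
        return Outcome.RESUME
    digest = hashlib.sha256(bank1[: manifest.length]).digest()
    if digest != manifest.sha256:
        return Outcome.DISCARD_CORRUPT
    # Assumption: the signature covers the digest, not the raw image.
    if not verify_signature(digest, manifest.signature):
        return Outcome.SECURITY_EVENT
    return Outcome.COMMIT
```

The ordering matters: the hash is checked before the signature so that a signature failure always means "complete image, good hash, wrong signer", which is the case worth logging as a security event.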
The boot-counter failsafe deserves emphasis because it catches the class of failure no transfer check can: an image that arrives perfectly, verifies perfectly, and is broken anyway. A clean signature says the code is authentically ours; it says nothing about whether the code works. Those are different guarantees, and you need a separate mechanism for each.
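As a rough model of that contract, here is a Python sketch of the boot-counter logic. The budget of three boots is an assumption (the real number is a bootloader constant I'm not quoting here), and `BootState`, `commit`, `on_boot`, and `attest_booted_clean` are illustrative names, not the bootloader's.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOT_ATTEMPTS = 3  # assumed budget, for illustration


@dataclass
class BootState:
    active_bank: int = 0                # known-good bank
    pending_bank: Optional[int] = None  # newly committed bank, still on trial
    attempts: int = 0                   # boots spent on the trial bank


def commit(state: BootState, bank: int) -> None:
    """After verify succeeds: put the new bank on trial."""
    state.pending_bank = bank
    state.attempts = 0


def on_boot(state: BootState) -> int:
    """Bootloader entry point: pick the bank to run on this boot."""
    if state.pending_bank is None:
        return state.active_bank
    if state.attempts >= MAX_BOOT_ATTEMPTS:
        # The trial image never attested: abandon it and fall back.
        state.pending_bank = None
        state.attempts = 0
        return state.active_bank
    state.attempts += 1
    return state.pending_bank


def attest_booted_clean(state: BootState) -> None:
    """Called by healthy new firmware: make the trial bank permanent."""
    if state.pending_bank is not None:
        state.active_bank = state.pending_bank
        state.pending_bank = None
        state.attempts = 0
```

Note what the model makes obvious: once `attest_booted_clean` has run, the counter is out of the picture. Anything that goes wrong after attestation needs a different net.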
The near-disaster
We had one genuine scare last month, and it's the most instructive thing that's happened on this platform, so I'll tell it straight.
We shipped an OTA to about 2% of the fleet as a canary. It booted fine in the lab. It booted fine on the canary units — at first. But it had a subtle interaction with one specific hardware revision: a sensor-calibration routine read a region of flash that was uninitialized on that rev, and the garbage it found there crashed the firmware roughly three days after boot. Not at boot — three days in. Every transfer check passed. Every signature verified. The boot-counter failsafe didn't fire, because the firmware did boot clean and did check in; it died long after the boot-counter had been satisfied and forgotten.
We caught it because the platform tracks "percent of cohort still online 72 hours post-update, faceted by hardware revision." That number started falling for the canary cohort on day three, and only for the one affected rev.
We froze the broader rollout, identified the rev, shipped a corrected image to the canary cohort, and the bootloader rolled back the units that had already crashed. Net damage: a few thousand units flaky for a day, zero permanent bricks.
The lesson is uncomfortable: a transfer that completes successfully and bricks the device three days later is a worse failure than one that never starts — because everything upstream reports green. Transfer success, hash match, signature valid, boot-counter satisfied: all true, all useless against a time-delayed crash. The only thing that caught it was measuring fleet health over time, not update success at the moment of update. That dashboard is the single most valuable piece of operational tooling we've built, and it's the kind of thing that's obvious only after it's saved you once.
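The metric itself is not complicated, which is part of the point. A sketch of "percent of cohort still online 72 hours post-update, faceted by hardware revision" in Python, under assumptions I'm making up for the example: a device counts as still online if it has checked in at any point from 24 hours before the 72-hour mark onward (a grace window, since these devices only sync when a phone is nearby), and the `UpdatedDevice` record shape is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class UpdatedDevice:
    hardware_rev: str
    updated_at: datetime
    last_seen: Optional[datetime]  # most recent check-in of any kind


def percent_online_by_rev(
    devices: Iterable[UpdatedDevice],
    now: datetime,
    window: timedelta = timedelta(hours=72),
    grace: timedelta = timedelta(hours=24),  # assumed tolerance for sparse syncs
) -> Dict[str, float]:
    eligible: Dict[str, int] = defaultdict(int)
    online: Dict[str, int] = defaultdict(int)
    for d in devices:
        mark = d.updated_at + window
        if now < mark:
            continue  # too early to judge this device
        eligible[d.hardware_rev] += 1
        if d.last_seen is not None and d.last_seen >= mark - grace:
            online[d.hardware_rev] += 1
    return {rev: 100.0 * online[rev] / count for rev, count in eligible.items()}
```

The facet is what does the work. A fleet-wide average would have diluted one bad hardware revision into noise; per-revision numbers let a single rev fall off a cliff visibly.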
The phone-as-gateway penalty, named honestly
Everything here is harder than it would be on a device with its own internet radio. AWS shipped IoT Jobs for exactly this — OTA orchestration with cohorts, staged rollouts, retries, monitoring, and, as of late last year, code-signed jobs so the device verifies the file before it runs it. If our device could open a TLS socket to IoT Core, I'd use Jobs and delete most of what we wrote. (Azure IoT Hub has device twins and its own update story; either way, a directly connected device.)
Ours can't. The nearest IP-capable thing in its world is the phone, and routing Jobs through the phone just puts the untrusted middleman back in the trust path — the exact thing we spent the last post eliminating. So we built our own: roughly 18 engineer-months across firmware, mobile, and platform. The payoff is a system that has pushed firmware to about a million units with low-double-digit permanent bricks — and most of those weren't our fault, they were clinical-pilot environments where adjacent equipment was stomping on the 2.4 GHz band hard enough to kill the BLE link past any retry budget.
What I borrowed from IoT Jobs without using it: the shape. Versioned signed artifacts, cohort rollouts, per-device attestation, server-side health tracking. The wire protocol isn't theirs and the orchestration is ours, but the model is the one a managed service would have enforced — pushed up a layer to where, for a phone-gateway product, it has to live anyway. If we ever ship a unit with Wi-Fi on board, the device connects to IoT Core and most of this collapses into configuration.
What I'd tell a team
Four principles, earned the hard way:
- The bootloader is the load-bearing piece — never let an OTA touch it. Factory-program it, treat it as the immutable fallback that survives every other firmware bug, and resist every clever argument for making it field-updatable. The thing that rescues you from a bad update cannot itself be delivered by an update.
- Verify, then commit — and make the commit a single atomic flag flip. Stage into a second bank, hash it, check the signature, and only then point the device at it. The running firmware is never disturbed until the new image has earned it. That ordering, not luck, is what makes a power loss mid-write a non-event.
- The device verifies the signature, not the courier. The phone — or any middleman — is hostile by default. Sign images in the cloud, burn the public key into the chip, and the device trusts the math instead of the messenger. A compromised app can withhold an update; it cannot forge one.
- Measure fleet health over time, not transfer success. A clean signature proves the code is yours, not that it works. Track percent-still-online N hours post-update, faceted by hardware rev, and let a slow-burn regression show up before the full rollout does. It's the only instrument that catches the failure that reports green.
What's next
The OTA system is the most extreme example of why the firmware, hardware, and API teams have to negotiate specs together rather than over a wall — a dual-bank layout, a signing key in the bootloader, a boot-counter contract, and a manifest format all have to be agreed before a single line ships, and none of them belong to one team. The next post is the cross-functional one: how hardware specs and API contracts get hammered out across those teams, with the brush-head identification protocol as the running example.
Where hardware specs meet API contracts — the room
A connected health product is two teams on two clocks — hardware on an 18-month cadence, the API platform on a two-week one — reconciling a 20-byte BLE pipe against a data model that wants to be rich. The room where they negotiate is the most important meeting on my calendar.
There's a recurring sixty-minute meeting on my calendar that I'd defend before almost any other. Four people, one from each engineering discipline that has to ship a connected-health feature: hardware, firmware, the API platform I run, and mobile. We call it the device platform sync, and what actually happens in that room is a negotiation between a 20-byte radio and a data model that wants to be rich — between what the silicon can physically do and what the app expects to read.
Get that negotiation right and a feature ships on time across four teams moving at wildly different speeds. Get it wrong and you ship a starved device — hardware that can't feed the data model — or a starved data model — a contract that throws away most of what the sensor measured. On a connected-health platform, where a session record might one day be quoted to a clinician, neither failure is cosmetic.
Four disciplines, four clocks
Start with the structural problem, because it's the thing the room exists to manage. The same feature has to ship from four teams whose release cadences run from two weeks to two years:
- Hardware engineering. Picks the chips, lays out the PCB, designs the physical product. Once a board is in production it is frozen — the next revision is 18 to 24 months out.
- Firmware engineering. Writes the code on the device. Ships maybe twice a year, and every update has to survive being pushed through a phone over Bluetooth to a fleet that's offline 23 hours a day.
- API platform. Owns the cloud contract the device and app both speak to. Ships every two weeks.
- Mobile. Builds the iOS and Android apps. Two-week cadence, gated by app-store review.
A feature only ships when all four agree on what it does. The disagreements are never about effort or intent; they're about contracts — the byte format of a sensor reading, the UUID of a BLE characteristic, the JSON shape the app deserializes. And the contracts are where a slow physical reality collides with a fast software one.
The constraint that starts every argument: 20 bytes
Almost every reconciliation on this platform traces back to one number. The brush has no radio but Bluetooth Low Energy 4.2, and the default ATT payload is 20 bytes: the 23-byte default ATT MTU minus three bytes of header. A larger MTU can be negotiated at connection time, and 4.2's Data Length Extension raises the link-layer packet size underneath it, but a large share of the older Android phones in our install base won't do either, so 20 bytes is the number we design the contract against.
Here is what that does to a real feature. The product team wants a brushing-session record that carries, per session:
- coverage by mouth quadrant (where you brushed, where you missed),
- a pressure track (were you scrubbing too hard, and when),
- duration, motion summary, and the head ID of the brush head in use.
Modeled the way the API platform wants to model it — the way you'd model it if the device handed you JSON — that's a fat document. Streamed over a 20-byte notify pipe, it's hundreds of packets per two-minute session, every one of them costing radio-on time against a small battery, every one a chance for the phone to miss a notification.
So the room makes a trade, and both sides give something up.
Firmware gives up self-describing payloads. The device does not send JSON. It sends a fixed-layout binary frame — bit-packed, no field names, no delimiters. Quadrant coverage is a byte of flags. Pressure is a small array of decimated samples, not the raw track. Everything is positional: byte 0 is the message type and version, byte 1 is a flag field, bytes 2–3 are a duration, and so on. Twenty bytes buys you a surprising amount once you stop spending them on the names of things.
The API platform gives up "the device sends me my contract." The wire format and the app-facing contract are now two different things, and the platform owns the seam between them. The device speaks compact binary; the cloud expands that frame into the rich, self-describing JSON the mobile app actually reads. The app never sees a bit-packed byte. The device never sees a field name. The expansion logic — the canonical map from frame layout vN to JSON schema vN — lives in exactly one place, the platform, and that turns out to be the whole game.
The blunt version of the lesson: on a BLE product the wire format is a hardware artifact and the API contract is a software artifact, and pretending they're the same thing is how you starve one side to feed the other. Keep them separate, own the mapping, and each side gets to be good at its own job.
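To make the seam concrete, here is a sketch of both sides in Python using the standard `struct` module. Only the first four bytes follow the layout described above (type and version, flags, a two-byte duration). Everything after byte 3, the nibble split in byte 0, the little-endian byte order, and the JSON field names are assumptions invented for the example, not our actual frame.

```python
import struct

# Assumed v1 layout, little-endian, no padding, 20 bytes total:
#   byte 0      message type (high nibble) and layout version (low nibble)
#   byte 1      flag field
#   bytes 2-3   duration in seconds
#   byte 4      quadrant coverage flags (hypothetical position)
#   bytes 5-6   head ID (hypothetical)
#   bytes 7-8   device session counter (hypothetical)
#   bytes 9-19  eleven decimated pressure samples (hypothetical)
FRAME_V1 = struct.Struct("<BBHBHH11s")
assert FRAME_V1.size == 20

QUADRANTS = ("upper_left", "upper_right", "lower_left", "lower_right")


def pack_session_v1(duration_s, quadrants, head_id, counter, pressure, flags=0):
    """Device side: one positional 20-byte frame, no field names."""
    if len(pressure) != 11:
        raise ValueError("v1 carries exactly 11 decimated pressure samples")
    type_version = (1 << 4) | 1  # message type 1 (session), layout version 1
    coverage = sum(1 << bit for bit, name in enumerate(QUADRANTS) if name in quadrants)
    return FRAME_V1.pack(type_version, flags, duration_s, coverage,
                         head_id, counter, bytes(pressure))


def expand_session(frame):
    """Platform side: expand the binary frame into the self-describing document."""
    if len(frame) != FRAME_V1.size:
        raise ValueError(f"expected {FRAME_V1.size} bytes, got {len(frame)}")
    type_version, flags, duration_s, coverage, head_id, counter, pressure = \
        FRAME_V1.unpack(frame)
    msg_type, version = type_version >> 4, type_version & 0x0F
    if (msg_type, version) != (1, 1):
        raise ValueError(f"unsupported frame type {msg_type} v{version}")
    return {
        "schema": "session.v1",
        "flags": flags,
        "duration_seconds": duration_s,
        "coverage": {name: bool(coverage & (1 << bit))
                     for bit, name in enumerate(QUADRANTS)},
        "head_id": head_id,
        "device_counter": counter,
        "pressure_samples": list(pressure),
    }
```

In this sketch `expand_session` is the only function that knows both vocabularies. A v2 layout would arrive as a second `Struct` and a second branch on the version nibble, with the v1 branch left exactly as it is.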
The other constraints, and what each one costs the contract
The 20-byte MTU is the loud one, but three more hardware specs reach straight into the data model. The pattern is always the same — a physical limit the hardware can't move after production forces a concession in the contract:
Flash size vs. "I can always backfill." The session buffer lives in a few kilobytes of on-device flash — a ring buffer of recent sessions, because the part is small and most of it belongs to firmware. The cloud model was originally written assuming it could always replay every session a device ever recorded. It can't: brush all week with the app closed, and the oldest sessions roll off the ring before the phone ever drains them. So the contract grew a distinction it didn't have at first — a record is either complete or partial-with-known-gaps, and the device reports the lowest counter it still holds so the cloud can tell "I have everything" from "the device had already overwritten sessions 4 through 9." Hardware couldn't grow the flash on a shipped unit; the API gave up the fiction of total recall and learned to represent a hole.
Battery budget vs. sampling rate. Every sample the sensor takes, and every minute the radio is on to sync, draws down a small cell. Mobile and product wanted the highest-resolution motion track the sensor could produce. The hardware budget said no — at that rate the published battery life misses its number, and battery life is on the box. The reconciliation happened on the device: sample high locally for the real-time in-app experience, but decimate before transmit so the synced record is a downsampled summary, not the raw track. The contract carries the summary. Product gave up server-side high-resolution analytics; the battery spec won, because a brush that dies early is a return.
Clock drift vs. "trust the timestamp." A device this small doesn't always carry a battery-backed real-time clock, so "now" on the brush is a guess between syncs. The contract can't trust the device's wall-clock time as truth; it carries the device's own monotonic counter as the ordering key and treats the timestamp as a hint to be corrected server-side. Hardware gave up nothing it could afford (an RTC is parts cost and board space); the data model absorbed the uncertainty.
There's a through-line in those constraints worth saying out loud. In every case, the hardware constraint is the fixed point and the contract is what bends. That's not a hierarchy of importance; it's a hierarchy of what can still change. The board is frozen the day it goes to production. The contract ships every two weeks. When two things have to agree and only one of them can move, the one that can move is the one that moves.
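The flash-size concession is the easiest of these to show in code. A minimal sketch, assuming the cloud remembers the highest session counter it has stored for a device and the device reports the lowest counter still in its ring buffer. The names are illustrative, and counter wraparound is deliberately ignored here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SyncStatus:
    complete: bool
    # Inclusive range of session counters the ring buffer overwrote, if any.
    missing: Optional[Tuple[int, int]]


def classify_sync(last_counter_in_cloud: int, lowest_counter_on_device: int) -> SyncStatus:
    """Tell 'I have everything' apart from 'the device already lost some sessions'."""
    next_expected = last_counter_in_cloud + 1
    if lowest_counter_on_device <= next_expected:
        return SyncStatus(complete=True, missing=None)
    return SyncStatus(complete=False,
                      missing=(next_expected, lowest_counter_on_device - 1))
```

With the cloud holding sessions up to 3 and the device reporting 10 as its oldest, `classify_sync(3, 10)` yields a partial record with sessions 4 through 9 marked as a known gap, which is the hole the contract learned to represent.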
The bridge between the two clocks: capabilities, not versions-in-lockstep
So how does a platform that ships every two weeks live on top of hardware that ships every 18 months without one constantly breaking the other? Two rules, and they're the most important decisions in the whole arrangement.
Rule one: contracts get versioned, never changed. If the session-frame layout was agreed at v1, you do not change v1. You ship v2 alongside it. Hardware in the field running v1 firmware keeps emitting the v1 frame; the cloud keeps expanding it, forever, until the v1 fleet is small enough to deprecate — which in connected hardware means five-plus years. This rule is unpopular the first time you state it and invisible after a year, because the alternative — mutating a contract the field already depends on — bricks the meaning of data from a million devices you can't recall.
Rule two: the device and cloud negotiate a capabilities set. On first connect after pairing, the device announces what it can do — session_frame_v1, pressure_track, head_id_v1. The cloud announces what it understands — which may be a superset or, mid-rollout, a slightly different set. Both sides then operate strictly on the intersection.
That intersection is the bridge. It buys three things the cadence mismatch otherwise makes impossible:
- The API can lead. The platform ships session_frame_v2 support today; no device announces it yet, so nothing uses it. The day v2 firmware reaches the field, the same fleet starts lighting up the v2 path with zero cloud deploy. New analytics, new aggregations, new dentist-portal views that need no firmware change ship at cloud speed.
- The firmware can lead too. A new hardware revision can announce a capability the cloud has never heard of. The cloud ignores unknown capabilities gracefully rather than erroring, so firmware doesn't have to wait for a coordinated cloud release to ship.
- Joint features wait for the slow team, deliberately. A feature needing both new firmware and new cloud ships when the firmware does. The platform builds its half early, feature-flagged, and the capability announcement is the flag.
Negotiating that bitmap into existence cost about six weeks of design and argument. It has paid for itself many times over, because it's the single mechanism that lets the fast clock run without dragging the slow one or getting dragged by it.
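Stripped of the wire encoding, the negotiation is a set intersection plus a preference order. A sketch in Python, using the capability names from above as plain strings. The `tilt_sensor_v1` capability in the usage note and the "newest frame version wins" rule are assumptions for illustration.

```python
from typing import FrozenSet, Iterable, Optional

# Newest first: the cloud prefers the highest frame version both sides share.
SESSION_FRAME_PREFERENCE = ("session_frame_v2", "session_frame_v1")


def negotiate(device_caps: Iterable[str], cloud_caps: Iterable[str]) -> FrozenSet[str]:
    """Both sides operate strictly on the intersection.

    Capabilities the cloud has never heard of simply fall out of the result,
    which is what 'ignore unknown capabilities gracefully' means in practice.
    """
    return frozenset(device_caps) & frozenset(cloud_caps)


def pick_session_frame(active: FrozenSet[str]) -> Optional[str]:
    """Choose the frame layout to use for this device, or None if none is shared."""
    for name in SESSION_FRAME_PREFERENCE:
        if name in active:
            return name
    return None
```

A fielded v1 device announcing `session_frame_v1`, `pressure_track`, and `head_id_v1` lands on the v1 path even though the cloud already speaks v2. A newer device that also announces `session_frame_v2` and an unknown `tilt_sensor_v1` lands on the v2 path, and the unknown capability is dropped without an error.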
Who owns the gray zone
The boundary between firmware and the API platform is full of work that belongs to no one by default. Whose code parses the head-ID bytes? Whose code computes "hours since this head was attached"? Whose code decides it's time to tell the user to replace it? Left unowned, that logic ends up smeared across firmware, cloud, and the mobile app — three codebases, three subtly different versions of the same calculation, and a bug that only appears when they disagree.
The convention I hold the room to: the layer closest to the data owns its canonical interpretation. The head-ID byte format belongs to firmware, because firmware is what physically reads the chip on the head. "Hours since attached" belongs to the cloud, because the cloud is what aggregates sessions across time. The user-facing "replace your head" decision belongs to the cloud for the push notification and to mobile for the in-app surface. Whoever owns an interpretation owns the contract that documents it, and owns the migration plan when it changes. One interpretation, one owner, one place a fix lands.
What the product manager is doing in the room
The PM in that meeting isn't there to request features. They're there as the arbiter of what the user sees, and when — the one decision the four engineering teams can't resolve among themselves because it isn't an engineering question.
The sharpest example we hit: what should happen when a brush head reports no ID — an off-brand head, or a chip that didn't read? Fail closed (no ID, don't record the session) or fail open (record it, mark the head unknown)? That's not firmware's call or the platform's; it's a product tradeoff about real users. We chose fail open, mark unknown — people do brush with off-brand heads, and refusing to count those sessions is a worse experience than counting them imperfectly. With a senior PM in the room that's a five-minute decision. Without one it's a five-week stall while four teams guess at a question that was never theirs to answer.
What it cost me to learn the seam matters
I'll name the mistake, because it cost a firmware release we couldn't take back for months. Early on, before the binary-frame-versus-JSON discipline was settled, we let a single field get defined twice — the head-replacement threshold lived in a firmware constant and in a cloud config, because at the time it was easier to ship that way than to decide who owned it.
They drifted. Firmware on one product line said a given head type was good for 90 days; the cloud, updated later with a revised recommendation, said 100. The app showed one number from a cached firmware value on one screen and the cloud's number on another, and a careful user noticed their brush head was apparently due for replacement and not-yet-due at the same time.
For a body-care product that's an embarrassing inconsistency. For the regulated parts of this platform, a number that means two things at once is the kind of defect that ends up in front of an auditor.
The cloud side we fixed in a two-week sprint. The firmware constant was baked into units already in the field — we couldn't correct it until the next OTA, and we couldn't ship that OTA early just for this. So the platform absorbed the fix the only way it could: it learned to override the stale firmware value, treating the field device's number as a hint and the cloud's as canonical. Which is exactly the closest-layer-owns-the-data rule, learned the expensive way instead of agreed up front. The lesson I'd hand the next platform lead: a value that lives in two layers will drift, and the layer you can't update on demand is the one that will be wrong. Decide the owner before you ship, not after the field disagrees with itself.
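The override rule is small enough to sketch. This is a hypothetical shape in Python, not our code: the head type name `standard`, the config table, and the fallback to the firmware value when the cloud has no entry are all assumptions. The 90 and 100 are the two numbers from the incident above.

```python
from typing import NamedTuple, Optional

# Hypothetical cloud-side config: the canonical replacement threshold per head type.
CLOUD_THRESHOLD_DAYS = {"standard": 100}


class Threshold(NamedTuple):
    days: int
    source: str    # "cloud" or "firmware_hint"
    drifted: bool  # True when the device reported a value the cloud disagrees with


def resolve_threshold(head_type: str, firmware_hint_days: Optional[int]) -> Threshold:
    """Cloud value is canonical; the firmware constant is only ever a hint."""
    canonical = CLOUD_THRESHOLD_DAYS.get(head_type)
    if canonical is None:
        if firmware_hint_days is None:
            raise LookupError(f"no threshold known for head type {head_type!r}")
        return Threshold(firmware_hint_days, "firmware_hint", False)
    drifted = firmware_hint_days is not None and firmware_hint_days != canonical
    return Threshold(canonical, "cloud", drifted)
```

Surfacing `drifted` rather than silently overriding is a design choice worth keeping: it turns "the field disagrees with itself" into something you can count before a user notices it.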
What I want to carry forward
Four principles from running the room where the silicon meets the schema:
- Keep the wire format and the API contract separate, and own the seam. The device's job is to fit through the radio; the contract's job is to be good to consume. One team owning the expansion between them is what lets each side optimize for its own constraint instead of compromising both.
- In every spec-versus-contract fight, the frozen thing wins and the shippable thing bends. The board is fixed at production. The contract ships every two weeks. Design the contract to absorb the hardware's limits, not the other way around.
- Capabilities, not lockstep versions. A negotiated capability intersection is the only mechanism I've found that lets a two-week cloud and an 18-month fleet each move at their own speed without breaking each other.
- Closest layer to the data owns the interpretation. Don't let the meaning of a datum live in three codebases. It will drift, and the copy you can't hot-fix is the one that will be wrong in the field.
The retrospective that closes this series steps back from the individual decisions to the whole arc — what the platform got right over two years, what it got wrong, and what I'd undo if I were starting it again.
Two years on medical IoT — the platform retrospective
September 2017 to September 2019 — timeline of building the API platform behind a connected-health portfolio, things I got right, things I got wrong.
Two years ago this week I took a role I didn't fully understand the shape of: leading the API platform behind a connected-health portfolio. The job title said "platform." The org chart said "cloud team." What it turned out to be was the thing that holds a connected hardware product together for its entire life — the layer the toothbrush phones home to, the place every brushing session lands, the contract four other teams negotiate against. It was my first time owning that layer end to end, and I'm leaving it next week.
So this is the post I'd want to read if I were about to take the same job. Not a victory lap — a ledger. Two years, eight starting APIs, one platform, roughly a million devices on it by the time I'm walking out the door. What the architecture got right, what it got wrong, what I'd undo, and the things I was sure were mistakes that turned out fine. The one sentence I keep coming back to: the boundary between "platform engineering" and "IoT engineering" is mostly fictional, and I spent two years finding that out the long way.
The two-year arc
It's worth laying the whole thing out on one line first, because the shape of the arc is the argument. Thirteen of the twenty-four months were platform plumbing that shipped no new customer feature. That was the bet, and everything good downstream paid out of it.
Q4 2017 — diagnosis. Eight separate device APIs in production. Five had their own definition of "user." Two were formally deprecated and still serving live traffic. My first quarter was spent reading code I hadn't written and writing the memo that said, out loud, this is not a portfolio, it's eight products wearing a trench coat. The hardest part of the quarter wasn't the diagnosis; it was making the consolidation case to product leadership three separate times before it stuck.
Q1 2018 — privacy classification. Four hours in a conference room with the privacy office and a printout of the device payload, one field at a time, produced the three-tier data model. I went in thinking I'd get a yes/no per field and came out understanding that what regulates a datum is the claim and the join, not the bytes. That memo set the architecture for everything after it — every storage decision downstream inherited the rule "default to de-identified, promote only through a logged join."
Q2 2018 — the entity domain model. A six-week design exercise across every product line, ending in five entities: Account, Device, Consumable, Session, Event. Five entities that covered every device line we had and the ones the roadmap was adding. The adult-brush migration started the same quarter.
Q3 2018 — the auth model. Designed and shipped phone-as-gateway: per-device signing keys, BLE bonding for device-to-phone trust, OAuth for human-to-cloud, and a cloud that verifies what the phone can only carry. A BLE-only device can't authenticate to the cloud directly, so the whole scheme is about making a forgeable middleman's forgeries useless. This was the operational spine.
Q4 2018 — the first line on the new platform. The adult brush, biggest install base, migrated first. Fourteen weeks of strangler-fig work — dual-write to old and new, nightly reconciliation, thirty clean days, then cut reads over to v2.
Q1 2019 — OTA goes live. Shipped the over-the-air firmware pipeline — signed images pushed through the phone, dual-bank flash, verify-then-commit. This is the quarter of the canary near-disaster, which I'll come back to, because it's the best argument in here for a thing I almost cut.
Q2 2019 — the rest of the fleet. Kids' brush and interdental migrated, both faster than the first because we'd learned. The dentist-portal feature started — a thing that would have been a six-month integration on the old architecture and was a one-week query against the new one.
Q3 2019 — maturity, and the door. The weekly device-platform sync had been running a year, and the cadence mismatch between an 18-month hardware clock and a two-week cloud clock had stopped being a chronic source of pain. Roughly a million devices on the new platform. A version of this architecture is still running. And I started planning my exit, which is the most honest signal I can give that the platform no longer needed me to hold it up.
What I got right
Consolidating the entity domain model before shipping a single feature. This was the bet the whole arc turned on, and it was deeply unpopular at the time. Thirteen months of platform work, no new customer-visible feature, while product leadership watched the roadmap sit still. I made the case three times before it took. The reframe that finally landed wasn't "this is good engineering hygiene" — nobody funds hygiene — it was "the dentist portal is a six-month integration today and a one-week query the day this is done." Put a feature they wanted on the other side of the bet and the bet sells itself. It paid for itself inside a year and compounded after. The teams that win on platform investment are the ones that take the unsexy bet early, and the only way to get the org to take it is to name the sexy thing it unlocks.
Forcing OTA to be production-grade before it scaled — the canary near-disaster. This is the one I'd point to first if someone asked what discipline bought us. In Q4 2018, planning the first big rollout, there was real pressure to ship OTA at half-quality — transfer the image, flash it, done — and bolt on the safety later. We didn't. We built the dual-bank verify-then-commit, the boot-counter rollback, and a post-update fleet-health dashboard that watched devices after they'd taken the update, not just whether the bytes arrived.
Q1 2019, first real canary cohort, the dashboard lit up: devices were taking the update, reporting transfer success, and then going quiet over the next few hours. A firmware bug that only manifested under a specific post-boot condition the bench tests never hit. Because we were monitoring fleet health and not just transfer success, we caught it at a few hundred devices and halted the rollout. The version of this story where we shipped "transfer success = done" is the version where we push that image to the whole fleet over a weekend and brick twenty thousand toothbrushes in people's bathrooms. The safety work I almost cut is the only reason that sentence is hypothetical.
Bonding trust to a physical event. Re-pairing a device to a new account required a physical button-press on the device itself. The customer-experience team hated it — they wanted seamless, tap-to-transfer ownership — and they had a real point about the friction. I held the line anyway, because the alternative is a remote re-bond path, and a remote re-bond path is an account-takeover vector you ship to a million homes. No security audit in two years ever turned up a remote-rebond surface, because there wasn't one to find. Friction in the right place is a feature.
Treating the phone as a flaky courier, not a trusted client. Sign on the device, verify in the cloud, trust the phone for nothing load-bearing. The phone is a thing we shipped to an app store, running on hardware we don't control, that an attacker can decompile — so the whole auth model is built to make its forgeries useless rather than to trust it not to forge. This is the single principle I've carried, unchanged, onto every connected-product platform I've touched since. It travels because it's not about phones; it's about never putting trust in a hop you don't control.
What I'd undo
There's a through-line in this column I didn't see until I wrote it down: every one of these is a version of I wasn't in the room early enough. The mistakes weren't bad calls. They were calls I didn't get to make because I showed up after they were already made.
Letting hardware design the head-ID byte format with no API input. The byte format the chip used to identify a brush head was settled by the hardware team in mid-2017, before I started — frozen into silicon by the time I read it. When the platform went to model the Consumable entity in Q2 2018, the format fought us: no embedded version number in the head ID, no manufacturer field, no lot code. All things cloud-side analytics wanted and couldn't have, because the bytes were already shipping in the field and you can't change a contract a million units depend on. We wrote workarounds — a side table, an inference heuristic, two engineer-months of it. Two more bytes in that frame in 2017 would have erased all of it. The lesson is blunt and I've held to it since: be in the hardware spec meeting from week one, because the cheapest field in the world is the one you ask for before the board is laid out, and the most expensive is the one you wish you'd asked for after it ships.
Treating the smaller device lines as "later" for longer than I should have. I let the kids'-brush and interdental migrations drift into Q2 2019 when I could have pulled them into Q1. The reasoning felt sound — smaller install base, the platform work could "wait," spend the capacity on the big line. What actually happened is that the longer a line sat on its own old API, the more its own little customizations accreted, and the more there was to reconcile when we finally migrated it. Small systems don't stay small and clean while they wait. They grow their own weight. Migrate the cheap ones while they're still cheap, before drift makes them expensive — the opposite of the instinct to do the big valuable one first and mop up later.
Building OTA failure telemetry only just in time instead of early. The fleet-health dashboard that caught the canary brick was built in Q4 2018, weeks before the first rollout needed it. It worked — but if I'd built it in Q2 2018 I'd have had two quarters of baseline behavior to compare the canary against, and the anomaly would have been even louder and earlier. It wouldn't have changed the outcome that time. But the principle generalizes and I underweighted it: the only telemetry that helps is the telemetry you were already collecting before the event you need to detect. You cannot instrument a fire while it's burning. The dashboard you stand up the week you need it is the dashboard with no normal to measure against.
What I thought were mistakes that turned out fine
These are the calls I second-guessed at the time, braced to regret, and didn't. Worth naming, because "the conventional wisdom said X and we did Y and Y was right for us" is its own kind of lesson — the one about knowing your actual workload instead of the workload in the conference talk.
Picking Postgres over DynamoDB. A connected-product platform with a million devices doing a few sessions a day each: the 2018 conference-circuit answer was DynamoDB, full stop, NoSQL-at-scale, relational-won't-keep-up. We put the platform on Postgres on RDS instead, and I expected to be writing the "why we migrated off Postgres" post within eighteen months. Two reasons we didn't: the team's operational depth in Postgres was real and DynamoDB depth was not, and — the bigger one — our domain model was relational to the bone. Account owns Device owns Consumable, Session joins back to all three. That's a graph of foreign keys, not a bag of denormalized items, and forcing it into single-table DynamoDB would have meant fighting the data's actual shape to satisfy a scaling story we hadn't yet hit.
It scaled fine. Our write volume and access patterns sat comfortably inside what one well-tuned RDS instance handles. The honest read in hindsight: for a pure telemetry firehose I'd reach for a different store — high-volume append-only time-series is exactly DynamoDB's wheelhouse, and that's a tradeoff I've written about since — but our workload was a relational entity model with a telemetry side, not a telemetry firehose with some metadata, and we picked for the workload we had rather than the one the talks were about.
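The "graph of foreign keys" shape can be sketched in a few tables. This is a minimal illustration, not the platform's real schema: it uses Python's built-in `sqlite3` as a stand-in for Postgres, and every table and column name beyond Account, Device, Consumable and Session is an assumption.

```python
import sqlite3

# Hypothetical schema: Account owns Device owns Consumable,
# and a session row joins back to all three.
SCHEMA = """
CREATE TABLE account (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE device (
    id         INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    model      TEXT NOT NULL
);
CREATE TABLE consumable (
    id        INTEGER PRIMARY KEY,
    device_id INTEGER NOT NULL REFERENCES device(id),
    head_id   BLOB NOT NULL
);
CREATE TABLE brushing_session (
    id            INTEGER PRIMARY KEY,
    account_id    INTEGER NOT NULL REFERENCES account(id),
    device_id     INTEGER NOT NULL REFERENCES device(id),
    consumable_id INTEGER NOT NULL REFERENCES consumable(id),
    started_at    TEXT NOT NULL,
    duration_s    INTEGER NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
    conn.executescript(SCHEMA)
    return conn


def sessions_per_account(conn: sqlite3.Connection) -> list:
    """Count sessions per account with a plain relational join."""
    return conn.execute(
        """
        SELECT a.email, COUNT(s.id)
        FROM account a
        LEFT JOIN brushing_session s ON s.account_id = a.id
        GROUP BY a.id
        ORDER BY a.email
        """
    ).fetchall()
```

Questions like "sessions per account" or "sessions per brush head" fall out as joins here, which is the property a single-table design would have had to rebuild by hand.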
Home-grown REST ingestion instead of a managed device-cloud. The other call I braced to regret. The managed IoT-cloud option on the table assumed devices speak MQTT directly to a broker, and ours couldn't. Our device's only radio was BLE; every byte reached the cloud inside an HTTPS request made by the customer's phone, not the device. A device-direct broker model has no place for a phone-as-gateway topology. So we built our own ingestion: REST with idempotency keys for the late-and-duplicated reality of phone-relayed uploads, per-device signing so the cloud could verify what the phone merely carried, and append-only events. For a BLE-only fleet in 2018 it was simply the correct call: the managed option didn't fit the topology. It was never a case of "we preferred to build."
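The three ingestion properties named above (idempotency keys, per-device signing, append-only events) can be sketched together in a few lines. This is an in-memory illustration under stated assumptions, not the production service: the HMAC-SHA256 scheme with a symmetric per-device key, the status codes, and the names `IngestionStore` and `sign` are all hypothetical.

```python
import hashlib
import hmac


def sign(device_key: bytes, payload: bytes) -> str:
    """What the device would compute before handing bytes to the phone (assumed scheme)."""
    return hmac.new(device_key, payload, hashlib.sha256).hexdigest()


class IngestionStore:
    """In-memory stand-in for the ingestion endpoint and its event log."""

    def __init__(self, device_keys: dict):
        self._device_keys = device_keys  # device_id -> secret key
        self._events = []                # append-only: never updated or deleted
        self._seen = {}                  # scoped idempotency key -> event index

    @property
    def events(self) -> tuple:
        return tuple(self._events)

    def ingest(self, device_id: str, idempotency_key: str,
               payload: bytes, signature_hex: str) -> tuple:
        key = self._device_keys.get(device_id)
        if key is None:
            return 401, "unknown device"
        # Verify the device's signature; the phone only relays the bytes.
        expected = sign(key, payload)
        if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
            return 401, "bad signature"
        # A retried or duplicated upload is acknowledged without a second write.
        scoped = f"{device_id}:{idempotency_key}"
        if scoped in self._seen:
            return 200, "duplicate"
        self._seen[scoped] = len(self._events)
        self._events.append({"device_id": device_id, "payload": payload})
        return 201, "stored"
```

A phone that uploads the same session twice after a flaky connection gets `201` and then `200`, and the log holds one event. Scoping the idempotency key by device is a design choice in this sketch, so two devices can reuse the same key without colliding.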
The wrinkle that makes this a "turned out fine" rather than a "got right" is the part I couldn't have known: the home-grown stack was correct for its era and would be the wrong call in a later one. The day a connected device ships with its own WiFi, it can speak to a managed broker directly, the phone stops being load-bearing, and rolling your own ingestion goes from necessary to indulgent. I didn't make that newer call here — different team, different era, different radios — but the medical-IoT stack quietly taught me to date my architecture decisions. The right answer in 2018 and the right answer later aren't the same answer, and a decision that doesn't carry its own expiration date is a decision you'll defend past the point it's true.
The three things this arc taught me
Strip away the toothbrushes and the HIPAA memos and the byte-packed BLE frames, and two years left me with three convictions I haven't had to revise since.
One: an API platform and an IoT platform are the same thing wearing two name tags. I took this job thinking they were different disciplines and spent two years discovering the overlap is nearly total. Build an API platform right — versioned contracts you never mutate, append-only events, per-device attestation, a clean entity model underneath — and you have already built an IoT platform. The "IoT" framing adds a transport layer and a marketing budget. Everything load-bearing underneath is the same platform engineering it always was. I stopped treating "IoT" as a separate skill the day this clicked.
Two: the hardware constraint is the product constraint, not a limitation to apologize for. A device with no WiFi isn't a device missing a feature; it's a fundamentally different product with a different platform architecture. Our entire phone-as-gateway auth model, our whole ingestion design, the 20-byte BLE frame discipline — all of it falls directly out of "the radio is BLE and the battery is small." The teams that struggle treat that as a temporary annoyance to be engineered around. The teams that ship treat it as the first design input. The constraint isn't in the way of the architecture. The constraint is the architecture.
Three: the platform compounds; the hardware doesn't. This is the one I'd staple to the whole series. Every connected product you ship gets a fresh PCB, a fresh BOM, a fresh manufacturing line, a fresh certification. The hardware investment resets to zero with every device, and you pay for it again, in full, every time. The platform doesn't reset. It's there when the next device shows up. Invest in it and each new product is cheaper than the last, because it inherits the entity model, the auth, the OTA pipeline, the telemetry. Skip the investment and each new product is more expensive than the last, because it's another bespoke integration onto a pile of bespoke integrations, which is exactly the eight-API mess I walked into. The dentist portal going from a six-month integration to a one-week query is this curve made visible.
What this set up for me
I didn't know it as I walked out the door, but everything in this notebook is the first draft of a playbook I'd run again. Some years on, I came back to connected products from a different seat: my own engineering team, owning the hardware and the platform this time, in an era where the device carried its own radio and a managed device-cloud was the obvious default rather than a topology mismatch. The second run is its own series: the same arc (entity model, OTA, fleet identity, operational telemetry) with a newer toolchain and a wider scope. The longer retrospective across both closes the loop on what actually compounded from one era to the next, and the honest answer is: the principles did, the tooling didn't.
The medical-IoT years were the first draft. Cutting my teeth, literally — on a product you keep in a cup by the sink.
So the one thing I'd tell anyone starting on the platform side of a connected hardware product: build the architecture that survives the third device line, not the one that ships the first. The first device makes the demo. The third device is where a platform either pays you back or sends you the bill. Build for the third one. The work compounds — that's the entire point, and it's the only reason any of these thirteen unglamorous months were worth it.