What good IoT observability looks like in CloudWatch

by Luke Angel
#iot #aws #cloudwatch #observability #ops

We've been running our connected-product fleet in production for about six months. The first incident, predictably, was an observability incident — we couldn't tell whether 200 devices had stopped talking because the devices were broken, the network was broken, the cloud was broken, or our parsing of the data was broken. It took us a full day to figure out which.

This is the CloudWatch setup we'd have built on day one if we'd known better.

(Previously, on v1, we built our own dashboards from scratch in 2018. The IoT-native cloud metrics weren't mature yet, and we ended up running everything off custom metrics emitted from serverless functions. On v2 the native side is much better. The setup below would have saved us about two engineer-months on the v1 build; today it takes about one engineer-week.)

The whole thing hangs off one decision made at the ingest Lambda: every metric and every log line carries the device's thing_name as a dimension. Get that wiring right and the dashboards, the alarms, and the Logs Insights queries all fall out of it.
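One way to get that wiring is CloudWatch's Embedded Metric Format, where a single JSON log line doubles as a metric datapoint. The sketch below is an assumption about how an ingest Lambda could do it, not our actual code: the `Fleet/Ingest` namespace, the `publish_latency_ms` metric name, and the `puck-0042` device are all made up for illustration.

```python
import json
import time


def emf_record(thing_name, metric_name, value, unit="Milliseconds",
               namespace="Fleet/Ingest", timestamp_ms=None):
    """Build one Embedded Metric Format record with thing_name as a dimension.

    The namespace and metric names are hypothetical placeholders.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set: every datapoint is keyed by device.
                "Dimensions": [["thing_name"]],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # Top-level fields: the dimension value and the metric value.
        "thing_name": thing_name,
        metric_name: value,
    }


record = emf_record("puck-0042", "publish_latency_ms", 183.0,
                    timestamp_ms=1700000000000)
line = json.dumps(record)
print(line)  # inside Lambda, stdout lands in CloudWatch Logs
```

Because the same line is both a log record and a metric, the dashboards, alarms, and Logs Insights queries all see the same thing_name without a second code path.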

Telemetry-to-CloudWatch pipeline for a connected-product fleet: device pucks publish over MQTT to AWS IoT Core; an IoT Rule routes every message to an ingest Lambda that stamps a server timestamp and tags the device's thing_name; the Lambda emits per-device p50/p95/p99 latency to CloudWatch Metrics and structured records to CloudWatch Logs; metrics feed the dashboards and alarms, logs feed scheduled Insights queries that post hourly to a #fleet-errors Slack channel. A side loop shows a fleet-diff Lambda running every five minutes and a last_seen_at attribute written back onto each IoT Thing.

The three dashboards

Dashboard one: fleet health, one row per device class.

Five metrics, plotted as time series across the last seven days:

  • Connected device count. A BinaryStateValue metric we emit when an MQTT connect/disconnect happens on IoT Core, summed across the fleet. Sudden drops here are the first thing to look at in any incident.
  • Messages per minute. Volume of inbound publishes, taken from the IoT Core metrics in CloudWatch (PublishIn.Success in the AWS/IoT namespace). If devices are connected but not publishing, the firmware is wedged.
  • Per-device p50 / p95 / p99 publish-to-cloud latency. From our IoT rule pipeline — we stamp the message with a server timestamp on arrival, compare to the device-side timestamp, emit the delta as a custom metric. p99 tells you tail behavior; p50 alone hides everything.
  • MQTT auth failures. Suspicious if they spike. Either we have a cert-rotation problem or somebody's trying to talk to our endpoint with a stolen credential.
  • Lambda error rate on the ingest function. If devices are happy but we're 5xx'ing on ingest, we're losing data.
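The latency metric in the list above is just a subtraction, and the percentile point is easy to see with made-up numbers. This is a sketch under assumptions: the clamp for device clock skew is my addition, the sample latencies are invented, and in practice CloudWatch computes the percentiles from the raw datapoints rather than the Lambda doing it.

```python
import math


def publish_latency_ms(device_ts_ms, server_ts_ms):
    """Delta between the device-side stamp and the server arrival stamp."""
    # Assumption: clamp at zero so a device clock running ahead
    # does not produce a negative latency.
    return max(0, server_ts_ms - device_ts_ms)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented data: 97 healthy publishes and three very slow ones.
latencies = [120] * 97 + [4000, 5000, 9000]
summary = {p: percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

With this data p50 and p95 both come out at 120 ms while p99 is 5000 ms, which is the "p50 alone hides everything" problem in miniature.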

Dashboard one is the only thing the on-call rotation looks at by default. Everything else is for diagnosis after that dashboard says something's wrong.

Dashboard two: per-device drill-down.

When dashboard one says "something's wrong," dashboard two is how you find the which. CloudWatch Contributor Insights with a rule that ranks thing_name by error rate. Top ten, last hour. Click one, jump to that device's logs and metrics.

We attach thing_name as a dimension on everything our ingest Lambda emits, so every metric we publish carries the device identity. This is the one decision that paid off most: every metric is per-device or per-job-site, never just an aggregate.
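For reference, a Contributor Insights rule of this shape can be written as a small JSON document. The sketch below assumes the ingest Lambda writes JSON log lines with `thing_name` and `error_code` fields, and the log group name is a placeholder. Note that it ranks devices by error count rather than a true rate.

```python
import json


def contributor_rule(log_group, key_field="thing_name",
                     error_field="error_code"):
    """Rule body that ranks devices by the number of log events with an error.

    log_group and both field names are assumptions for illustration.
    """
    return {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group],
        "LogFormat": "JSON",
        "Contribution": {
            # The contributor is the device.
            "Keys": [f"$.{key_field}"],
            # Only count events that actually carry an error code.
            "Filters": [{"Match": f"$.{error_field}", "IsPresent": True}],
        },
        "AggregateOn": "Count",
    }


rule_body = json.dumps(contributor_rule("/aws/lambda/fleet-ingest"))
print(rule_body)
```

The resulting string is what you would hand to CloudWatch's `put_insight_rule` as the rule definition; the top-ten and last-hour settings live on the dashboard widget, not in the rule.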

Dashboard three: pipeline health.

This one is for the engineers, not the on-call. It tracks:

  • IoT Rule SQL failures (a count that should be near zero).
  • Lambda concurrent executions and throttling.
  • DynamoDB write throttles, write latency p99.
  • Kinesis Firehose backlog (we pipe to S3 for analytics; backlog means analytics will lag).

If dashboard three is red, the infrastructure is unhealthy. If only dashboard one or two is red, the fleet is.

Incident-triage decision tree starting from something looks wrong, which dashboard went red. One branch: dashboard 1 or 2 red means the fleet is sick — device-count drops, auth failures, per-device p99 climbing — so you drill down by thing_name. The other branch: dashboard 3 red means the infrastructure is sick — IoT Rule SQL failures, Lambda throttles, DynamoDB write throttles, Firehose lag — so you fix the pipeline, not the devices. The footer reminds you to instrument errors-per-device, not errors-per-request, because you ask questions along the device dimension.

The four alarms

We have four production alarms. Anything beyond four is noise.

  1. Connected device count drops > 20% in 5 minutes. Paged. Either a cloud-side outage or a connectivity event in a region — either way, somebody needs to look right now.
  2. Ingest Lambda 5xx rate > 1% for 10 minutes. Paged. We're losing data.
  3. Per-device p99 publish-to-cloud latency > 2x baseline for 15 minutes. Slack-only, no page. Investigated the next morning.
  4. MQTT auth failures > 100 in 5 minutes. Paged. Either fleet-wide cert issue or someone's poking at our endpoint with stolen keys.
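Two of these thresholds are simple enough to sketch. The first function is the arithmetic behind alarm 1; the second builds keyword arguments for CloudWatch's `put_metric_alarm` for alarm 4. Both are illustrations under assumptions: I'm guessing that "MQTT auth failures" maps to the `Connect.AuthError` metric in the `AWS/IoT` namespace, and the alarm name and SNS topic ARN are placeholders.

```python
def device_count_dropped(previous, current, max_drop=0.20):
    """True when the connected-device count fell by more than max_drop."""
    if previous <= 0:
        return False  # nothing to compare against
    return (previous - current) / previous > max_drop


def auth_failure_alarm(topic_arn):
    """Keyword arguments for put_metric_alarm: > 100 failures in 5 minutes."""
    return {
        "AlarmName": "fleet-mqtt-auth-failures",   # hypothetical name
        "Namespace": "AWS/IoT",
        "MetricName": "Connect.AuthError",         # assumed mapping
        "Dimensions": [{"Name": "Protocol", "Value": "MQTT"}],
        "Statistic": "Sum",
        "Period": 300,                             # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 100,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means no auth failures, not a breach.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


print(device_count_dropped(1000, 790))   # 21% drop
print(auth_failure_alarm("arn:aws:sns:us-east-1:123456789012:pager")["Threshold"])
```

With a boto3 CloudWatch client this would be applied as `client.put_metric_alarm(**auth_failure_alarm(arn))`.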

Notice what's not on this list: total message volume drops, individual device offline, individual Lambda invocation errors. Those are too noisy to alarm on directly. They all show up on the dashboards; they don't fire pages.

The three dashboards and four alarms laid out together. Dashboard 1, fleet health, is what on-call watches by default: device count, messages per minute, per-device latency percentiles, MQTT auth failures, ingest error rate. Dashboard 2, per-device drill-down, uses Contributor Insights to rank thing_name by errors and jump to a device's logs. Dashboard 3, pipeline health, is for engineers: IoT Rule SQL failures, Lambda throttles, DynamoDB write throttles. If dashboard 1 or 2 is red the fleet is sick; if 3 is red the infrastructure is. The four alarms: three page (device count drops over 20% in 5 minutes, ingest 5xx over 1% for 10 minutes, MQTT auth failures over 100 in 5 minutes), one is Slack-only (p99 latency over 2x baseline for 15 minutes). Total-volume drops, a single device offline, and one Lambda error are deliberately not alarmed — too noisy to page on.

The one CloudWatch Logs Insights query

We have a saved query that I run more than anything else in the console:

fields @timestamp, thing_name, error_code, battery_pct
| filter ispresent(error_code) and error_code != ""
| stats count() as errors by thing_name, error_code
| sort errors desc
| limit 20

"For the time range in the toolbar, which devices are reporting errors, what errors, and how many?" Twenty rows of output. The answer to ninety percent of "is something wrong" questions.

Insights queries are also schedulable now (via Lambda or EventBridge), so we've got the same query running hourly and posting to a Slack channel. If a device's error count for an hour exceeds a threshold, it shows up in #fleet-errors with the thing_name, error code, and a deep link to the device's recent events.
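The hourly job's post-processing is small. The sketch below assumes the rows arrive in the shape the Logs Insights `GetQueryResults` API returns (a list of rows, each a list of field/value pairs with string values); the threshold of 25, the device names, and the message wording are invented, and the deep link is left out because its format is specific to our setup.

```python
def rows_to_dicts(results):
    """Flatten Logs Insights result rows into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def over_threshold(results, threshold):
    """Keep (thing_name, error_code) rows whose count exceeds threshold."""
    # Insights returns every value as a string, so convert before comparing.
    return [r for r in rows_to_dicts(results) if int(r["errors"]) > threshold]


def slack_text(row):
    """One line per noisy device for the Slack channel."""
    return (f"{row['thing_name']}: {row['errors']} x "
            f"{row['error_code']} in the last hour")


# Invented sample of what the saved query might return.
sample = [
    [{"field": "thing_name", "value": "puck-0042"},
     {"field": "error_code", "value": "E_SENSOR"},
     {"field": "errors", "value": "57"}],
    [{"field": "thing_name", "value": "puck-0007"},
     {"field": "error_code", "value": "E_WIFI"},
     {"field": "errors", "value": "3"}],
]
messages = [slack_text(r) for r in over_threshold(sample, threshold=25)]
print(messages)
```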

What we built ourselves that I'd recommend

Two pieces of code that paid for themselves the first month:

A "fleet diff" Lambda. Runs every five minutes. Pulls the list of currently-connected devices from IoT Core. Compares to the list of devices we expect to be online (from our customer database). Emits the diff as a metric. When 200 devices fell silent, this Lambda noticed within five minutes, instead of us noticing the next day.
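The core of the fleet diff is a set difference. A minimal sketch, assuming both lists have already been fetched (how you pull them from IoT Core and the customer database is out of scope here, and the device names are invented):

```python
def fleet_diff(expected, connected):
    """Compare who should be online against who actually is."""
    expected, connected = set(expected), set(connected)
    return {
        # Should be online but are not: the number worth emitting as a metric.
        "missing": sorted(expected - connected),
        # Online but unknown to the customer database: worth a look too.
        "unexpected": sorted(connected - expected),
    }


diff = fleet_diff(
    expected=["puck-0001", "puck-0002", "puck-0003"],
    connected=["puck-0002", "puck-0003", "puck-0009"],
)
missing_count = len(diff["missing"])
print(diff, missing_count)
```

Emitting `missing_count` every five minutes gives you a time series that jumps the moment a batch of devices goes quiet.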

A per-device "last seen" attribute. We update a last_seen_at attribute on the device's IoT Thing every time it publishes, via the IoT rule. Then a fleet indexing query against the IoT Things index gives us "devices that haven't published in N hours." Predictably useful.
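Wherever last_seen_at ends up being read from, the "haven't published in N hours" check is a cutoff comparison. A sketch with invented devices and timestamps, assuming the attribute has already been parsed into timezone-aware datetimes:

```python
from datetime import datetime, timedelta, timezone


def stale_devices(last_seen, now, max_silence_hours):
    """thing_names whose last_seen_at is older than the cutoff."""
    cutoff = now - timedelta(hours=max_silence_hours)
    return sorted(name for name, seen in last_seen.items() if seen < cutoff)


# Invented example data; "now" is pinned so the result is reproducible.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "puck-0001": now - timedelta(minutes=4),
    "puck-0002": now - timedelta(hours=7),
    "puck-0003": now - timedelta(hours=30),
}
silent = stale_devices(last_seen, now, max_silence_hours=6)
print(silent)
```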

What I'd skip

A few things I tried that didn't earn their keep:

  • X-Ray tracing on every Lambda invocation. Too noisy at fleet scale and the cost adds up. We turn it on for specific debugging sessions, not always.
  • Per-device CloudWatch Logs streams. Don't do this. CloudWatch Logs is priced per ingested GB; if you're emitting structured logs from every device every minute, you'll regret it. Aggregate at the rule layer; emit logs from the cloud side only.
  • Synthetic device pingers from another region. Tempting, but the failure mode it catches is "AWS region is broken," which CloudWatch will already tell you about. Not worth the complexity.

The bigger framing

The lesson of the six months: an IoT product is a fleet operations product, not a software product. Software products have errors per request. Fleet ops products have errors per device, per device class, per firmware version, per job site. You instrument for the dimension you'll ask questions along, and you ask questions along devices.

Six months from now I'll know whether we got the dashboards right. Six months ago, we didn't have dashboards. That's the bigger move.
