A model at the center — drawn as a small neural-net graph — fed by an operations pipeline on one side and serving a product on the other, with a dashed feedback loop returning from the product back to the pipeline.

Notebook · 14 parts · read in order

~126 min total

Building with AI/ML — Products & Operations

Bedrock went GA in the fall of 2023, and within a few months the team had a live feature running on it. Three years on it's an agent stack in production, a prompt-regression set that catches drift before customers do, and a quarterly spend audit that keeps the bill honest. The running log of getting AI/ML into real products — and building the operations to run it without lighting money on fire.

Every team's first AI/ML project is a demo. The hard part — the part nobody live-streams — is the second mile: getting a model into a product a customer actually touches, then building the operations to keep it honest after launch.

This notebook is the running log of that second mile. It starts the week Bedrock went GA and runs through a production agent stack — Agents, then AgentCore, then Strands — alongside the unglamorous machinery that makes any of it ship: a prompt-regression test set that catches drift before a customer does, SageMaker pipelines, the Lambda-vs-Fargate-vs-ECS call, and the quarterly spend audit that keeps an AI bill from quietly tripling.

Two threads run through all of it: AI/ML in the product — what the customer sees — and AI/ML in the operations — the pipelines, tests, and cost discipline that keep it running. Neither ships without the other. This is what I learned building both.

inside this notebook —

01 → 14

A column of input-and-expected-output pairs feeding a comparison check, where a fresh model output is matched against the expected one — a regression is caught at the check before it can reach the customer on the far side.

Prompt regression — the rough first version of our test set

Nov 2023

One AWS endpoint fanning out to a row of foundation models from different providers, with a single IAM key and a private network boundary drawn around the whole exchange — one SDK, one bill, many models.

First week with Bedrock: a PM's read on the September GA

Dec 2023

A branching path forking from a single trunk into three landing pads of increasing size — a small serverless puck, a single container box, and a rack of boxes on a host — the compute choice narrowing as the workload grows.

Lambda vs Fargate vs ECS — the napkin decision tree

Feb 2024

An ML pipeline drawn as a left-to-right DAG — prep, train, and eval steps feeding a conditional gate, which either passes to a stacked model-registry card with a check seal or halts on regression — with a dashed feedback loop returning from the deployed end back to data prep.

SageMaker Pipelines — week three notes

Apr 2024

An agent orchestration loop drawn as a closed cycle — a model node thinks, reaches out to a tool, takes back an observation, and loops, with a managed boundary wrapped around the whole cycle and a knowledge store feeding it from the side.

Bedrock Agents, half a year in — the parts I actually use

Jul 2024

A multi-tenant platform diagram: separate tenant lanes feeding a shared application layer that sits on a stable abstraction above a row of interchangeable model blocks, with an evaluation gate guarding what ships.

Building a multi-tenant LLM platform, 0 → 1

Oct 2024

Three things from re:Invent that actually change my roadmap

Dec 2024

A managed retrieval pipeline drawn left to right — documents chunked, embedded into vectors, retrieved, and answered — with small warning markers sitting on the chunking, retrieval-filter, sync, and re-ingest steps where the production gotchas live.

Knowledge Bases for Bedrock in production — the gotchas list

Feb 2025

An open-source eval starter kit for product managers

Mar 2025

A stack of spend bars leaking red drops out their right edges while a magnifying glass passes over them, a green check inside the lens — the quarterly audit catching the recurring leaks.

The AWS spend audit I do every quarter

Apr 2025

Bedrock AgentCore at Summit NY — what it actually changes

Jul 2025

Three stacked layers of an agent platform — an SDK layer, a managed-runtime layer, and a row of pluggable services (memory, gateway, identity, observability) — with a column of team-shape choices on the left routing to a recommended stack on the right.

Strands + AgentCore — a year-end agent-stack inventory

Nov 2025

A central agent hub fans out to several DevOps workflow lanes; two lanes pass with green checks and two are crossed off in red and dropped — the workflows that earned their keep versus the ones shut off.

Agent-based DevOps with Q Developer — kept vs tossed

Mar 2026

Roadmap reviews — four columns stolen from a therapist, for AI work that won't sit still

Roadmap reviews when half the work is non-deterministic

Apr 2026

Start here

01 · Prompt regression — the rough first version of our test set

Part 01 of 14

Building with AI/ML — Products & Operations · part 01

Nov 15, 2023

Prompt regression — the rough first version of our test set

We don't have a name for it yet. A spreadsheet of inputs, a spreadsheet of expected outputs, and the smallest script that runs one against the other. Notes from week two.

The engineers I lead shipped a feature last quarter that uses a model behind it. Twice now we've noticed that a small change to the prompt made the outputs worse in a way nobody caught for ten days. That's not a process problem. That's a missing test bed.

I sat down this week to try to fix it. This is what week two looks like.

What we have so far

A spreadsheet. About forty rows. Each row has:

An input — a real user query we pulled from production logs (anonymized).
An expected output — the answer one of the PMs sitting in the room thought was the best version. Sometimes that's a paragraph. Sometimes it's a single sentence. Once, it's "the assistant should refuse politely."
A note about why that's the expected output — what we'd lose if the model said something different.

That's it. Forty rows in a Google Sheet. Two columns load-bearing, one column is justification.

The script

The script around it isn't much smarter. For each row, it sends the input to our model, captures the output, and dumps the pair into a second spreadsheet for a human to review. The human is me on Sundays for now.
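A minimal sketch of what a script like that could look like, assuming the sheet is exported as CSV with `input`, `expected`, and `note` columns. The column names, the `call_model` callable, and the sample row are all placeholders for illustration, not the team's actual script:

```python
import csv
import io
from typing import Callable, Iterable


def run_regression(rows: Iterable[dict], call_model: Callable[[str], str]) -> list[dict]:
    """Send each input to the model and pair the fresh output with the expected one."""
    results = []
    for row in rows:
        results.append({
            "input": row["input"],
            "expected": row["expected"],
            "actual": call_model(row["input"]),
            "note": row.get("note", ""),
        })
    return results


def write_review_sheet(results: list[dict], out_file) -> None:
    """Dump the pairs as CSV so a human can review them side by side."""
    writer = csv.DictWriter(out_file, fieldnames=["input", "expected", "actual", "note"])
    writer.writeheader()
    writer.writerows(results)


# Usage with a stand-in model; swap the lambda for a real client call.
sheet = io.StringIO(
    "input,expected,note\n"
    "What is my balance?,Your balance is shown on the Account page.,must not guess a number\n"
)
results = run_regression(csv.DictReader(sheet), call_model=lambda q: "stub answer to: " + q)
review = io.StringIO()
write_review_sheet(results, review)
```

Nothing here scores anything. It only lines up expected and actual so the Sunday review is a read-through rather than a copy-paste job.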

What it doesn't do yet:

It doesn't score anything automatically. I've read about using a stronger model to grade outputs (a paper went around earlier this year — "LLM as a judge" — that I'm still chewing on). For now, my eyes are the grader.
It doesn't fail anything in CI. If the outputs look worse, I have to notice, and "notice" is doing a lot of work in that sentence.
It doesn't have a rubric. I keep wanting to call this thing a rubric, but a rubric implies I've decided what "good" means on each axis. I haven't. The spreadsheet's "note" column is doing that job badly.

What's already useful

Two surprises after two weeks:

One, the act of writing the expected output forces an opinion. Our team had been arguing for a month about whether the assistant should hedge ("I might be wrong, but…") or commit. The argument felt unresolvable. The moment we had to write down the expected response for ten queries, we settled it in an afternoon. The expected outputs were all committed and direct. Argument over. The spreadsheet ended the meeting.

Two, the worst regressions were format regressions, not content regressions. A prompt change made the model start answering in bullet points when the downstream UI expected a paragraph. The content was fine. The shape of it broke a parser two hops away. I would not have predicted that.
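A format regression like that is cheap to check mechanically, even before any content grading exists. The sketch below is one possible heuristic, assuming "bullets versus paragraph" is the only shape distinction that matters; the function names and the regex are illustrative, not something the script already has:

```python
import re

# A line that starts with -, *, a bullet dot, or "1." / "1)" counts as a list item.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def output_shape(text: str) -> str:
    """Classify a model output as 'bullets', 'paragraph', or 'empty'."""
    if not text.strip():
        return "empty"
    return "bullets" if BULLET.search(text) else "paragraph"


def shape_regressed(expected: str, actual: str) -> bool:
    """True when the fresh output changed shape relative to the expected one."""
    return output_shape(expected) != output_shape(actual)
```

Run over every row, a check like this would flag the bullet-point change without anyone having to read the content.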

What I'm trying next week

Three things:

Tag each row with a scenario — "first-time user", "power user follow-up", "ambiguous query", "refusal-required". Right now they're undifferentiated. If we lose ground on the refusal-required rows, that's a much bigger deal than losing ground on the easy ones, and the average score is hiding it.
Add a column for what would make this answer worse — the failure mode we're trying to avoid. That's closer to a rubric than what I have, but I'm not ready to commit to four axes yet. Maybe next month.
Get one of the engineers to run the script on every prompt change before we ship. Right now it runs when I get to it on a Sunday. That doesn't scale.
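To make the "average is hiding it" point concrete, here is a small sketch of per-scenario pass rates. It assumes each reviewed row carries a `scenario` tag and a human-assigned `passed` flag; both field names and the sample data are made up for illustration:

```python
from collections import defaultdict


def pass_rates(rows: list[dict]) -> dict[str, float]:
    """Pass rate per scenario tag, plus an 'overall' entry for comparison."""
    totals, passes = defaultdict(int), defaultdict(int)
    for row in rows:
        for key in (row["scenario"], "overall"):
            totals[key] += 1
            passes[key] += 1 if row["passed"] else 0
    return {key: passes[key] / totals[key] for key in totals}


# Made-up review results: nine easy rows pass, the one refusal row fails.
rows = [{"scenario": "first-time user", "passed": True}] * 9
rows.append({"scenario": "refusal-required", "passed": False})
rates = pass_rates(rows)
```

In this made-up sample the overall rate is 0.9 while the refusal-required rate is 0.0, which is exactly the failure the untagged average would bury.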

What I still don't know

I don't know what to call this. "Tests" is wrong because the outputs aren't deterministic. "Regression suite" is half-right but oversells how rigorous it is. The paper I mentioned uses the word "evals." Maybe that's where this goes.

I also don't know what the threshold for "good enough" is supposed to be on a forty-row spreadsheet where the grader is me. Probably I need both more rows and a less-biased grader before that question has an answer.

For now, the test set exists. It already caught one regression that would have shipped. That's enough to keep going.

More next month.

Part 02 of 14

Building with AI/ML — Products & Operations · part 02

Dec 12, 2023

First week with Bedrock: a PM's read on the September GA

Amazon Bedrock went GA in September. I spent last week pointing one of our LLM features at it. Notes — what's useful, what's missing, and what re:Invent added two weeks ago.

Amazon Bedrock went generally available on September 28th. We'd been on a private preview waitlist for months. Last week I finally got the team to spend a week porting one of our LLM features off a direct API integration and onto Bedrock. Notes from that week.

What Bedrock is, in two sentences

It's a single AWS endpoint that lets you call multiple foundation models — Anthropic Claude, AI21 Jurassic-2, Cohere Command, Meta Llama 2 (the 13B and 70B chat models landed on Bedrock just last month), Stability's image models, and Amazon's own Titan family — through one SDK and one IAM model. You pay AWS, AWS pays the model providers, and your data doesn't leave your VPC unless you let it.

It's not a model. It's a model router.

The two reasons we ported

Procurement. Talking to AWS about a model is the same conversation we already have about S3 and Lambda. Talking to Anthropic about a model is a new conversation with a new contract, a new procurement review, a new security questionnaire. Our company's vendor-onboarding process for a new AI vendor is a six-week thing. Bedrock collapses that to zero.

Optionality. We've been calling Claude 2 directly through Anthropic's API. That works fine until the day a competitor's model is meaningfully better on our task and we need to swap. With Bedrock, the swap is mostly a model-ID change in our code; with a direct integration it's a quarter of work. We're not paying for optionality we won't use — Claude is still the best fit on our task right now — but we wanted the ability to swap when the landscape shifts.

What worked in week one

The SDK is calm. boto3.client('bedrock-runtime').invoke_model(...). If you've written boto3 once, you've written this once. No surprises.
IAM works the way you'd expect. Bedrock invocation is just another IAM permission. We were able to scope the policy to "only Claude 2, only from this Lambda" in about ten minutes. Doing this with a direct vendor API key would require us to write our own gateway.
PrivateLink is there. For the parts of our workload that can't make outbound calls, we can keep the whole thing inside our VPC.
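For reference, a hedged sketch of the call mentioned in the first item above. It assumes the Claude 2 model ID `anthropic.claude-v2` and the text-completions request format Claude 2 uses (a `prompt` with Human/Assistant turns and `max_tokens_to_sample`); the helper names are mine, and the client is passed in so the function can be exercised without AWS credentials:

```python
import json

MODEL_ID = "anthropic.claude-v2"  # assumption: the Claude 2 model ID on Bedrock


def build_body(question: str, max_tokens: int = 300) -> str:
    """Build the request body in Claude 2's text-completions format."""
    return json.dumps({
        "prompt": f"\n\nHuman: {question}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
    })


def ask(client, question: str) -> str:
    """Invoke the model through a bedrock-runtime client and return the completion text."""
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_body(question),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]


def make_client():
    """Create the real client; needs boto3 installed and AWS credentials configured."""
    import boto3
    return boto3.client("bedrock-runtime")
```

Swapping models is then mostly a change to `MODEL_ID` and `build_body`, since each provider's request body has its own shape.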

What didn't work

Streaming is uneven across models. Some models support streaming responses; some don't yet. Our UX relies on it, and the abstraction over models leaks here. We had to special-case behavior by model ID, which is exactly what an abstraction is supposed to prevent.
The pricing took a minute. Per-token pricing varies by model, and the dashboard for tracking spend is not as developed as the rest of AWS Cost Explorer. We're going to have to roll our own per-feature cost tracking until that catches up.
Provider parity isn't guaranteed day-zero — but it's closer than I expected. My going-in fear was that Bedrock would always trail the provider's own API by a release or two, and that a managed router means you wait for the latest model feature. The Claude 2.1 timeline argued the other way: Anthropic announced it on November 21st with a 200K context window, and it was generally available on Bedrock by the 29th — about a week, and it landed in the same re:Invent window as everything below. So the gap, at least this time, was small. I'm not treating that as a promise. The honest posture is that you can't assume same-day parity for a feature your roadmap depends on; you confirm a specific model and capability is on Bedrock before you commit to it. Bedrock earns its place when the model is good enough and the procurement/security story matters more than being first to a brand-new capability — not because the capability is necessarily late.
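On the streaming point in the first item above: one way to keep the model-ID special-casing in a single place is to ask Bedrock whether a model streams and fall back otherwise. This is a sketch, not what the team shipped. It assumes the control-plane `get_foundation_model` call reports `responseStreamingSupported`, and it takes both clients as arguments so it can be tried against fakes:

```python
def supports_streaming(bedrock, model_id: str) -> bool:
    """Ask the Bedrock control-plane client whether a model can stream responses."""
    details = bedrock.get_foundation_model(modelIdentifier=model_id)["modelDetails"]
    return bool(details.get("responseStreamingSupported", False))


def invoke(runtime, bedrock, model_id: str, body: str):
    """Yield raw response chunks, streaming when the model supports it."""
    if supports_streaming(bedrock, model_id):
        response = runtime.invoke_model_with_response_stream(modelId=model_id, body=body)
        for event in response["body"]:
            if "chunk" in event:
                yield event["chunk"]["bytes"]
    else:
        # Non-streaming fallback: the whole response arrives as one chunk.
        response = runtime.invoke_model(modelId=model_id, body=body)
        yield response["body"].read()
```

The caller always iterates over chunks, so the UI code does not need to know which path was taken. In practice the capability lookup would be cached rather than repeated per request.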

The re:Invent additions

Two weeks ago at re:Invent (Nov 27 – Dec 1), AWS dropped a pile of Bedrock additions. The headline:

Agents for Amazon Bedrock — orchestrate multi-step calls (call a tool, read a knowledge base, call the model, route the result). Went GA at the show. GA two weeks ago is not the same as battle-tested, though — a v1 agent framework still moves under your feet, and I'd rather watch other people find the sharp edges first. We'll run it through a real evaluation, not wire it into anything a customer touches yet.
Knowledge Bases for Amazon Bedrock — a managed RAG pipeline, basically. Also GA at re:Invent. Same posture: worth a serious look, but I'm not migrating our hand-built retrieval onto a two-week-old managed service on faith.
Guardrails for Amazon Bedrock — in preview, not yet generally available. Content policies that travel with the model invocation. We need this badly; doing it ourselves at the application layer is fragile. But preview means it can change, so it's not something I'll commit a launch to until it's GA.

Two things newly GA, one in preview, and a lot I'm choosing to wait on. The reality is: we're keeping our app-layer guardrails and our own retrieval for now, and putting Agents and Knowledge Bases through an honest evaluation before we let either near production.

My take after a week

Bedrock is the absence of a vendor lock-in argument. That's the whole pitch. If you're inside a company where the security team and the legal team are the gating risk on shipping LLM features, Bedrock is the answer. If you're inside a startup where the gating risk is "which model is best this week," you're still going direct to the provider.

The team I lead is the first kind of team. So we're on Bedrock now. The port took five working days end to end. I'd do it again.

I'll write a follow-up in a couple months once we've had Agents and Knowledge Bases through a real evaluation. For now: useful, boring, and exactly what the procurement conversation needed.

Part 03 of 14

Building with AI/ML — Products & Operations · part 03

Feb 06, 2024

Lambda vs Fargate vs ECS — the napkin decision tree

Four questions, in order. The right AWS compute choice usually picks itself by question two. Notes from an architecture review the team kept losing the same way.

My team has been running the same forty-minute architecture-review meeting for a year. New service, three engineers in the room, the question is Lambda, Fargate, or ECS, and we re-derive the answer from first principles every single time.

I got tired of it. So I drew a four-question decision tree on a napkin at lunch last week and made everyone agree to use it. Notes here so it lives somewhere besides the napkin, which I have lost.

The four questions, in order

1. Does this workload have to be ready in under 100 ms cold?

If yes — interactive user-facing path, real-time event processing where 200 ms tail latency tanks the experience — you can't use Lambda for the cold path. Lambda cold-starts on Node are around 200 ms; on Java without SnapStart it's a second or two. SnapStart helps for Java, but Python and .NET don't have it yet.

For the cold-tolerant workload — async jobs, scheduled tasks, webhooks, things on a queue — Lambda is the right default.

2. Will this run for more than 15 minutes at a stretch?

Lambda has a 15-minute execution cap. Has since 2018, probably always will. If the workload is a long-running batch process, ML training, video encoding, anything where the per-invocation budget is genuinely longer than 15 minutes, Lambda's out.

Fargate or ECS for those.

3. Do you need a specific runtime environment that Lambda doesn't ship?

Lambda ships a fixed set of managed runtimes on an Amazon Linux base. If the workload needs a system library Lambda doesn't have, a custom kernel module, a specific FFmpeg build, or a particular Docker base image, Lambda container images cover some of that, but the friction is enough that Fargate or ECS is usually the right move.

4. Are you running enough hours per month that "always on" is cheaper than "pay per request"?

This is the only question with real math behind it. The rule of thumb on my team: if a service runs more than about 40% of the hours in a month, ECS Fargate ends up cheaper than Lambda at typical request rates. Below that, Lambda's per-invocation pricing wins. AWS publishes a calculator; we don't trust it without checking against our own request-pattern data.

The cost crossover, drawn as two lines on a utilization-versus-cost plot. The horizontal axis is the fraction of hours per month the service is busy, from idle on the left to always-on on the right. Lambda's cost line starts near zero and climbs steeply with utilization, because you pay per invocation. Fargate's line starts higher (there's a floor for keeping a task running) but climbs gently, because an always-on task costs roughly the same whether it's busy or idle. The two lines cross at roughly 40 percent utilization: left of the crossover Lambda is cheaper, right of it Fargate is cheaper.
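The crossover can be roughed out in a few lines. This is a simplified sketch under loud assumptions: the prices are on-demand list prices as I understand them for us-east-1 on x86 and should be re-checked, the Lambda side models a single execution environment busy for a fraction of the month, and per-request charges, free tier, and data transfer are all ignored:

```python
HOURS_PER_MONTH = 730

# Assumed on-demand list prices (us-east-1, x86); verify before relying on them.
LAMBDA_PER_GB_SECOND = 0.0000166667
FARGATE_PER_VCPU_HOUR = 0.04048
FARGATE_PER_GB_HOUR = 0.004445


def lambda_monthly(utilization: float, memory_gb: float) -> float:
    """Lambda compute cost when one environment is busy `utilization` of the month."""
    busy_seconds = utilization * HOURS_PER_MONTH * 3600
    return busy_seconds * memory_gb * LAMBDA_PER_GB_SECOND


def fargate_monthly(vcpu: float, memory_gb: float) -> float:
    """Cost of one always-on Fargate task, busy or idle."""
    return HOURS_PER_MONTH * (vcpu * FARGATE_PER_VCPU_HOUR + memory_gb * FARGATE_PER_GB_HOUR)


def crossover(vcpu: float, memory_gb: float) -> float:
    """Utilization above which the always-on task is cheaper than Lambda."""
    return fargate_monthly(vcpu, memory_gb) / lambda_monthly(1.0, memory_gb)


point = crossover(vcpu=0.5, memory_gb=1.0)
```

With these assumed prices, a 0.5 vCPU / 1 GB task crosses over at roughly 41% utilization, in the same neighborhood as the rule of thumb. Different task sizes and real request patterns will move the number, which is the reason to check it against your own data.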

Different question: do you need to control instance shape (memory + CPU ratio, GPU access, specific networking) more tightly than Fargate exposes? If yes, ECS on EC2.

The flowchart, in three sentences

If cold-start sensitive, runs > 15 min, needs a runtime Lambda doesn't ship, or is busy enough that always-on is cheaper → not Lambda. If it needs instance control → ECS on EC2; otherwise Fargate. Else Lambda.

That's it. Four questions, three sentences, one decision.
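The same tree as a function, for teams that would rather lint it than tape it to a wall. This is a sketch of the questions as this post states them; the `Workload` fields and the default 0.40 crossover are illustrative names and values, and the instance-control check runs first because it forces ECS on EC2 regardless of the other answers:

```python
from dataclasses import dataclass


@dataclass
class Workload:
    cold_start_sensitive: bool    # question 1
    runs_over_15_min: bool        # question 2
    needs_custom_runtime: bool    # question 3
    utilization: float            # question 4: fraction of the month it is busy
    needs_instance_control: bool  # follow-up: GPU, CPU/memory ratio, networking


def choose_compute(w: Workload, crossover: float = 0.40) -> str:
    """Walk the questions and return the compute option the tree lands on."""
    if w.needs_instance_control:
        return "ECS on EC2"
    lambda_fits = not (
        w.cold_start_sensitive
        or w.runs_over_15_min
        or w.needs_custom_runtime
        or w.utilization > crossover
    )
    return "Lambda" if lambda_fits else "Fargate"
```

For example, `choose_compute(Workload(False, False, False, 0.1, False))` lands on Lambda, and flipping any of the first three answers moves it to Fargate.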

Where this breaks

The decision tree assumes you're already inside AWS. If you're greenfield and could go to a different cloud, that's a bigger conversation. The tree also assumes the workload is a service — for batch workloads, AWS Batch is the obvious answer and isn't in the tree because the question doesn't come up the same way.

It also doesn't address EKS. We use EKS at our company for the workloads where Kubernetes is a hard requirement from elsewhere in the org. That's a different decision than the compute one. If you're standing up a new service and asking "should I use Kubernetes" without an external constraint forcing the answer, the answer is no.

The meeting that doesn't happen anymore

The point of the tree isn't that I think compute decisions are simple. It's that they're not the most expensive decision in the architecture, and they deserve five minutes of room, not forty.

Our team's architecture reviews are now back to spending the bulk of the hour on the actual hard parts — data model, failure modes, observability — instead of relitigating Lambda vs ECS for the fifth time this quarter.

Print the tree. Tape it to the conference room wall. Get the time back.

Part 04 of 14

Building with AI/ML — Products & Operations · part 04

Apr 23, 2024

SageMaker Pipelines — week three notes

Three weeks into moving our ML pipeline onto SageMaker Pipelines. What stuck, what I'd swap, and the one decision I'd undo if I started over.

We've been moving our ML training pipeline onto SageMaker Pipelines for three weeks. The pipeline trains a small predictive model that lives behind one of our features; nothing exotic, just a real workload that was a tangle of glue scripts that nobody on the team wanted to own anymore.

Three weeks in, here's what I'd tell the next team about to do the same thing.

What SageMaker Pipelines actually is

A DAG runner for ML steps. Each step is a typed thing (ProcessingStep, TrainingStep, TransformStep, RegisterModel, CreateModelStep, LambdaStep, ConditionStep for conditional gates) and you wire them together in Python. The runtime is managed; the artifacts move through S3; lineage is tracked.

The simplest way to think about it: it's an Airflow that knows what an ML step is, runs on someone else's infrastructure, and integrates with the rest of the SageMaker ecosystem (Studio for notebooks, Model Registry for versioning, Model Monitor for drift) without you having to wire those integrations.

What stuck in week three

The Model Registry. Before this, our "model versioning" was a folder in S3 with a date in the name and a markdown file describing what was inside. The Model Registry replaces that with first-class versioning, approval workflows, and a model lineage graph that shows you which training run, which data slice, and which hyperparameters produced a given artifact. It is the single biggest "why didn't we have this years ago" feeling of the project.

Conditional steps. The pipeline has a step that compares the new model's eval score to the deployed one. If new < old by more than a threshold, the pipeline halts and the registration step never runs. This used to be a Slack thread on Friday afternoons. Now it's a JSON condition in the pipeline definition.
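The gate's logic, written as plain Python for illustration. This is not the SageMaker SDK; in a real pipeline the comparison would live in a condition step reading the eval step's output, and the `max_drop` threshold here is a made-up value:

```python
def should_register(new_score: float, deployed_score: float, max_drop: float = 0.01) -> bool:
    """Allow registration unless the new model trails the deployed one by more than max_drop."""
    return (deployed_score - new_score) <= max_drop
```

A small dip inside the tolerance still registers; anything worse halts before the registry is touched.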

The cache. Steps with identical input artifacts and config get cached. A data-prep step that takes 18 minutes runs once; subsequent pipeline executions that haven't changed the input or the code skip it. This made iterating on the back half of the pipeline (training, eval, deploy) about three times faster.

The mechanism is a cache key built from the step's inputs and config: change neither and the step is skipped, change either and it re-runs.
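A toy version of that idea, to show the shape of it. This is an illustration of input-and-config hashing in general, not how SageMaker implements its step cache; the in-memory dictionary and function names are assumptions for the sketch:

```python
import hashlib
import json


def cache_key(step_name: str, inputs: dict, config: dict) -> str:
    """Stable hash of a step's inputs and config; equal keys mean the step can be skipped."""
    payload = json.dumps({"step": step_name, "inputs": inputs, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache: dict[str, str] = {}


def run_step(step_name: str, inputs: dict, config: dict, fn) -> tuple[str, bool]:
    """Run fn unless an identical call was cached; returns (result, cache_hit)."""
    key = cache_key(step_name, inputs, config)
    if key in _cache:
        return _cache[key], True
    _cache[key] = fn(inputs, config)
    return _cache[key], False
```

Sorting the keys before hashing matters: two configs with the same values in a different order should produce the same key.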

What I'd swap

The SDK is verbose. Defining a pipeline takes a lot of Python boilerplate — instantiating Processor, Estimator, ProcessingStep, wiring ProcessingInput and ProcessingOutput per step. The team kept getting tripped on subtle distinctions between argument shapes. I ended up writing a small wrapper around the most common patterns. If I were starting over, I'd write that wrapper on day one, not day fifteen.

Local mode is fragile. SageMaker Pipelines has a local-mode runner that's supposed to let you test pipelines without spinning up SageMaker resources. In practice it has enough corner cases — Docker permissions, IAM differences, paths that work on Linux but not on a Mac — that the team mostly gave up and tests against a dev SageMaker account. That's slower but more predictable.

Cost visibility lags. A pipeline run can spin up multiple instances across different step types, and reading the per-pipeline cost back out of Cost Explorer is harder than it should be. We're tagging every pipeline run with the experiment ID and writing our own cost report on top. Not hard, but you'll need to.

The one decision I'd undo

We tried to put everything in the pipeline on day one. Data prep, feature engineering, training, eval, model registration, batch transform, monitoring setup, even the Slack notification at the end — all of it as pipeline steps.

I would not do this again. The right move is to put the production-critical steps in the pipeline and leave the experimental steps in a notebook for as long as possible. Once a piece of the workflow is stable, then promote it. We've spent more time refactoring "experimental step in pipeline shape" than we did building the original pipeline.

The rule the team is settling on: if the step has been the same for a month and you'd be mad if it ran differently next week, it belongs in the pipeline. Otherwise it lives in a notebook.

How it ties into the rest

Two things I didn't expect to matter that matter:

JumpStart's Foundation Model tab is now where we go to test a candidate base model before integrating it into the pipeline. Claude 3 (which Bedrock got six weeks ago) and Llama 3 (which Meta dropped last week) both show up there alongside the older families. Useful for sanity-checking "would this model do the job" before paying for the integration.
Model Monitor + the Pipelines deploy step is the loop we didn't have before. Deploy from the pipeline, monitor automatically picks up the endpoint, and drift detection runs against the same data slice the pipeline used for eval. That last piece — the eval set and the monitoring set are the same set — is what makes the whole thing trustworthy.

What's next

Two things I'm going to tackle in the next sprint:

Pipeline triggers from S3 events. Right now the pipeline kicks off on a schedule. We want it to fire when new training data lands. Standard EventBridge → Lambda → Pipeline pattern, just haven't done it yet.
A second pipeline for the eval-only path. We want to be able to re-evaluate the current production model against new data without retraining. Separating the eval pipeline from the train pipeline so we can run them independently.
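For the first item, the Lambda in the middle of that EventBridge → Lambda → Pipeline chain could look roughly like this. It assumes the S3 bucket sends "Object Created" events to EventBridge; the pipeline name and the `InputDataUri` parameter are hypothetical, and the client is injectable so the handler can be tested without AWS:

```python
PIPELINE_NAME = "training-pipeline"  # hypothetical pipeline name


def handler(event, context, sagemaker_client=None):
    """Start one pipeline execution for the S3 object named in an EventBridge event."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the module loads without boto3 installed
        sagemaker_client = boto3.client("sagemaker")

    detail = event["detail"]
    input_uri = f"s3://{detail['bucket']['name']}/{detail['object']['key']}"
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=PIPELINE_NAME,
        # Assumes the pipeline declares a parameter with this name.
        PipelineParameters=[{"Name": "InputDataUri", "Value": input_uri}],
    )
    return {"execution_arn": response["PipelineExecutionArn"], "input": input_uri}
```

If many files land at once this starts one execution per object, so a real version would probably want batching or a debounce in front of it.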

Three weeks isn't long enough to have strong opinions. It's long enough to have noticed which mistakes are mine and which are the tool's. I'll write more in a couple of months.

Part 05 of 14

Building with AI/ML — Products & Operations · part 05

Jul 17, 2024

Bedrock Agents, half a year in — the parts I actually use

Agents and Knowledge Bases for Bedrock have been GA since re:Invent. I said I'd run them through a real evaluation before trusting them; here's the verdict after about six months — what I kept, what I deferred, and the call that's obvious in hindsight.

Back in December I wrote that I'd put Agents and Knowledge Bases for Bedrock through a real evaluation before letting either near anything a customer touches. Both went GA at re:Invent — November 28th — and "GA two weeks ago" is not the same as "battle-tested," so the honest posture then was to watch other people find the sharp edges first.

It's mid-July now. We've had both running against real workloads for the better part of six months. This is the verdict I promised: the parts I actually use, the parts I deferred, and the one call that's obvious in hindsight.

What Agents actually is

A managed orchestrator. You define:

A base model — Claude 3 Haiku / Sonnet / Opus, Llama, Titan, Cohere, whatever Bedrock hosts.
A set of action groups — functions described by an OpenAPI spec that the agent is allowed to call. Each is backed by a Lambda you write.
An optional knowledge base — a managed vector store the agent can read from.
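To make "backed by a Lambda you write" concrete, here is a sketch of an action-group handler for a single order-lookup operation. The event and response shapes follow the OpenAPI-schema action-group contract as I understand it, so check them against the current documentation; the API path, `lookup_order`, and its canned data are hypothetical:

```python
import json


def lookup_order(order_id: str) -> dict:
    """Stand-in for a real order lookup; returns canned data."""
    return {"orderId": order_id, "status": "shipped"}


def handler(event, context):
    """Route an agent's tool call to the matching function and wrap the result."""
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if event["apiPath"] == "/orders/{orderId}" and event["httpMethod"] == "GET":
        status, body = 200, lookup_order(params["orderId"])
    else:
        status, body = 404, {"error": "unknown operation"}

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }
```

The handler contains no prompt logic at all. The agent decides when to call it, and the function only has to do one narrow thing and report back.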

You give the agent a goal, and it loops: think, call a tool, observe the result, think again, until it's done or the loop budget runs out. AWS owns the prompt scaffolding, the tool-use formatting, the retry behavior, and the trace logging. You own the tools and the goal.

It is not magic. It is router code with model-shaped opinions. Six months of running it hasn't changed that sentence — it's confirmed it. Which is the whole reason I trust it for some jobs and not others.

What I kept

Three things made it past the evaluation and into something real.

Customer-support deflection. Knowledge Base over our support docs, action groups for the three or four things a support rep actually does — look up an order, send a refund link, open a ticket. The agent answers the easy questions and opens a ticket on the hard ones. This was the obvious win, and it's the one I'd start a team on if they asked. The failure mode is benign: worst case, the agent opens a ticket a human was going to open anyway.

Internal ops bots. "Spin me up a dev environment for project X." Action groups wrapped around our dev infra. The agent reasons about what the engineer wants, calls the right tools, reports back. Saves the platform team an interrupt a day. This one lives behind the VPN, which matters — I'd never have shipped it this fast facing the public internet.

Data-analyst copilot, with a human on the trigger. Knowledge Base over the data catalog — table schemas, column descriptions, recent queries — and an action group that hits Athena. The agent turns a business question into SQL and runs it. The hard rule we settled on after the evaluation: a human reviews the final query before it executes against anything that costs money or touches PII. This is not autopilot for analysts. It's a faster first draft with a person in the loop, and the loop is non-negotiable.

The common thread in all three: a wrong answer is cheap. That's the line I'd draw for anyone deciding what to put an agent on first.

What I deferred

Two things I looked hard at and chose to wait on.

Agent-as-the-product. The current loop is good for task completion under supervision. It is not good for fully autonomous behavior on a long-horizon task — the thing where a user types a request and the agent runs unsupervised for an hour. I ran exactly that experiment during the evaluation, and I spent more time writing guardrails for the failure cases than I spent on the agent itself. The loop is reliable for ten-minute tasks with a human watching the trace. Stretch it to an hour alone and the error rate compounds turn over turn. If your product is the unsupervised agent, the framework isn't there yet. Wait.

The math is unforgiving: if each turn is right 97% of the time, ten turns land near 74% end-to-end and thirty turns near 40%. That's the whole reason the supervised ten-minute task ships and the autonomous hour doesn't.
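The arithmetic behind those numbers, as a sketch. It assumes each turn succeeds independently with the same probability, which is a simplification; `max_turns` is a helper I added to turn the question around:

```python
import math


def end_to_end(per_turn: float, turns: int) -> float:
    """Probability that every one of `turns` independent turns is right."""
    return per_turn ** turns


def max_turns(per_turn: float, target: float) -> int:
    """Largest number of turns that keeps end-to-end success at or above `target`."""
    return math.floor(math.log(target) / math.log(per_turn))
```

At 97% per turn, `end_to_end(0.97, 10)` is about 0.74 and `end_to_end(0.97, 30)` is about 0.40. Turned around, holding the whole task above 90% allows only three turns, which is why supervision has to carry the rest.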

Multi-agent orchestration. Bedrock's model today is one agent with multiple action groups. Agents calling other agents is something you build yourself on top; there's no managed multi-agent primitive in Bedrock right now. You can do it: have one agent's action group invoke a second agent. But you're writing the coordination, the message passing, and the failure handling by hand. The open-source frameworks (LangChain, LlamaIndex, and the newer CrewAI) have more developed patterns for that today. If multi-agent is core to your design, the pragmatic split is: Bedrock for the model calls and the procurement story, one of those frameworks for the orchestration on top. I'm watching to see whether AWS ships a managed version of this; I'd bet they do, but I'm not building my roadmap on a bet.

The one decision now obvious in hindsight

A year ago — before re:Invent — we wrote our own minimal agent loop in Python. Tool-use formatting, retry behavior, trace logging, the works. About 600 lines that one engineer maintains.

If Agents had been GA when we started, we wouldn't have written that code. We'd have used Agents from the jump. Now that I've run both side by side for six months, the honest read is: the 600 lines and the managed Agent do the same job, and the managed one does it with logging and traces I don't have to maintain. We're porting our home-grown loop onto Agents this quarter for the support and ops use cases — the cheap-mistake ones — and keeping our own loop only where we need control the managed version doesn't expose yet.

The lesson is the same one I relearn every time AWS GAs a managed service that overlaps something we built: the moment you write infrastructure, AWS GAs the managed version six months later. You can be annoyed about it or you can plan for it. The plan is — don't fall in love with the orchestration code. Fall in love with the rubric, the eval set, and the tool definitions. Those are durable and portable. The loop in the middle is interchangeable, and now it's interchangeable with something AWS keeps the lights on for.

What it cost me to learn that: roughly a quarter of an engineer's time maintaining a loop that a managed service now does for free. Not catastrophic. But it's the second time, and I'd like it to be the last.

What about Knowledge Bases?

Six months in: useful, and it replaced about 80% of the work of standing up our own RAG pipeline. You point it at an S3 bucket, it chunks the documents, embeds them, and exposes a query API. Chunking strategy and embedding model are configurable; the defaults are good enough that I left them alone for the support corpus. Pricing is driven by retrieval volume and embedding, not a flat per-seat fee, so a low-traffic internal tool costs a rounding error compared to standing up and babysitting our own vector store.

The 20% it didn't replace, and where I still hand-roll:

Hybrid search. If you need lexical and vector retrieval — keyword exact-match alongside semantic — you're assembling that yourself. The managed path is vector search over chunks, full stop.
Metadata filtering at retrieval time. The API supports it, but it was thinly documented when I built on it, and I lost an afternoon to trial-and-error getting filter syntax right. It works; it just wasn't the smooth path the rest of it is.

Neither gap was a dealbreaker for the support use case, which is plain semantic search over docs. Both would matter a lot more if I were retrieving over structured records. Know which one you have before you commit.

The thing I'm watching

Guardrails for Bedrock went GA in April — denied topics, content filters, sensitive-information redaction, word filters, applied at invocation. With Agents and Knowledge Bases GA since re:Invent and Guardrails GA since the spring, AWS now has the three pieces of "an LLM app, managed end to end." A year and a half after the post-ChatGPT scramble started, the managed versions of all three homemade pieces — orchestration, retrieval, and safety — are shipping. That cadence is fast even by AWS standards, and it's why the "don't fall in love with the loop" lesson keeps paying off.

The piece I'm still maintaining myself, and watching for a managed replacement, is the eval harness. Bedrock has Model Evaluation in preview right now. If it GAs and it's good enough to retire the regression set we run by hand, I'll port to it the same way I'm porting the agent loop — and I'll fall in love with the rubric, not the runner.

For now: Agents and Knowledge Bases are GA, they earned their place on the cheap-mistake jobs, and the bar to ship a supervised LLM feature inside AWS is genuinely lower than it was when I last wrote about this. The autonomous-agent dream is still a roadmap item, not a product. Knowing the difference is most of the job.

Part 06 of 14

Building with AI/ML — Products & Operations · part 06

Oct 15, 2024

Building a multi-tenant LLM platform, 0 → 1

The internal name was Plural: a multi-tenant LLM platform enterprises could trust with their own data. Three engineers to nineteen, eighteen months. The hard part was never the model calls.

The internal name was Plural. I'll write about it the way I write about any former employer's internals — the shape of the work, not the logos: a multi-tenant LLM platform that enterprise teams could trust with their own data.

That phrase — trust with their own data — is the whole engineering problem hiding inside a sentence the sales deck loved. It means real tenant isolation, data boundaries you can defend in a security review, and an audit story that holds up — not a good demo. That was the load-bearing work. The model calls were the easy part.

The floor kept moving

A stable application layer sitting on an abstraction line, with a row of interchangeable model blocks underneath that can be swapped without the layer above changing; a side note marks that model choice is an evaluation and a config value, not a rewrite.

The defining constraint of building on LLMs in 2024 was that the model layer underneath you changed every few weeks — new models, bigger context windows, new prices. Wire a product straight to a model and every one of those is a migration.

So the architectural bet was to put the product on an abstraction it could stand on while the models churned underneath: model choice became something you evaluate and configure, not something you rewrite around. That single decision is the reason a team of nineteen kept shipping product instead of chasing the model of the month — and it's the bet I'd make first on any platform sitting on a volatile layer, not just an AI one.
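A minimal sketch of what "model choice is a config value" can look like. Everything here is illustrative rather than Plural's actual code: the `ModelClient` interface, the `REGISTRY` dictionary, and the two stub backends are hypothetical names, and the stubs stand in for real provider calls.

```python
from typing import Callable, Dict

# Hypothetical interface: every backend is just "prompt in, text out".
ModelClient = Callable[[str], str]


def stub_model_a(prompt: str) -> str:
    # Stand-in for a real provider call (assumption: no network in this sketch).
    return f"[model-a] {prompt}"


def stub_model_b(prompt: str) -> str:
    return f"[model-b] {prompt}"


# Registry keyed by a config value: swapping models edits config, not callers.
REGISTRY: Dict[str, ModelClient] = {
    "model-a": stub_model_a,
    "model-b": stub_model_b,
}


def get_model(config: Dict[str, str]) -> ModelClient:
    """Resolve the configured model and fail loudly on an unknown name."""
    name = config["model"]
    if name not in REGISTRY:
        raise KeyError(f"unknown model {name!r}; known: {sorted(REGISTRY)}")
    return REGISTRY[name]


def summarize(text: str, config: Dict[str, str]) -> str:
    # Product code depends only on the interface, never on a vendor SDK.
    return get_model(config)(f"Summarize: {text}")
```

With this shape, a model swap is a one-line config change (`{"model": "model-b"}`) that the eval set can score before anyone commits to it.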

Evals were the quality system, not a ritual

A pull request flowing into a held-out evaluation set, scored against a rubric; a threshold gate then forks to "ship" on pass and "blocked" on fail — the regression gate for a non-deterministic system.

A non-deterministic system needs a regression gate the way deterministic code needs a test suite — it just can't be a string compare. We ran a weekly eval rubric the whole team scored against a held-out set, and it became the thing we shipped against, not a report we filed after.

Two things fell out of that. A model swap stopped being a leap of faith and became a number you could read before you committed. And when the data said a feature wasn't earning its place, we killed it — including one my CEO loved. The rubric outranked the org chart. That's the only way a call like that ever survives.

Multi-tenant is a discipline, not a feature

Isolation is the kind of thing that's cheap when you design it in and brutally expensive when you bolt it on after the first enterprise customer signs. The platform's tenancy story was a first-class design constraint from the first architecture review, not a hardening pass before launch — because "trust with their own data" is a promise you either build the system around or spend the next year apologizing for.

Scaling the team and the system together

Three engineers to nineteen over eighteen months. Seven enterprise customers. And the number I'm quietly proudest of: zero lost weekends — because the abstraction layer and the eval gate meant we were rarely firefighting a model change at midnight. Calm is a leading indicator that the architecture is doing its job.

The launch playbook and the post-launch retro we wrote became company-standard, and they traveled with those engineers wherever they went next. That's the only real test of whether the rigor was load-bearing or theater: does it outlive the project and the people who wrote it.

What I'd carry into the next one

Build the abstraction over the volatile layer before you have to. The migration you avoid is worth more than the one you execute well.
Stand up the eval gate before there's much to gate. It's culture as much as code — the team has to believe the number outranks the opinion, and that belief is easier to build early.
Treat isolation as a design constraint, not a feature. "Trust with their own data" is architecture, not a checkbox.

The eval discipline, distilled to its smallest open-source shape, is the Eval Starter Kit; the rest of this thread runs through the Building with AI/ML notebook.

Part 07 of 14

Building with AI/ML — Products & Operations · part 07

Dec 09, 2024

Three things from re:Invent that actually change my roadmap

AWS announced two hundred things last week. Three of them actually change what we'll build next year. Notes from re:Invent 2024, written on the plane home.

re:Invent 2024 ended Friday. I was there. The announcement pace was, as always, deliberately overwhelming. I spent yesterday afternoon writing the list of everything AWS announced; it was four pages.

The list of things that actually change my team's roadmap is much shorter. Three items. Here they are, with the reasoning.

1. Bedrock Prompt Caching — agent costs got cheaper by an order of magnitude

What it is: Bedrock now lets you mark portions of your prompt as cached. On subsequent calls that share the same prefix (system prompt, tool definitions, knowledge-base context), you pay a fraction of the input-token cost for the cached portion. It shipped alongside Intelligent Prompt Routing, the other half of the same cost-optimization push.

The caveat I'm writing down before I get excited: it's a preview, not GA. It's live in us-west-2 for Claude 3.5 Sonnet v2 and Claude 3.5 Haiku, and in us-east-1 for the new Nova models. A preview feature can change its pricing, its API surface, or its cache TTL between now and GA, so anything I plan on it is planned in pencil.

Why this matters for us: our agent loops re-send the same 4K-token system prompt + tool definitions on every iteration. With a 10-step agent run, we were paying for that 4K prefix ten times. With prompt caching, we pay full price once and the cache rate after.

The math, on the preview's published discount: at Claude 3.5 Sonnet v2 pricing, our agent's per-run cost drops from roughly $0.18 to $0.04. We run about 50,000 agent invocations a month. That's $7K/month back, which we'll redirect into the more aggressive evaluation runs we'd been deferring on cost.
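The arithmetic behind those figures, as a small sketch. The per-run costs and the monthly volume come from the paragraph above; the token accounting is a simplification that assumes only the first step of a run pays full price for the prefix, and it ignores any cache-write surcharge or cache expiry.

```python
def monthly_savings(cost_before: float, cost_after: float, runs_per_month: int) -> float:
    """Dollars saved per month when the per-run cost drops."""
    return round((cost_before - cost_after) * runs_per_month, 2)


def full_price_prefix_tokens(steps: int, prefix_tokens: int, cached: bool) -> int:
    """Prefix tokens billed at the full input rate over one agent run.

    Assumption: with caching, only the first step pays full price for the
    prefix; later steps are billed at the cheaper cache-read rate.
    """
    return prefix_tokens if cached else steps * prefix_tokens


print(monthly_savings(0.18, 0.04, 50_000))                 # 7000.0
print(full_price_prefix_tokens(10, 4_000, cached=False))   # 40000
print(full_price_prefix_tokens(10, 4_000, cached=True))    # 4000
```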

Roadmap change: the cost ceiling that was capping our agent rollout to power users lifts. Opening agent-assisted features to free-tier users in Q1 moves from "off the table at the old cost shape" to "on the table, if the feature reaches GA on a timeline and price I can commit to." I'm not promising a launch on a preview. I'm un-blocking the planning.

2. The SageMaker rebrand + unification — data and ML stop being two separate orgs

What happened: the product formerly known as SageMaker is now SageMaker AI. The umbrella product called SageMaker (no "AI") is the new unified data + ML platform that wraps SageMaker AI, Athena, Redshift, QuickSight, EMR, Glue, and Lake Formation under a shared workspace and lineage model.

Why this matters for us: at our company, "the data team" and "the ML team" report to different VPs, build on different stacks, and meet to argue about ownership boundaries about once a quarter. The unified SageMaker doesn't fix the org problem, but it does mean the tooling stops reinforcing the org problem.

Specifically: the pitch is that a SageMaker Unified Studio notebook can query Redshift, materialize the result into a governed lakehouse table, train a model on it via SageMaker AI, and run drift monitoring against it — all with the same lineage graph behind it. Previously, each of those steps lived in a different tool with different permissions and no shared lineage. The lakehouse sits on an open, Iceberg-compatible architecture, which matters because it means I'm not betting on a proprietary table format to get the unification.

The caveat, same as item 1: Unified Studio is in preview (GA slated for 2025). SageMaker AI itself and the lakehouse are GA, but the single-pane studio I'm most interested in is the part that isn't done yet. So this is a "watch it converge," not a "build on it."

Roadmap change: we were planning to invest a quarter in stitching together our own data+ML lineage tooling. That project is now scoped down to "evaluate whether SageMaker Unified does enough of it once it's GA." The honest answer is "probably, with caveats"; we'll know by end of Q1.

3. Bedrock Marketplace + the Nova family — pricing pressure on every model we use

What happened: AWS announced Amazon Nova, a family of in-house foundation models (Micro, Lite, and Pro now GA, with Premier coming in early 2025) that AWS is pricing at least 75% under the best performer in each intelligence class on Bedrock. Separately, Bedrock Marketplace opened the door to over 100 additional models: Mistral's NeMo Instruct, TII's Falcon, Writer's Palmyra-Fin for finance, biology-specific models from EvolutionaryScale, and a long tail of specialty and fine-tuned models. They're reachable through the same Bedrock APIs, but the mechanism is worth noting: a Marketplace model deploys onto a SageMaker endpoint you provision, so it's not the serverless, pay-per-token shape the built-in Bedrock models have. That changes the cost math for anything I'd run through it.

Why this matters: the model layer in our application has been a Claude-3.5-Sonnet monopoly for nine months. That was fine when Claude was clearly the best fit; it's stopped being a defensible monopoly now that Nova Pro exists at lower price and Marketplace has fine-tuned specialists for specific tasks.

Roadmap change: we're going to run our eval suite against Nova Pro and a couple of Marketplace models in Q1 to see if we can split the work — Nova for the high-volume cheap-classification path, Claude for the harder reasoning path. If the eval comes out favorable, that's another 30 – 40% cost reduction on top of prompt caching.

This is the most interesting kind of roadmap change: not a capability change, but an optionality change. The work we do is now eval-driven on the model choice, not vendor-driven.

Three things I'm not (yet) changing my roadmap for

To be honest about scope:

Aurora DSQL. Announced as a distributed SQL database with multi-region active-active writes. Genuinely interesting; we don't have a workload that needs it today. I'm watching it for the next greenfield service.
S3 Tables. Apache Iceberg natively on S3, with automatic compaction and snapshot management. We have an Iceberg-on-S3 setup we maintain ourselves. The migration to managed S3 Tables is a Q2 candidate, not a Q1 one.
Trainium2. New chip, big speedups for model training. We don't train foundation models, so this is a "watch the per-token pricing on Nova" effect for us, not a direct change. The teams that do train are absolutely going to look at it.

The pattern that's emerging

The caveat that runs under all three items is maturity. Two of the things I care about most are still in preview, and a preview can move its price, API, or cache TTL before GA, so I sort the roadmap by what I can build on now versus what I can only plan in pencil.

Three patterns I noticed across the announcements:

AWS is making the LLM-app stack cheaper to operate, not more capable. Prompt caching, model distillation, intelligent prompt routing — all cost-optimization features for workloads that work. The capability frontier moved less this year; the cost frontier moved a lot.
The "managed AI" pattern is converging with the "managed data" pattern. SageMaker unification, S3 Tables, Aurora DSQL — these are all bets that the data infrastructure under an AI app is the infrastructure under any modern app. We're going to talk about "AI platforms" less in 2025 and "data platforms with AI on top" more.
First-party AWS models are credible threats to third-party choices. Nova Pro is not Claude 3.5 Sonnet. It is, however, good enough that the choice now requires running an eval. That was not the case at this time last year, when Titan was the only Amazon-built foundation model and nobody was using it for anything serious.

The flight home

I'm typing this on the plane. The list of announcements is four pages; the list of roadmap changes is three items. That ratio is normal for re:Invent and I think it's healthy.

The temptation, especially as a leader, is to come back from a conference like this and want to rewrite the roadmap to chase what you saw. Don't. The roadmap that was right two weeks ago is mostly still right. Three new things changed; the rest didn't.

Back to work tomorrow.

Part 08 of 14

Building with AI/ML — Products & Operations · part 08

Feb 25, 2025

Knowledge Bases for Bedrock in production — the gotchas list

Seven months running Bedrock Knowledge Bases in production. Five things the docs don't tell you, three workarounds that earned their keep, and one place I'd still build my own.

We moved a customer-facing retrieval feature onto Knowledge Bases for Amazon Bedrock last July. The service itself had been GA since re:Invent 2023 — we just sat on it for the better part of a year, watching other people find the sharp edges before we trusted our own retrieval to it. It's been seven months since we made the jump. The feature is in production, our team has stopped maintaining the homegrown RAG pipeline we replaced, and most days the managed version is better than what we had.

Most days. Not every day. Here's the list of production gotchas we've collected, the workarounds we landed on, and the one place I'd still build my own.

Five things the docs don't tell you

Knowledge Bases runs the whole RAG pipeline for you — your docs in S3 get chunked, embedded into vectors, indexed, then retrieved and fed to a model at query time. That's the pitch, and it's real. The gotchas all live at the seams: where you chunk, where you retrieve and filter, when the source syncs, and what it costs to change your mind and re-ingest.

One: chunking strategy is the knob. The default fixed-size chunker (300 tokens, 20% overlap) is fine for prose-heavy docs. For our docs — a mix of API references, code samples, and tutorial narrative — it was terrible. Code samples got cut mid-function; API parameter tables got split across chunks. Retrieval quality looked random until we switched to hierarchical chunking and added semantic chunking for the prose sections.

The lesson: eval before you trust the defaults. Run a fixed eval set against three chunking strategies before committing. You'll save yourself two months of "why is retrieval mediocre."
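To see why a fixed-size chunker cuts code mid-function, here is a minimal sketch of the technique. It is not Bedrock's implementation: it assumes the input is already a list of tokens (a real tokenizer will count differently), and the function name is hypothetical. The 300-token size and 20% overlap are the defaults quoted above.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], size: int = 300, overlap: float = 0.2) -> List[List[str]]:
    """Split tokens into windows of `size`; each window repeats the tail of the previous one."""
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap must be in [0, 1)")
    overlap_tokens = round(size * overlap)
    step = max(1, size - overlap_tokens)
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

The windows are placed purely by token count, so a function body or a parameter table that straddles a boundary gets split no matter what it means. That is the failure the hierarchical and semantic strategies are meant to avoid.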

Two: hybrid search is opt-in and you absolutely want it. The default retrieval mode is dense vector search. For queries with specific identifiers, error codes, or product SKUs, dense search alone loses recall: the embedding model has never seen E014_OVERTORQUE and has no way to know it should be matched as a literal. The hybrid mode (which landed for OpenSearch Serverless back in spring 2024, before we'd even adopted KB) combines lexical (BM25) and vector retrieval. We turned it on six weeks in and recall on the "exact code lookup" class of queries jumped from 60% to 95%.

Three: metadata filters are powerful but quietly limited. You can attach metadata to each chunk at ingest and filter on it at retrieval time. Useful. The gotcha: filters are evaluated after the initial retrieval, so if your filter is restrictive (e.g., product == 'X' when only 5% of the KB is about product X), you can get back zero results because the top-K dense matches were all about other products. The workaround: increase the retrieve count substantially before filtering, or shard the KB by major dimension.
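Gotchas two and three both come down to what goes into the retrieval request. The sketch below only builds the keyword arguments for boto3's `bedrock-agent-runtime` `retrieve` call, so it runs without AWS credentials. The metadata key `product`, the `overfetch` multiplier, and the cap of 100 results are assumptions for illustration; check the current limits and your own metadata schema.

```python
from typing import Any, Dict, Optional


def build_retrieve_kwargs(kb_id: str, query: str, k: int,
                          product: Optional[str] = None,
                          overfetch: int = 4) -> Dict[str, Any]:
    """Build kwargs for a Knowledge Bases `retrieve` call (hybrid search, optional filter)."""
    vector_cfg: Dict[str, Any] = {
        "overrideSearchType": "HYBRID",  # lexical + vector instead of the default
        "numberOfResults": k,
    }
    if product is not None:
        # Restrictive filter: ask for more candidates, then trim to k afterwards.
        vector_cfg["numberOfResults"] = min(k * overfetch, 100)  # assumed upper bound
        vector_cfg["filter"] = {"equals": {"key": "product", "value": product}}
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_cfg},
    }


# Usage (needs AWS credentials, so left as a comment):
#   client = boto3.client("bedrock-agent-runtime")
#   results = client.retrieve(**build_retrieve_kwargs("KB_ID", "E014_OVERTORQUE", 5, "X"))
#   top = results["retrievalResults"][:5]
```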

Four: re-ingest is your worst-case operation. Modifying chunking or embedding settings requires re-ingesting the whole KB. For a small KB this is fine. For ours (40K documents, mixed sources) it takes about four hours and costs a few hundred dollars in embedding calls. Plan for this. We have one engineer's day per quarter budgeted to "re-ingest after settings change."

Five: the data-source sync is eventually consistent on the order of minutes to hours. When you upload new content to the S3 bucket backing the KB, the docs say "sync the data source." That sync is not instant. For a few thousand docs it's about 15 minutes. For larger updates we've seen it take over an hour. Don't write code that assumes "uploaded the file, next query returns it." It won't.
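For gotcha five, one way to avoid the "uploaded the file, next query returns it" assumption is to wait on the ingestion job explicitly. This is a sketch, not our production code: `get_status` is a hypothetical callable that is assumed to wrap boto3's `bedrock-agent` `get_ingestion_job(...)["ingestionJob"]["status"]`, and the timeout and poll interval are placeholder values.

```python
import time
from typing import Callable

TERMINAL_OK = {"COMPLETE"}
TERMINAL_BAD = {"FAILED", "STOPPED"}


def wait_for_sync(get_status: Callable[[], str],
                  timeout_s: float = 7200,
                  poll_s: float = 30,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until the ingestion job reports COMPLETE; raise on failure or timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_OK:
            return
        if status in TERMINAL_BAD:
            raise RuntimeError(f"ingestion ended with status {status}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting. Whether a finished job means the content is immediately queryable is worth verifying against your own KB before relying on it.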

Three workarounds that earned their keep

A custom retrieve-then-rerank step. The managed retrieval is OK, but for most of our run there was no managed reranking — so we built our own. We added a small Lambda that retrieves 20 chunks via the KB API, then reranks with a cheaper Bedrock model (Nova Micro now; was Haiku before) before passing the top 5 to the answering model. This added ~250 ms of latency and improved answer quality enough that our eval scores went up 8% on the harder categories.

AWS shipped a managed Rerank API at re:Invent in December — you flip on a reranker model (Amazon Rerank 1.0 or Cohere Rerank 3.5) right in the Retrieve / RetrieveAndGenerate call, no Lambda. I'm evaluating it against our hand-rolled step now. Early read: the managed reranker is competitive on quality and obviously less code, but our custom step uses a cheap generative model and lets us inject a little task-specific instruction into the ranking that a dedicated reranker model doesn't take. I haven't decided. If the managed numbers hold, I'll happily delete the Lambda — getting out of the plumbing business is the whole point.

Per-query retrieval count. Some questions are narrow ("what's the parameter for X"); some are broad ("give me an overview of Y"). We added a small classifier (one more Bedrock call) that picks K = 3 for narrow questions and K = 10 for broad ones. Cost: trivial. Quality: noticeable.

A separate "fact" KB and "narrative" KB. Our docs include both reference material (facts) and tutorials (narrative). Mixing them in one KB hurt — retrieval would pull tutorial paragraphs when the question wanted an exact parameter. We split them into two KBs and route queries to one or the other (or both) based on the classifier above. Better separation of concerns, easier to evolve independently.
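The first two workarounds reduce to two small functions. This is a sketch under assumptions: `score` is a hypothetical stand-in for the cheap-model relevance call, the `"narrow"` label is assumed to come from the classifier, and the toy scorer in the usage example just counts shared words.

```python
from typing import Callable, List


def pick_k(question_kind: str) -> int:
    """K = 3 for narrow questions, K = 10 for broad ones (the values quoted above)."""
    return 3 if question_kind == "narrow" else 10


def rerank(chunks: List[str], score: Callable[[str], float], top_n: int = 5) -> List[str]:
    """Keep the top_n chunks by descending score; ties keep their retrieval order."""
    return sorted(chunks, key=score, reverse=True)[:top_n]


# Toy usage: score by how many query words appear in the chunk.
query_words = {"timeout", "parameter"}
candidates = ["intro to the product", "the timeout parameter is ms", "timeout overview"]
best = rerank(candidates, lambda c: len(query_words & set(c.split())), top_n=2)
print(best)  # ['the timeout parameter is ms', 'timeout overview']
```

In the real step the 20 retrieved chunks go in and the top 5 come out; only the scoring function changes.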

The one place I'd still build my own

Multi-step retrieval over structured data. Knowledge Bases got structured-data support at re:Invent last December — KBs can now answer questions over SQL-shaped data by generating queries. We tried it. For our use case (joins across more than two tables, with business-rule filters layered on), the generated SQL was hit-or-miss and the round-trip latency was too high.

We're still maintaining our own structured-data pipeline (Glue catalog → Athena → schema-aware prompt to a Bedrock model). It's more code than the managed version would be, but the quality is materially better. I'd revisit if AWS improves the structured-data path in the next six months.

What changed this week

Claude 3.7 Sonnet dropped yesterday and is already available in Bedrock. We tested it against our eval set this morning. Mixed — better on multi-step reasoning, comparable on retrieval-grounded Q&A (which is what KB feeds). For our RAG feature, we'll stay on 3.5 Sonnet for now. For the agent feature (different post), 3.7 looks like an upgrade.

DeepSeek-R1, which dropped last month and made every news cycle, isn't offered as a fully managed, serverless Bedrock model yet, but it can be deployed through Bedrock Marketplace. Worth watching; the cost shape is genuinely different from anything else available.

The bigger lesson

Managed services give you the 80% case quickly and force you to understand the system better to handle the remaining 20%. Before we used Knowledge Bases, we maintained 1,200 lines of RAG plumbing and pretended we understood it. After: 0 lines of plumbing, much deeper understanding of chunking, retrieval modes, and reranking — because the gotchas are now the only thing left to think about.

I'd take that trade again. Managed RAG isn't the whole product. It's the thing that gets you out of the plumbing business so you can argue about the actual quality knobs.

Six months from now I'll have moved one or two of the workarounds onto whatever AWS ships in mid-2025. The list will look different. The discipline won't.

Part 09 of 14

Building with AI/ML — Products & Operations · part 09

Mar 19, 2025

An open-source eval starter kit for product managers

Five files, one notebook, one rubric. The point is to stop talking about evals and start running them — on a Wednesday.

Most product teams ship LLM features on vibes. Someone tweaks a prompt at 4pm, it feels better in the staging window, and they merge. Three weeks later a quiet regression shows up in support tickets nobody can trace.

Vibes are a feature flag without a kill switch. The fix isn't a platform — it's an eval loop you can run on a Wednesday afternoon.

I put one in a repo. Five files, one notebook, one rubric:

→ github.com/drlukeangel/Eval-Starter-kit-Product-Management

You can git clone it and have evals running in fifteen minutes, or you can skip the keys entirely and paste the rubric template into ChatGPT or Claude to score one example at a time. Both modes are in the README.

The 4-step PM eval playbook — golden dataset, metrics, scoring, CI

What's actually in there

An evaluation starter kit replaces subjective vibe checks with automated testing plus AI-driven observability. The three pieces every kit needs are:

A golden dataset — real-world inputs paired with the ideal responses.
A scorer — usually a stronger judge LLM grading on a written rubric.
A runner — to benchmark a prompt or model change end-to-end and tell you whether it got better, worse, or the same.

This repo gives you a minimal one of each:

| File | Job |
| --- | --- |
| `golden_dataset.jsonl` | 30 PM-flavored prompts + ideal answers |
| `rubric.md` | Four axes graded 1 – 5 |
| `judge.py` | LLM-as-judge: scores a single response |
| `eval.py` | The runner: model under test → judge → report |
| `test_evals.py` | `pytest` integration so evals run in CI |
| `eval_walkthrough.ipynb` | Notebook walkthrough for a standup demo |
| `prompt_template.md` | Copy-paste mode, no API key needed |

Stack is intentionally boring: python, pytest, openai. If you have an OpenAI key, python eval.py does the whole thing. If you don't, prompt_template.md is the whole rubric folded into a prompt you paste into any chat.

The 4-step PM playbook

The repo encodes one loop. Run it on every LLM feature you own.

1. Define the golden dataset. Compile 30 to 50 real user queries and the responses you'd want to see. Weight the dataset toward (a) the most common scenario, (b) the failure mode that would be expensive to ship, and (c) the awkward edge cases the model always trips on. Avoid synthetic examples that sound nothing like real users.

2. Set your metrics. What does good mean for this feature? The defaults in the repo are factual accuracy, tone, format adherence, and hallucination rate — solid PM-grade axes. Keep the count small (≤ 5) so the signal stays legible.

3. Choose your eval method. Two flavors, and you usually want both:

Deterministic — exact string match, regex, JSON-schema. Use this whenever the answer is exact ("does the JSON parse?").
Model-based — a stronger model like gpt-4o or claude-3.5-sonnet grades semantic quality on a 1 – 5 scale. Use this for tone, faithfulness, and anything else where there's no single correct string.

4. Wire it into CI/CD. Run the eval on every prompt change or model version bump. Fail the build when the score drops below your threshold. Thirty examples through gpt-4o-mini plus a gpt-4o judge is pennies per run. Run it on every pull request without thinking about cost.
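The deterministic flavor from step 3 needs nothing beyond the standard library. A minimal sketch follows; the function names are hypothetical and this is not the code in the repo.

```python
import json
import re


def json_parses(output: str) -> bool:
    """Exact check: does the model output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def matches(output: str, pattern: str) -> bool:
    """Exact check: does the whole (stripped) output match a regex?"""
    return re.fullmatch(pattern, output.strip()) is not None


print(json_parses('{"status": "ok"}'))            # True
print(json_parses("{status: ok}"))                # False
print(matches(" 2025-03-19 ", r"\d{4}-\d{2}-\d{2}"))  # True
```

Checks like these are cheap enough to run before the judge model, so a response that fails to parse never costs a judge call.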

The mechanical shape of that gate is the whole reason this works — the score stops being a number someone eyeballs in a notebook and becomes a pass/fail that blocks a merge, same as a unit test:

How the eval becomes a regression gate in CI. A prompt or model change opens a pull request; the runner replays the golden dataset through the model under test, the judge scores each response on the rubric, and the mean score is compared to a threshold. At or above the bar the build passes and the change can merge; below it the build fails and the regression is blocked before it ships.
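The gate described above can be sketched as a single pytest-style test. The threshold of 4.0 and the sample scores are made up for illustration, and the names are hypothetical rather than what `test_evals.py` in the repo actually contains.

```python
from statistics import mean
from typing import List

THRESHOLD = 4.0  # hypothetical bar on the 1-5 rubric scale


def gate(scores: List[float], threshold: float = THRESHOLD) -> bool:
    """Pass only when the mean judge score is at or above the threshold."""
    if not scores:
        raise ValueError("no scores: refusing to pass an empty eval run")
    return mean(scores) >= threshold


def test_eval_gate() -> None:
    # In a real run these would come from the judge, one per golden example.
    scores = [5, 4, 4, 5, 3]  # made-up values, mean 4.2
    assert gate(scores), f"mean score {mean(scores):.2f} is below {THRESHOLD}"
```

Refusing to pass an empty run is a deliberate choice: a broken runner that scores nothing should fail the build, not slip through as a vacuous pass.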

Choosing a heavier framework once you outgrow it

This kit covers the first 80% so you can decide which 20% you actually need. When you outgrow it, four open-source frameworks worth knowing:

Arize Phoenix — best for privacy-first teams that need a self-hosted observability layer. Excels at tracing multi-step agents; ships with built-in prompt management.
DeepEval — best for Python-native teams who want evals to feel like pytest. Local-first, fast, broad assertion helpers.
Ragas — best if your product is RAG-based. Scores faithfulness, context relevancy, and answer relevancy against retrieved context.
Promptfoo — best for CLI-heavy, security-focused teams who want to run bulk evaluations against many LLMs at once.

The shape of the choice is product-shape and team-shape, not language. If you're a small PM team shipping your first LLM feature this quarter, start with this starter kit or DeepEval. Graduate later.

Why I keep coming back to this

Every team I've worked with that didn't have a written rubric ended up arguing about taste on a Slack thread for the third time that month. Every team that did — even a 30-line rubric and a spreadsheet of examples — moved faster and made better calls. The kit isn't sophisticated. The discipline is.

Five files. One notebook. One rubric. Wednesday. Go.

Part 10 of 14

Building with AI/ML — Products & Operations · part 10

Apr 15, 2025

The AWS spend audit I do every quarter

Four AWS spend leaks I find every quarter, no matter the org. The audit takes a Friday afternoon and usually pays for the next year of cloud bills.

Once a quarter, the Friday before the budget meeting, I block four hours and audit our AWS spend. I've done this at three different companies now. The four leaks I find are always the same four leaks. Writing them down so the next engineering leader can skip the rediscovery.

The map is always the same shape: four buckets, each with its own tell and its own lever. Here's what I'm looking for before I open Cost Explorer.

Leak 1: idle dev / staging compute

The single biggest finding, every time.

The shape: a dev account, half a dozen RDS instances and EC2 boxes from old projects, running 24/7 at full size. The team that owned them has rotated, the projects have shipped or died, nobody's looked at the dashboard in six months. The bill is 70% steady-state.

What I do: filter Cost Explorer to the dev / staging accounts, sort by service descending, find anything > $200/month, ask the owning team (look up tags, fall back to "ask in #engineering") whether it's still needed. Schedule the answer: weekday-only auto-stop for "yes but only during the day," delete for "no," reservation for "yes and constant."
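The triage step is simple enough to script once you have per-resource monthly costs exported. A sketch follows, assuming a plain name-to-cost mapping; the resource names and dollar figures are made up for illustration, and the answer labels in `ACTIONS` are hypothetical shorthand for the three outcomes above.

```python
from typing import Dict, List

# The three answers from the owning team, mapped to the scheduled action.
ACTIONS = {
    "daytime only": "weekday-only auto-stop",
    "no": "delete",
    "constant": "reservation",
}


def triage(monthly_costs: Dict[str, float], floor: float = 200.0) -> List[str]:
    """Resources costing more than the floor per month, most expensive first."""
    over = {name: cost for name, cost in monthly_costs.items() if cost > floor}
    return sorted(over, key=over.get, reverse=True)


# Made-up export from the dev / staging accounts.
costs = {"rds-legacy-1": 3100.0, "ec2-demo": 45.0, "asg-loadtest": 6200.0}
for name in triage(costs):
    print(name, costs[name])  # asg-loadtest first, then rds-legacy-1
```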

Typical find at our company: $14K/month, mostly in three forgotten RDS instances and one auto-scaling group that scaled up during a 2023 load test and never scaled back down.

Leak 2: storage that should be cheaper

S3 standard everywhere, EBS gp2 instead of gp3, snapshots accumulating forever.

The shape: every team's S3 buckets default to S3 Standard. Most of the data is accessed once and then sat on indefinitely — logs, analytics dumps, backups. They should be on Intelligent-Tiering or, for proven cold data, Glacier Instant Retrieval.

What I do:

Run S3 Storage Lens across the org. Look at the "% of data not accessed in 90 days" stat per bucket. Anything over 50% should be on Intelligent-Tiering at minimum.
Audit EBS volumes. Anything on gp2 should be on gp3 (same or better performance, ~20% cheaper). Anything on io1/io2 should be inspected for whether it actually needs provisioned IOPS.
Audit EBS snapshots. There's almost always a snapshot policy that runs daily and never expires. Set a retention policy. The first run will delete a lot.

Typical find: $4K – $8K/month. The gp2 → gp3 conversion alone is usually $1K/month at scale.

Leak 3: data egress and inter-AZ chatter

The leak that's hardest to see and hardest to fix.

The shape: services in one AZ talking to a database in another, or a service in one VPC talking through a NAT gateway to S3 instead of through a VPC endpoint. Each individual call is fractions of a cent; the bill adds up to four figures a month because the architecture has the wrong topology and there are billions of calls.

What I do:

Pull the VPC Flow Logs, or group Cost Explorer by usage type and look at the regional (inter-AZ) data-transfer line items. Look for unexpected cross-AZ traffic between services that should be co-located.
Audit NAT Gateway data processing charges. If you're using NAT to reach S3 or DynamoDB, you're paying for nothing — those have VPC endpoints (Gateway endpoints, free). Audit which services should be on Interface endpoints (paid but cheaper than NAT at any real volume).
For multi-region: check whether replication is actually used. Cross-region replication on S3 is fine if you need it; if you don't, it's a 2x storage bill for no reason.

Typical find: $2K – $5K/month. The NAT-to-S3 trap is the most common single fix; one VPC endpoint deployment can save $1K/month immediately.
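The NAT trap is easier to argue for with the arithmetic written down. A minimal sketch, assuming approximate us-east-1 list prices (all constants below are assumptions to verify against current pricing) and hypothetical traffic volumes. It only models data-processing charges; the NAT Gateway's own hourly charge stays unless the gateway itself is removed.

```python
# Assumed list prices; verify against current pricing for your region.
NAT_PER_GB = 0.045            # NAT Gateway data processing, $/GB
INTERFACE_PER_GB = 0.01       # Interface endpoint data processing, $/GB
INTERFACE_PER_AZ_HOUR = 0.01  # Interface endpoint hourly charge, per AZ
HOURS_PER_MONTH = 730


def nat_cost(gb_per_month):
    """Data-processing charge for pushing traffic through a NAT Gateway."""
    return gb_per_month * NAT_PER_GB


def interface_endpoint_cost(gb_per_month, az_count):
    """Hourly charge per AZ plus per-GB processing for an Interface endpoint."""
    hourly = INTERFACE_PER_AZ_HOUR * HOURS_PER_MONTH * az_count
    return hourly + gb_per_month * INTERFACE_PER_GB


def monthly_saving(gb_per_month, endpoint_type, az_count=3):
    """Saving from moving one service's traffic off NAT onto a VPC endpoint."""
    if endpoint_type == "gateway":  # S3 and DynamoDB: no endpoint charge
        return nat_cost(gb_per_month)
    return nat_cost(gb_per_month) - interface_endpoint_cost(gb_per_month, az_count)


# Hypothetical volumes: S3 traffic via NAT, and a smaller Interface-endpoint path.
print(f"gateway endpoint:   ${monthly_saving(22_000, 'gateway'):,.2f}/month")
print(f"interface endpoint: ${monthly_saving(5_000, 'interface'):,.2f}/month")
```

With these assumptions, about 22 TB a month of S3 traffic through NAT is enough to account for a $1K/month line on its own.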

Leak 4: oversized everything

The leak the team will defend the hardest.

The shape: every EC2 instance, every RDS class, every Lambda memory setting was picked on day one based on a guess. Nobody has revisited. Compute Optimizer has been telling you to right-size for nine months. Nobody has looked.

What I do:

Open Compute Optimizer. Look at the recommendation list. Find anything with "high savings opportunity" labeled.
For Lambda specifically: most teams overprovision memory because they read once that "more memory = more CPU" and never re-evaluated. Right-sizing Lambda is measurably my fastest single source of savings — about 30% cost reduction across our Lambda footprint last quarter, ~$3K/month saved.
For RDS / Aurora: look at the CloudWatch metrics for CPU and memory utilization (CPUUtilization and FreeableMemory). Anything chronically under 40% should be one size down.

Typical find: $5K – $15K/month, depending on size of fleet. Often the team will push back ("but we might need it during a spike"). The right answer is auto-scaling, not over-provisioning the baseline.

The picture is the whole argument: you can pay for the spike's headroom every hour of every day, or you can pay for it only in the hours the spike shows up. Provisioning the baseline to the peak is buying insurance you already have — that's what auto-scaling is.
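The same argument as arithmetic. A minimal sketch with made-up capacity numbers and a hypothetical unit price; it ignores scale-up lag and any minimum billing increments.

```python
def monthly_cost(baseline_units, peak_units, spike_hours, unit_hour_price,
                 hours_in_month=730):
    """Return (provisioned_for_peak, autoscaled) monthly cost.

    provisioned_for_peak: pay for peak capacity every hour of the month.
    autoscaled: pay for the baseline all month, plus the extra units only
    during the hours the spike is actually present.
    """
    always_peak = peak_units * unit_hour_price * hours_in_month
    extra_units = peak_units - baseline_units
    autoscaled = (baseline_units * unit_hour_price * hours_in_month
                  + extra_units * unit_hour_price * spike_hours)
    return always_peak, autoscaled


# Illustrative: baseline of 4 instances, peak of 10, spikes total 40 hours a month.
always_peak, autoscaled = monthly_cost(4, 10, spike_hours=40, unit_hour_price=0.20)
print(f"provisioned for peak: ${always_peak:,.0f}  autoscaled: ${autoscaled:,.0f}")
```

The gap between the two numbers is the price of the headroom during the hours nothing is spiking.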

What I don't audit

A few things I deliberately don't touch on quarterly audits:

Reserved Instances / Savings Plans changes. Those are annual commitments — I revisit once a year, in tandem with our capacity-planning conversation, not quarterly.
Bedrock cost optimization. That's a different audit. Bedrock prompt caching went GA last week (finally), and it's already the biggest single lever I have on our LLM bill: cached input tokens are billed at a fraction of the normal input-token rate, so the long, repeated system-prompt context stops costing full freight on every call. Intelligent prompt routing is the other lever I'm watching, but it's still in preview as I write this, so I'm evaluating it, not committing a budget line to it yet. I do a separate AI-cost audit monthly because the workload is moving too fast for quarterly to keep up.
Anything below $100/month. Time is finite. The audit budget is $5K minimum-find before I'll pull the thread.

The script that runs every Monday

Beyond the quarterly audit, the team runs a small Lambda every Monday at 8am that posts to a Slack channel:

This month's spend vs the same period last month
Top three services by spend
Anything that crossed +20% week-over-week

Most weeks the channel is boring. The week when it isn't, we catch the leak the same week instead of three months later.

The quarterly audit finds the big structural leaks; the weekly Lambda is what keeps a new one from running unnoticed for a full quarter between audits. Different tools for different failure modes.
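The core of a Monday check like this is small. A minimal sketch of the comparison logic only, assuming the spend figures have already been fetched into plain dicts; the Cost Explorer query and the Slack post are omitted, the month-over-month line is left out, and the function name and sample numbers are hypothetical.

```python
def weekly_report(this_week, last_week, threshold=0.20, top_n=3):
    """Build the Monday summary from two {service: spend} dicts.

    Returns the top services by spend and every service whose spend rose
    by at least `threshold` week-over-week.
    """
    top = sorted(this_week.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    flagged = []
    for service, spend in this_week.items():
        previous = last_week.get(service, 0.0)
        if previous == 0.0:
            # New line item: no baseline to compare against, so surface it.
            if spend > 0:
                flagged.append((service, None))
            continue
        change = (spend - previous) / previous
        if change >= threshold:
            flagged.append((service, change))
    return {"top": top, "flagged": flagged}


# Made-up weekly spend, in dollars.
last = {"EC2": 5000.0, "RDS": 3000.0, "Bedrock": 1200.0, "S3": 880.0}
this = {"EC2": 5200.0, "RDS": 3100.0, "Bedrock": 1800.0, "S3": 900.0}
report = weekly_report(this, last)
for service, change in report["flagged"]:
    label = "new" if change is None else f"{change:+.0%}"
    print(f"{service}: {label} week-over-week")
```

Treating a brand-new service as always worth flagging is a design choice here: a line item with no history is exactly the kind of thing that runs unnoticed.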

The framing that lands with execs

When I present the spend audit findings to the CFO, the framing I use is:

AWS doesn't optimize your bill. AWS optimizes their revenue. The two are not the same hobby. Every dollar that's optimizable is a dollar AWS has no incentive to surface unless you go look for it.

That sentence has unlocked more "yes, do the audit" from leadership than any cost-savings number has. It's not anti-AWS; it's accurate. AWS Cost Explorer is a real product, but Cost Explorer makes data discoverable, not actionable. Actioning the data is a human in a chair, four hours a quarter.

That human is me, on Fridays before budget meetings. The same four leaks, every time.

Part 11 of 14

Jul 23, 2025

Bedrock AgentCore at Summit NY — what it actually changes

AWS announced Bedrock AgentCore at Summit NY last week. Not the same as Bedrock Agents — different product, different shape. What it actually changes for teams already running agents.

AWS Summit NY was last week. The keynote announced Amazon Bedrock AgentCore in preview. I've spent the last few days reading the docs, the blog posts, and the recordings, and pointing a small internal experiment at it to see if the shape is what I think it is.

The short version: AgentCore is not "Bedrock Agents 2.0." It's a fundamentally different product, aimed at a different problem, and it changes how I'd architect a new agent system today.

What AgentCore actually is

A set of modular, individually-priced services for running agents in production. Specifically:

Runtime — serverless execution environment for agent code (Strands, LangGraph, CrewAI, your own — doesn't matter, AgentCore is framework-agnostic).
Memory — managed short-term and long-term memory store, with built-in summarization and retrieval.
Identity — agent-side IAM, including delegated access to call tools as different end-users.
Tools — managed tool registry with built-in browser, code interpreter, and external API tools (Gateway exposes APIs and Lambda functions as MCP-compatible tools).
Observability — distributed tracing across agent steps, native OpenTelemetry.

You don't have to use all five. You can plug in just the memory layer alongside your existing LangChain stack. Or just the runtime. Or just the observability. That is the meaningful shift.

How it differs from Bedrock Agents

Bedrock Agents — the older product, GA at re:Invent in November 2023 — is a monolithic, opinionated agent stack: the loop, the prompt scaffolding, the tool-calling format, the model choice — all integrated. You define action groups, knowledge bases, and a base model, and Agents orchestrates the whole thing.

AgentCore is the opposite design:

Dimension	Bedrock Agents	Bedrock AgentCore
Shape	Monolithic, opinionated	Modular, composable
Framework	AWS-defined loop	BYO (Strands, LangGraph, etc.)
Coupling to Bedrock models	Tight	Loose — any model that supports tool use
Pricing	All-or-nothing	Per-service
Production maturity	GA	Preview

If you're shipping a brand-new agent today, AgentCore is the right shape if you can live with preview status. If you're already on Bedrock Agents in production, you don't have to migrate — but you'll probably want to.

What it changes for us

We've been running agents on a homegrown stack: Strands Agents SDK (open-sourced by AWS back in May) for orchestration, our own memory layer on DynamoDB, our own tool registry, OpenTelemetry to Honeycomb for tracing. About 4,000 lines of code we maintain.

AgentCore replaces roughly 2,800 of those lines with managed services. The work to migrate is real — re-wiring the memory layer is a couple of weeks, swapping our home-rolled OTel pipeline for AgentCore's is another — but it's bounded. Six weeks of engineering for a permanent reduction in surface area.

The math we ran:

Code we maintain: 4,000 → 1,200 lines.
Engineers needed to babysit the agent infra: 1.5 → 0.5.
Per-request cost of memory + tracing infra: ~$0.002 → roughly comparable (AgentCore prices these as managed services; the savings are in maintenance, not unit cost).

The reduction in engineering surface area is the win, not the cost. Going from 1.5 to 0.5 engineers on agent infra, we get a full FTE back to build product.

What I'd defer

Preview status is real. Three things to watch before I'd put this fully on the critical path:

GA timeline. AWS said "later this year" at the Summit. That usually means re:Invent (December). If your launch is in Q4, you're betting on a preview holding up.
Multi-region failover. The preview runs in a few regions. Once you've moved your memory layer to AgentCore, you depend on its regional availability. Production-shaped HA isn't yet documented.
Cost predictability at scale. Per-service pricing means agent runs that previously had a single $0.18 model cost now have a $0.18 model cost plus a memory call cost plus a tool registry cost plus a trace cost. We're modeling it on our actual workload; the answers depend on workload shape.

The thing nobody's saying out loud

Bedrock Agents is going to feel slowly deprecated.

AWS won't say that — they'll keep the product around — but the architectural energy has moved to AgentCore. Every example in the Summit talks, every reference architecture, every new feature announcement is in AgentCore terms. Bedrock Agents will continue to exist; it'll continue to get bugfixes. The new patterns are not going to land there.

If you're starting today: AgentCore. If you're already on Bedrock Agents in production: ride it for now, and expect to migrate once AgentCore goes GA.

Where this fits in the broader picture

Back in May, AWS open-sourced Strands Agents SDK, a Python framework for building agents that's agnostic about which model sits underneath. Now AgentCore.

The pattern, looking back: AWS is shipping the unbundled version of "agent infrastructure" piece by piece. Strands is the SDK. AgentCore is the runtime + memory + observability. Bedrock provides the models. Q Developer provides the developer-facing agent surface.

This is the same playbook AWS ran with Lambda + API Gateway + DynamoDB ten years ago — provide the pieces, let customers compose, win on the composition. It worked then. The signs are that it'll work for agents too. AgentCore is the keystone piece that makes the composition tractable.

What I'd do this week

If your team is shipping agents:

Spin up an experiment in a non-production account. Take a single agent feature, put it on AgentCore Runtime with whatever framework you already use.
Wire AgentCore Memory in alongside your current memory layer. Run both in parallel for a week. Compare what you get.
Decide by end of August whether AgentCore is on your post-re:Invent migration list.

We're doing this. I'll write a follow-up in November once we've had AgentCore through a real load test. For now: this is the most consequential agent announcement of 2025, and it's going to reshape how we architect agent systems for the next several years.

Part 12 of 14

Nov 26, 2025

Strands + AgentCore — a year-end agent-stack inventory

AgentCore went GA in October; we've run it in production since the preview in August. Seven months on Strands, plus a tour of the alternatives. What I'd recommend for a 2026 agent project, by team shape and constraint.

A reader emailed last week asking what stack I'd recommend for a new agent project starting in early 2026. I've been on the phone with three CTOs in the last month asking variants of the same question. Year-end inventory post, written in service of that conversation.

Where I land, by team shape

Team shape	What I'd pick	Why
AWS-native, agents in production, > 5 engineers	Strands SDK + AgentCore	Best operational story, lowest maintenance burden, GA since October
AWS-native, first agent, small team	Bedrock Agents (still)	Lowest cognitive cost, fully managed, GA for two years
Multi-cloud or vendor-agnostic	LangGraph + your own infra	Best portability, biggest community
Heavy multi-agent, role-based orchestration	CrewAI	Cleanest multi-agent abstraction
Research / fast iteration, willing to write infra	DSPy or LangGraph	Most expressive
Microsoft shop	AutoGen / Semantic Kernel	Better Azure / M365 integration

If you came here for the punchline and you're an AWS-native team running real production traffic, the short answer is: Strands + AgentCore. The rest of the post is why.

What changed in 2025

Year-end review of what shipped:

Strands Agents SDK (open-sourced by AWS, May 2025) — Python framework for agent orchestration. Model-agnostic, framework-agnostic. Replaces the "I'll write my own agent loop" instinct with something AWS now maintains.
Bedrock AgentCore (announced in preview July 2025 at NY Summit; GA October 13) — modular runtime + memory + gateway + tools + observability + identity. Composable; you can adopt one piece without the rest. GA brought the things that were missing in preview: VPC + PrivateLink, CloudFormation, resource tagging, eight-hour runtime sessions, A2A protocol support, and Gateway connecting to existing MCP servers (not just wrapping your APIs and Lambdas).
Bedrock Marketplace (re:Invent 2024 → matured throughout 2025) — third-party models behind the same Bedrock API. DeepSeek, more Mistrals, specialized fine-tunes.
Bedrock Prompt Caching (GA early 2025) — order-of-magnitude cost reduction on repeated prefix prompts. The single biggest lever for agent cost.
AgentCore Browser + Code Interpreter tools (managed) — replaces the "let's run Playwright in our own Lambda" pattern with a managed alternative.

The story across all five: AWS is shipping the unbundled pieces of agent infrastructure. You get to compose. The composition story is now actually tractable.

How we use Strands + AgentCore

We rode the preview through real traffic from August, then cut over to the GA APIs in the two weeks after October 13 — mostly a matter of re-pinning the SDK, moving our runtime into a VPC now that GA supports it, and putting the whole thing under CloudFormation. Our agent stack today:

Strands SDK for the agent loop. Python, model-agnostic, exposes a clean event-driven hook system.
Claude 3.7 Sonnet as the primary reasoning model. Claude 3.5 Haiku for cheap-classification routing. Nova Pro on a few high-volume paths where the cost differential matters and the quality is good enough.
AgentCore Memory for short-term and long-term memory. Replaced our DynamoDB layer.
AgentCore Gateway for tool registry. Exposes 20-odd internal APIs as MCP-compatible tools. Replaced the registry we'd built ourselves.
AgentCore Runtime for the production execution surface.
AgentCore Identity for delegated authorization (the agent calls tools as the end-user, not as a service principal).
Bedrock Prompt Caching for the shared system prompt and tool definitions across loop iterations.
OpenTelemetry → Honeycomb for traces (AgentCore emits native OTel; we kept Honeycomb).

That stack runs about 80,000 agent invocations per day. We maintain about 1,200 lines of code on top — almost all of it business logic and tool implementations. Down from 4,000 lines before the migration.
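Why prompt caching matters so much for an agent loop is clearest as arithmetic: the system prompt and tool definitions are re-sent on every iteration. A minimal sketch, assuming a cache write costs a bit more than a normal input token and a cache read costs a small fraction of one. The multipliers, the token price, and the token counts are all assumptions (real values vary by model), and it assumes every call lands inside the cache's lifetime.

```python
def input_cost(prefix_tokens, fresh_tokens, calls, price_per_1k,
               cache_read_mult=0.10, cache_write_mult=1.25, cached=True):
    """Input-token cost for `calls` model calls that share one prefix.

    prefix_tokens: shared system prompt + tool definitions, sent every call.
    fresh_tokens:  new, uncacheable input per call.
    """
    per_token = price_per_1k / 1000
    fresh = fresh_tokens * calls * per_token
    if not cached:
        return prefix_tokens * calls * per_token + fresh
    # First call writes the prefix to the cache; later calls read it back.
    write = prefix_tokens * per_token * cache_write_mult
    reads = prefix_tokens * (calls - 1) * per_token * cache_read_mult
    return write + reads + fresh


# Hypothetical agent run: 6,000-token shared prefix, 8 loop iterations.
uncached = input_cost(6_000, 500, calls=8, price_per_1k=0.003, cached=False)
cached = input_cost(6_000, 500, calls=8, price_per_1k=0.003)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

The longer the shared prefix and the more iterations per run, the larger the share of input cost the cache absorbs; a single-call workload with a short prompt sees little benefit, or even a small penalty from the write.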

What I'd defer

Three things to be honest about, even now that AgentCore is GA:

We rode the preview to get here, and that was a real bet. GA landed October 13, but we'd already been on the preview APIs in production since August because we believed the architecture. That paid off — but if you're starting now, you start on GA, and you don't have to make the bet we did. The honest read: GA closed most of the gaps that made the preview a gamble.

Multi-region failover is still mostly on you. GA added VPC, PrivateLink, and CloudFormation support, and the runtime now spans nine regions — but cross-region active-active for your memory and gateway state isn't a turnkey feature. If your business demands it, you're still writing parts of this yourself.

Cost predictability under bursty workloads. AgentCore's per-service pricing means agent cost is now (model) + (memory ops) + (gateway/tool ops) + (runtime time). Modeling cost under spikes requires more care than it did with the monolithic Bedrock Agents pricing. GA didn't change the pricing shape; it just made it the shape you're committing to.
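The four-term cost shape can be written as a small model and fed real traces. A minimal sketch: every unit price below is a placeholder, not AgentCore's actual pricing, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Placeholder unit prices; replace with your own measured numbers."""
    memory_op: float       # $ per memory read or write
    gateway_call: float    # $ per tool invocation through the gateway
    runtime_second: float  # $ per second of runtime


@dataclass
class AgentRun:
    model_cost: float      # $ spent on model tokens for this run
    memory_ops: int
    gateway_calls: int
    runtime_seconds: float


def run_cost(run, prices):
    """(model) + (memory ops) + (gateway/tool ops) + (runtime time)."""
    return (run.model_cost
            + run.memory_ops * prices.memory_op
            + run.gateway_calls * prices.gateway_call
            + run.runtime_seconds * prices.runtime_second)


prices = UnitPrices(memory_op=0.0005, gateway_call=0.0001, runtime_second=0.00005)
typical = AgentRun(model_cost=0.18, memory_ops=12, gateway_calls=8, runtime_seconds=40)
bursty = AgentRun(model_cost=0.18, memory_ops=60, gateway_calls=45, runtime_seconds=300)

print(f"typical run: ${run_cost(typical, prices):.4f}")
print(f"bursty run:  ${run_cost(bursty, prices):.4f}")
```

Keeping the model term separate from the three infrastructure terms makes it visible when a spike is driven by tool and memory chatter, not by tokens.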

When I'd reach for the alternatives

Honest assessments of where the others win:

LangGraph + your own infra. Win: framework portability and community velocity. The LangGraph community ships patterns faster than AWS does. Win: multi-cloud is real (we run a backup of one agent on Azure OpenAI for compliance reasons; LangGraph handles this without complaint). Loss: more infra you maintain.

CrewAI. Win: best abstractions I've seen for multi-agent role-based orchestration. "The researcher agent talks to the writer agent talks to the editor agent" is a paragraph of CrewAI code. Loss: smaller community, less production-tested.

DSPy. Win: most expressive way to compose LM programs. Lets you optimize prompts via compilation, not hand-tuning. Loss: still feels research-y. The teams I know running it in production are the teams with research-shaped engineers.

AutoGen / Semantic Kernel. Win: if you're a Microsoft shop, the Azure OpenAI + Semantic Kernel + M365 integration is genuinely tighter than AWS's equivalents. Loss: if you're not a Microsoft shop, you're swimming upstream.

What I'm watching for next month

re:Invent 2025 is December 1-5, and with AgentCore already GA, the announcements I'm watching for are the layer above the runtime. Three things:

A managed eval story specific to agents. Bedrock Model Evaluation exists; it's not yet agent-aware — it grades a model on a prompt, not an agent on a multi-step task. The teams I talk to are all writing their own agent-specific eval setups (mine on top of the open-source eval kit). With the runtime now GA, this is the obvious next gap for AWS to close.
AgentCore feature depth post-GA. GA shipped the enterprise table-stakes — VPC, CloudFormation, A2A, MCP-server gateway connections. What's missing is the operational depth: turnkey cross-region state, finer cost controls, richer Memory strategies. re:Invent is where I'd expect the first post-GA wave.
Better cross-cloud / cross-model abstraction. Strands is half this story; AgentCore is half. The third half — true vendor-agnostic agent observability and lineage — is still missing.

The honest framing

The most important framing I've landed on this year: the framework choice matters less than people make it.

A good agent system is mostly: a clear task definition, a good eval set, a tight rubric, well-scoped tools, and a model that's right-sized for the task. Strands vs LangGraph vs CrewAI is a coordination-cost decision, not a quality decision. The teams that ship great agents are the teams with great evals. The framework is downstream.

That said: ride the maintenance curve. If you're an AWS team, having AWS maintain the framework and the runtime saves you a person. Worth it.

re:Invent in a week. I'll write a follow-up if anything significant lands in the agent space. For now: pick the framework that matches your team's experience, not the one with the loudest blog posts, and go ship the eval set first.

Part 13 of 14

Mar 26, 2026

Agent-based DevOps with Q Developer — kept vs tossed

Eight months running Amazon Q Developer agents in our engineering org. The four agentic workflows that earned their keep, the two we shut off, and the metric that made the case to keep going.

We've been running Amazon Q Developer agents across our engineering org for about eight months. Started cautiously, expanded carefully, and pruned hard. What's stayed, what's gone, and the metric that made the case to leadership to keep going.

What we kept

1. Dependency-upgrade agent

The biggest win. Q Developer reads our package.json / pom.xml / Cargo.toml on a schedule, identifies dependencies behind the current safe version, opens PRs one at a time, runs the test suite, and labels the PR by risk.

Three categories of upgrade get auto-merged after CI passes:

Patch versions with no breaking-change notes
Minor versions of dev dependencies (linters, type checkers, formatters)
Security-only updates on dependencies flagged by Dependabot

Everything else opens a PR for human review.
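The three auto-merge categories reduce to a small policy function. This is a sketch of the rule as described above, not how Q Developer implements it; the `Upgrade` fields are hypothetical names for facts the pipeline would have to supply.

```python
from dataclasses import dataclass


@dataclass
class Upgrade:
    bump: str                 # "patch", "minor", or "major"
    is_dev_dependency: bool   # linters, type checkers, formatters
    has_breaking_notes: bool  # release notes mention a breaking change
    security_only: bool       # flagged as a security-only update
    ci_passed: bool


def can_auto_merge(upgrade):
    """Return True only for the three categories that skip human review."""
    if not upgrade.ci_passed:
        return False  # nothing merges on a red build
    if upgrade.security_only:
        return True
    if upgrade.bump == "patch" and not upgrade.has_breaking_notes:
        return True
    if upgrade.bump == "minor" and upgrade.is_dev_dependency:
        return True
    return False  # everything else opens a PR for a human


patch = Upgrade("patch", False, False, False, ci_passed=True)
minor_runtime = Upgrade("minor", False, False, False, ci_passed=True)
print(can_auto_merge(patch), can_auto_merge(minor_runtime))
```

Defaulting to `False` keeps the policy fail-closed: a category nobody thought about lands in front of a reviewer instead of in main.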

Why it stuck: dependency upgrades are toil. They're not interesting. They're a never-ending background workload that gets deferred until something breaks. An agent that handles 80% of them automatically and surfaces the 20% that need judgment is exactly the right shape for this kind of work.

Number on it: we landed 1,200 dependency-upgrade PRs in 2025 across our repos. Of those, ~950 were auto-merged. The engineering time recovered is roughly half an engineer-year. Real money.

2. PR review agent (advisory, not blocking)

Every PR gets an automated review pass from Q Developer. It comments on:

Tests it thinks should exist but don't
Edge cases it suspects aren't handled
Patterns inconsistent with the rest of the codebase

The agent's comments are advisory — never blocking. A human reviewer is still required to approve. The reviewer can take or ignore the agent's suggestions.

Why it stuck: it doesn't replace review, it prompts better review. We measured this: human reviewers leave 30% more substantive comments on PRs that the agent has commented on first. The agent isn't smarter than the reviewer; it just primes the reviewer to look harder.

Important nuance: the agent's comments are visibly attributed to the agent. We tried hiding the attribution for a month to see if reviewers would treat them as peer comments. They didn't — they trusted them less. Attribution is a feature, not a bug.

3. Documentation-drift agent

Every time a public API changes in a service, the agent checks whether the corresponding docs were updated in the same PR. If not, it opens a follow-up PR adding the doc changes. The author can edit or close the suggestion.

Why it stuck: doc drift is the single biggest support-ticket source we have. The agent catches about 70% of API changes that would have shipped without doc updates. The other 30% it misses are typically renames or refactors where the doc update is ambiguous.

4. Incident-investigation agent

When PagerDuty fires, the agent automatically pulls together a brief: recent deploys, recent flag flips, related alerts in the last 24 hours, log samples from the affected service, and links to similar incidents from the runbook. Posts it as the first comment in the incident channel.

Why it stuck: the first 10 minutes of an incident are exactly the same 10 minutes every time — somebody pulling up the deploy log, somebody else searching logs, somebody else looking for prior runs of the same incident. Having an agent assemble the brief while humans are still typing "/oncall" buys back those 10 minutes.

Important caveat: the agent does not take action. No flag flips, no deploys, no anything that mutates state. It assembles the brief; humans decide.

That caveat isn't specific to the incident agent — it's the line that runs under all four kept workflows. Every one of them assembles, suggests, or drafts; not one of them mutates state on its own. The human is always on the trigger.

What we tossed

1. Test-generation agent

Tried for three months. The agent would propose new tests for code that lacked coverage. The tests were syntactically valid, would pass against the current implementation, and were almost universally useless — they tested the implementation rather than the contract, and changed every time the code changed.

We shut it off. Test coverage that doesn't catch real regressions is worse than no coverage; it gives false confidence.

What might work: an agent that generates tests from a specification document (input/output examples, behavioral contracts), not from the existing code. We haven't built it yet.

2. Auto-fix-the-lint agent

Tried for a month. The agent would auto-fix lint and style violations. Sounded great. In practice, it kept "fixing" things in ways that broke contextual decisions an engineer had made deliberately (a // eslint-disable-next-line that was load-bearing, a formatting choice that aligned with a generated file).

We shut it off and went back to humans plus pre-commit. The lesson: the cost of "an agent fixing the wrong thing" is much higher than the cost of "an engineer running npm run lint:fix themselves."

The pattern across the divide is clear once you see it laid out: the agents that stuck were the boring, advisory ones that surfaced judgment rather than substituting for it. The ones we shut off were the impressive ones that quietly produced wrong work a human then had to find and undo.

The metric that kept the program alive

About four months in, our VP of Engineering started asking the right question: is this saving time or generating noise? I'd been measuring agent activity (PRs opened, comments posted) which is the wrong metric — it measures the agent, not the impact.

We switched to engineering-hours-recovered, calculated as:

Hours we'd have spent on the work the agent automated (dependency-upgrade time, PR-priming time, incident-assembly time)
Minus hours we spent reviewing agent output that was wrong or unhelpful
Minus hours we spent maintaining the agent integrations

For the four kept agents, the answer was roughly 0.7 FTE of recovered engineering time per quarter across our 24-person engineering org. For the two tossed agents, the answer was negative — we spent more time reviewing wrong agent output than the agent saved.
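The metric is a three-term subtraction, which is part of why it survives an argument. A minimal sketch: the hour figures below are invented for illustration (they are not the team's data), and the 480 hours per FTE-quarter is an assumption of 12 working weeks at 40 hours.

```python
HOURS_PER_FTE_QUARTER = 480  # assumption: 12 weeks x 40 hours


def hours_recovered(hours_automated, hours_reviewing_bad_output, hours_maintaining):
    """Net engineering hours recovered by an agent over a period.

    hours_automated:            time the automated work would have taken humans
    hours_reviewing_bad_output: time spent on output that was wrong or unhelpful
    hours_maintaining:          time spent keeping the integration running
    """
    return hours_automated - hours_reviewing_bad_output - hours_maintaining


def as_fte(hours):
    """Convert net hours in a quarter to a fraction of one engineer."""
    return hours / HOURS_PER_FTE_QUARTER


# Invented quarterly figures, for illustration only.
kept = hours_recovered(420, 50, 34)
tossed = hours_recovered(40, 70, 15)
print(f"kept agents:   {kept} h  ({as_fte(kept):.2f} FTE)")
print(f"tossed agents: {tossed} h ({as_fte(tossed):.2f} FTE)")
```

A negative result is the shut-off signal: the agent is costing more review and upkeep time than the work it automates.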

That metric made the case to leadership. It also gave us the discipline to shut off the tossed ones without arguing about it. The math is the math.

What I'm watching for next

Three near-term things on the watch list:

Agent-led migrations. AWS Transform is the obvious play here for re-platforming workloads. We have a Java 11 → Java 21 migration coming up. The pilot starts next month.
Cross-repo refactors. "Rename this concept across 40 repos consistently." Q Developer can technically do this; we haven't trusted it yet. Watching for stories from other teams.
Compliance evidence collection. We have a SOC 2 audit every year. The evidence-collection part of it is the most agent-shaped work in the world — go to N places, pull M things, format them into the auditor's template. Not yet productized, but it should be.

The thing I'd tell another VP

If you're considering rolling out agentic DevOps in your org:

Start with one workflow. Pick the boring one — dependency upgrades, doc-drift, incident-brief assembly. Avoid the impressive ones (test gen, refactors) until you've built the eval discipline to know when the agent is wrong.

Insist on a metric. Agent-activity counts will fool you. Recovered-engineering-time will not.

Be willing to shut things off. The hardest part of running an agent program isn't standing things up; it's accepting that some of them aren't working and turning them off without feeling like you've failed.

An agent that's wrong 5% of the time is not a 5% problem. It's a trust problem with a percentage attached. The agent has to be right enough that engineers don't develop the habit of double-checking everything — because that habit eats the savings.

Get to "trust the agent" or shut it off. The middle is the most expensive place to be.

I'll write a year-in review in November.

Part 14 of 14

Apr 14, 2026

Roadmap reviews when half the work is non-deterministic

The four columns I steal from my therapist for running roadmap reviews when half the engineering work is non-deterministic.

Half the engineering work on an AI product is non-deterministic. The model layer underneath your feature changes on its own schedule. Your roadmap can't pretend it's a Gantt chart and survive.

Here's a template that does. I borrowed it, embarrassingly, from a therapist.

A clean Gantt bar above a wobbly line — same start, same ship date, but the AI feature's path wiggles between them

Why the Gantt chart breaks here

A Gantt bar assumes the work is a straight line between two dates. That's a fine model for "rewrite the billing service" and a terrible model for "make the support copilot actually helpful." Both features sit in the same calendar week. Only one of them has a deterministic floor under it.

When the model layer can move, your roadmap has to make room for the wobble. Re-evaling. Re-prompting. Discovering the thing your CEO loved last sprint is the thing tanking the eval this sprint. A Gantt bar painted over that wobble is just a lie with corporate fonts on it.

What survives is a meeting and a written artifact that names the wobble. That's the whole trick.

The four columns

I sat in a therapist's office a few years back and watched her draw four columns on a yellow legal pad. CBT thought records. What happened. What I thought. What's actually true. What I'll try next. I copied the page on my phone. Two weeks later I ran my standing roadmap review off it.

It worked. Then I did it again. It kept working.

The four columns — Observed, Assumed, Evidence, Committing to — with example rows from a real roadmap review

I stole four columns from my therapist; my engineers stopped sighing.

Translated for a PM running an AI feature, the columns become:

01 · Observed

What actually happened last sprint, in numbers and incidents, not vibes. Eval scores. p95 latency. Support tickets with quotes. Regression flags. Where did the line go. This is the most boring column and the most important one. No editorializing.

02 · Assumed

What we believed when we planned this sprint. The hypotheses, written down honestly. We thought gpt-4o-mini was a drop-in for gpt-4o here. We thought users liked the new date format. Most teams skip this column. That's why their roadmap reviews feel like blame sessions — they're comparing today's evidence to yesterday's vibes, not yesterday's stated bets.

03 · Evidence

What's actually true now, given column 1 minus column 2. This is the reframe. We were wrong about 4o-mini — the eval shows a 9% format-axis drop. We were wrong about users liking the new format — the tickets are from power users, not new ones. This is where the model layer's wobble gets named, out loud, on the artifact.

04 · Committing to

What we're going to do this week. Small. Testable. Reviewable next Thursday. Revert the prompt on prod. Add the failing examples to the eval set. Ship the format fix behind a flag. Crucially — these are not Gantt bars. They're bets, with a re-review date.

How to actually run the meeting

The whole review runs just over ninety minutes: six features at fifteen minutes each, plus a five-minute buffer at the end for the thing that always blows up.

For each feature, two people in the room get screen time:

The PM reads column 1. Numbers and incidents only. No story yet.
The tech lead reads column 2. What we believed; no defending it.
The two of them write column 3 together, out loud, in the doc. The room can chip in but doesn't drive.
The PM closes with column 4. Three to five lines. A re-review date.

The artifact is the doc, not the slide. Slides are for boards; docs are for the team. If you have to send a slide upstream, the slide is just column 4 with one sentence of context — never the whole record. The record is for honesty, not for sharing.

What changes when you do this

Three things, in order:

Sighs stop, mostly. When the engineers know assumption-naming is a step in the process, the meeting stops being a vibes-court and starts being a notebook entry.
The roadmap shortens. Column 4 won't fit nine items if you're being honest about how much each bet costs to re-review. Two to five per feature is the natural shape.
You catch model drift earlier. Column 2 is the diary. When 4o-mini was a drop-in for 4o six weeks ago and it suddenly isn't, the only place that fact gets written down is here, the second time you fill in the columns.

The disclaimers

This is not a project plan. It is not a stakeholder doc. It is not a substitute for a quarterly. It is a standing weekly artifact that lives next to whatever quarterly mechanism your org already has, and it gives that mechanism real data to feed on.

It also doesn't replace the therapist. You probably still need that one.

But on Wednesdays at 2pm, when the model layer has wobbled again and the eval is six points off where you wanted it, four columns and ninety minutes is the closest thing to a roadmap that an AI product actually has.
