The job description flipped and most of us didn't notice.

A year ago, we wrote code and occasionally reviewed it. Now we review code and occasionally write it. The models write the first draft. The craft is in what comes after: reading the diff, making the call, and knowing when the machine got it subtly wrong in ways that won't surface until production.

This post is for engineers who ship with AI tools daily: the people running `Claude Code`, Copilot, Cursor, or similar, and the leads responsible for what those tools produce. If you're still debating whether AI-assisted coding is real, this isn't for you. If you're past that and trying to figure out the quality problem, read on.

"LLM-as-Judge" Is a Ritual, Not a Review

Asking Claude "is this code good?" and getting back "looks great!" is a ceremony. It feels like review. It isn't.

Real review is structured. Each of these is a distinct question, and each benefits from a different approach:

  • Intent alignment: Does the code do what the plan, ticket, or spec said it would? Not "does it run," but "does it solve the right problem?"
  • Redundancy: Is there dead code, duplication, or a utility in your codebase that already does this? AI models don't know your repo's helpers exist unless you tell them.
  • Boundary security: Are inputs validated at the edges? Are auth checks present? Are SQL queries parameterized, or did the model reach for string interpolation because that's what the training data looked like?
  • Simplicity: Is this the simplest implementation that works, or did the model over-engineer a three-layer abstraction for a function that formats a date?

Vibe-checking collapses all four into a single pass. Structured review runs them independently. The difference shows up in the defect rate.
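
A minimal sketch of what running those four checks as separate passes could look like. The prompts, the `call_model` callable, and the wiring are illustrative assumptions, not any specific tool's API:

```python
# Illustrative sketch: one focused prompt per review concern, run independently.
# `call_model` is a stand-in for whatever client your stack uses.

REVIEW_PASSES = {
    "intent":     "Does this diff implement what the linked ticket or plan describes? "
                  "List anything it misses or reinterprets.",
    "redundancy": "Does this diff duplicate logic that already exists in the listed "
                  "repo utilities? Name the existing helper if so.",
    "boundaries": "Check only the edges: are inputs validated, auth checks present, "
                  "and SQL queries parameterized?",
    "simplicity": "Is there a simpler implementation that preserves behavior? "
                  "Point at any abstraction with a single call site.",
}

def review_diff(diff: str, context: str, call_model) -> dict[str, str]:
    """Run each concern as its own pass and return findings keyed by concern."""
    findings = {}
    for name, question in REVIEW_PASSES.items():
        prompt = f"{question}\n\nContext:\n{context}\n\nDiff:\n{diff}"
        findings[name] = call_model(prompt)  # one question, one pass
    return findings
```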

Model Tiers: Cheap for Mechanical, Expensive for Judgment

Not every review question requires the same model. Running Opus on a duplication scan is like hiring a principal engineer to check indentation.

A practical tiering:

  • Haiku-class: linting, dead code detection, import cleanup, naming consistency, log-level checks. Mechanical, pattern-matchable, high-volume. Run these on every diff, always.
  • Sonnet-class: logic flow validation, test coverage gaps, error handling completeness, API contract conformance. Requires understanding intent but not deep architectural context.
  • Opus-class: architectural coherence, cross-service implications, data model evolution, concurrency correctness, "should this abstraction exist at all" decisions. Reserve this for PRs that change interfaces or introduce new patterns.

The key constraint: every tier must be overridable. The engineer reviewing the output needs to be able to escalate a Haiku-flagged issue to Opus, or dismiss an Opus concern as irrelevant. Automation without override is just a different kind of tech debt.
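
A sketch of the routing with the override built in; the check names and model identifiers are placeholders for whatever your provider and pipeline actually use:

```python
# Sketch: map review checks to model tiers, with an explicit engineer override.
# Model identifiers are placeholders.

TIER_MODELS = {"haiku": "small-model", "sonnet": "mid-model", "opus": "large-model"}

DEFAULT_TIERS = {
    "lint": "haiku", "dead_code": "haiku", "naming": "haiku",
    "logic_flow": "sonnet", "test_gaps": "sonnet", "error_handling": "sonnet",
    "architecture": "opus", "data_model": "opus", "concurrency": "opus",
}

def pick_model(check: str, overrides: dict | None = None) -> str | None:
    """Return the model for a check; an override can escalate, downgrade, or dismiss (None)."""
    overrides = overrides or {}
    if check in overrides:
        tier = overrides[check]  # the engineer's call always wins
        return None if tier is None else TIER_MODELS[tier]
    return TIER_MODELS[DEFAULT_TIERS[check]]

# Escalate a flagged naming issue to Opus; dismiss an architecture concern entirely.
pick_model("naming", overrides={"naming": "opus", "architecture": None})
```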

Review Architecture

Tiering models by review type

| Tier | What it checks | When to run it |
| --- | --- | --- |
| Haiku-class | Linting, dead code, import cleanup, naming consistency, log-level checks | Every diff, always. Mechanical, pattern-matchable, high-volume. |
| Sonnet-class | Logic flow validation, test coverage gaps, error handling, API contract conformance | Needs intent understanding but not deep architectural context. |
| Opus-class | Architectural coherence, cross-service implications, data model evolution, concurrency correctness | PRs that change interfaces or introduce new patterns. Pair with human review. |

Every tier must be overridable. Automation without override is a different kind of tech debt.

Guards Beat Guidelines

A `CLAUDE.md` that says "don't commit secrets" is a polite request. A pre-tool hook that regex-scans every file write for API key patterns and blocks the call before it happens is a contract.

This distinction matters more than most teams realize:

  • Secrets in code: Deterministic. Scan for high-entropy strings, known key prefixes (`sk-`, `AKIA`, `ghp_`), `.env` patterns. Block at write time, not at review time.
  • PII leakage: Scan for SSN patterns, email addresses in test fixtures that look real, phone numbers in comments. Flag or block depending on context.
  • Context window bombs: A 2,000-line log dump pasted into a conversation eats your context budget and degrades every subsequent response. Detect large pastes and truncate or summarize them before they hit the model.
  • Dependency mutations: If a model adds a new package to `package.json` or `requirements.txt`, that's a security surface change. Flag it separately from the code diff.

The principle: if you can detect it deterministically, don't delegate it to the model. Guards run at zero inference cost, with zero false-negative risk from context window pressure. They're not suggestions, they're constraints.
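
A rough sketch of the write-time secret guard, assuming a hook that sees file contents before the write lands. The key prefixes match the ones listed above; the entropy threshold is an arbitrary illustrative cutoff, not a tuned value:

```python
import math
import re

# Deterministic guard: reject a file write if its content looks like it contains a secret.
KEY_PREFIXES = re.compile(r"\b(sk-[A-Za-z0-9]{16,}|AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36})\b")
CANDIDATE_TOKENS = re.compile(r"['\"]([A-Za-z0-9+/_=-]{32,})['\"]")

def shannon_entropy(s: str) -> float:
    """Bits per character; high values suggest key material rather than prose."""
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

def should_block_write(content: str) -> bool:
    """Return True if the write should be blocked before it happens."""
    if KEY_PREFIXES.search(content):
        return True
    # Quoted high-entropy strings are the fallback signal; 4.0 bits/char is an example threshold.
    return any(shannon_entropy(tok) > 4.0 for tok in CANDIDATE_TOKENS.findall(content))
```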

How to Review AI-Generated Code (A Triage Order)

When you open a diff that a model wrote, your instinct is to read top to bottom. Resist it. AI-generated code has a specific failure signature, and the triage order should reflect that.

Start with what it imported and what it installed.

Models reach for dependencies reflexively. Check if a new package was added. Check if an existing internal utility does the same thing. This is the highest-leverage line of review. A bad dependency outlives the PR.
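
A small sketch of that first step, assuming the change arrives as unified-diff text; the manifest names are the two mentioned in this post, so extend the tuple for your own ecosystem:

```python
# Sketch: surface newly added dependency lines before reading any other code.
MANIFESTS = ("package.json", "requirements.txt")

def new_dependencies(diff_text: str) -> list[str]:
    """Return lines added to dependency manifests so the reviewer sees them first."""
    added, in_manifest = [], False
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Entering a new file's hunk; note whether it is a dependency manifest.
            in_manifest = any(line.endswith(m) for m in MANIFESTS)
        elif in_manifest and line.startswith("+"):
            added.append(line[1:].strip())
    return added
```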

Read the boundaries, not the middle.

Look at function signatures, API route definitions, database queries, and external service calls first. The middle of a function — the transformation logic — is usually where models perform best. The edges — validation, auth, error handling, serialization — are where they cut corners.
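
If you want the edges surfaced mechanically, a crude heuristic works: flag changed lines that match boundary patterns (routes, raw SQL, outbound calls, signatures) and read those hunks first. The patterns below are examples, not a complete list:

```python
import re

# Heuristic sketch: pick out added lines that touch boundaries so they get read first.
BOUNDARY_PATTERNS = [re.compile(p) for p in (
    r"@app\.(get|post|put|delete)",        # route definitions (framework-specific example)
    r"\b(SELECT|INSERT|UPDATE|DELETE)\b",  # raw SQL
    r"requests\.(get|post)|fetch\(",       # outbound service calls
    r"^\+\s*def \w+\(",                    # new or changed function signatures
)]

def boundary_lines(diff_text: str) -> list[str]:
    """Return added diff lines that look like edges rather than middles."""
    return [line for line in diff_text.splitlines()
            if line.startswith("+") and any(p.search(line) for p in BOUNDARY_PATTERNS)]
```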

Look for the code that shouldn't exist.

Models are completionists. They generate exhaustive switch statements, redundant null checks, wrapper functions that add nothing, and abstractions for things that occur exactly once. Ask: "what would I delete if I wrote this by hand?"

Check the tests: not for coverage, but for intent.

AI-generated tests often have high coverage and low signal. They test that the code does what the code does, not that it does what the spec requires. Look for tests that encode business rules, not implementation details.

Run the mechanical checks last, or automate them entirely.

Formatting, naming conventions, import ordering: this is Haiku-tier work. Don't spend human attention on it. If your pipeline doesn't already catch it, add a pre-commit hook and move on.

The Architecture: Review as the Base Layer

The shift I keep building toward:

Iteration loops, not single-shot generation: The unit of work isn't "prompt → code → ship." It's "prompt → code → review → fix → review → ship." Each review pass has a defined scope. The first pass is mechanical (guards + Haiku). The second is structural (Sonnet). The third, if the diff warrants it, is architectural (Opus + human).
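
As a loop, that looks something like the sketch below; `run_guards`, `run_pass`, `apply_fixes`, and the escalation predicate are hypothetical hooks standing in for whatever your pipeline already provides:

```python
# Illustrative pipeline: prompt -> code -> review -> fix -> review -> ship.
# Every callable here is a hypothetical placeholder, not a real API.

def review_pipeline(diff, run_guards, run_pass, apply_fixes, needs_architecture_review):
    if not run_guards(diff):                                  # pass 1: guards + Haiku-tier mechanical checks
        return "blocked"
    diff = apply_fixes(diff, run_pass(diff, tier="haiku"))
    diff = apply_fixes(diff, run_pass(diff, tier="sonnet"))   # pass 2: structural review
    if needs_architecture_review(diff):                       # pass 3: only when the diff warrants it
        return "escalate: opus + human review"
    return "ship"
```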

Skills loaded on demand: A review agent doesn't need your entire codebase context in every conversation. It needs the relevant schema when reviewing a migration. It needs the API spec when reviewing a route handler. It needs your team's error-handling conventions when reviewing a service client. Load what's needed, discard what isn't. Context discipline is cost discipline.
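
A sketch of loading on demand; the change kinds and file paths are hypothetical stand-ins for your own schema, API spec, and conventions doc:

```python
from pathlib import Path

# Sketch: attach only the context a given review needs. Keys and paths are hypothetical.
CONTEXT_LOADERS = {
    "migration":      lambda: Path("db/schema.sql").read_text(),
    "route_handler":  lambda: Path("docs/api-spec.yaml").read_text(),
    "service_client": lambda: Path("docs/error-handling.md").read_text(),
}

def context_for(change_kind: str) -> str:
    """Return just the context relevant to this kind of change, and nothing else."""
    loader = CONTEXT_LOADERS.get(change_kind)
    return loader() if loader else ""
```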

Humans in the loop where judgment compounds: The models get cheaper every quarter. Judgment does not. The right question isn't "can the model do this?" but "does this decision have consequences that outlast the sprint?" If yes, a human reviews it. If no, the pipeline handles it.

The Meta-Point

The tools that win in this era will be the ones that catch mistakes before they land, structure review so nothing slips through the cracks, and keep engineers focused on the decisions that actually matter.

We have more code being written than ever. The bottleneck moved. It's in review now. Treat it accordingly.