
PART II

Should AI-Generated Medical Notes Get a Second Opinion? Benchmarking the Judge-LLM Pattern (April 2026)

A second LLM that reviews the first model's medical note against the source transcript improved overall quality from 7.8 to 8.9 / 10, but only when the judge prompt is correctly designed. Nixi AI validation study with full architecture comparison.

Mahsa Yarahmadi · CEO & Co-Founder · Nixi AI

Published 25 April 2026

Judge-LLM validation study · April 2026

We added a second LLM to our medical-note pipeline whose only job is to read the first model's output, compare it back to the original transcript, and fix what's wrong. We tested four variants of this "judge" architecture against the single-LLM baseline.

Why a judge LLM at all?

Single-LLM medical note generation has a structural problem: the model can't reliably catch its own mistakes.

When GPT-4.1 hallucinates "Uveitis anterior" because the patient mentioned non-specific eye symptoms, it doesn't know it hallucinated. When it confidently writes "no daily finger pain" because the patient said "not every day anymore" (which means less frequent, not zero), the model has decided this is its best interpretation and won't reconsider.

A second model, looking at the same transcript with no commitment to the first model's text, is far more likely to flag these errors. This is the classic "second pair of eyes" pattern from clinical practice itself.

The question isn't whether a judge helps. It's how to prompt it without making things worse.

Methodology

Test setup

  • 3 baseline notes generated by GPT-4.1 (Nixi's previous production model) on three real consultations: one moderate (Duloxetin tapering), one complex (Leflunomid flare with refusal documentation), one simple (wrist pain / osteopenia).
  • These notes were scored manually against the same 10-criterion physician-feedback rubric used in our model-selection benchmark to set a baseline.
  • Each note was then sent to the judge LLM with a defined prompt; the corrected note was re-scored.
  • All notes were generated and corrected via Azure OpenAI / Azure AI Foundry. We tested both Claude Sonnet 4.5 and GPT-4.1 as judges.
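
In procedural form, the study loop is simple. A minimal sketch, where generate_note, judge_note, and score_note are hypothetical placeholders for the real model calls and the manual rubric scoring:

```python
# Sketch of the study procedure: generate a baseline note, score it,
# run the judge, re-score. The helpers are placeholders; scoring was
# done manually against the 10-criterion rubric described below.

def generate_note(transcript: str) -> str: ...        # GPT-4.1 generator call
def judge_note(transcript: str, note: str) -> str: ...   # judge LLM call
def score_note(note: str, transcript: str) -> float: ... # manual rubric score

cases = {
    "duloxetin_tapering": "...",  # moderate consultation transcript
    "leflunomid_flare": "...",    # complex, with refusal documentation
    "wrist_osteopenia": "...",    # simple
}

results = {}
for name, transcript in cases.items():
    baseline = generate_note(transcript)
    corrected = judge_note(transcript, baseline)
    results[name] = {
        "baseline": score_note(baseline, transcript),
        "judged": score_note(corrected, transcript),
    }
```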

Evaluation criteria

10-dimension rubric (same as the model-comparison study)

Dimension · Weight
Factual completeness · 20%
Hallucination control · 15%
Therapy status accuracy · 15%
Template-instruction compliance · 15%
Medical terminology precision · 10%
Language style & register · 10%
Section placement · 5%
Three supporting criteria (combined) · 10%

Each criterion scored 1–10 per note, weighted into a 0–10 overall. Verdict thresholds: PASS ≥ 8.5 · NEEDS_REVIEW ≥ 6.0 · FAIL < 6.0.
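
As arithmetic, the weighting and verdict mapping are straightforward. A minimal sketch, using the weights from the table above; the names are ours for illustration, not Nixi's evaluation code:

```python
# Weighted rubric scoring as described above. Weights come from the
# rubric table; function and variable names are illustrative only.

WEIGHTS = {
    "factual_completeness": 0.20,
    "hallucination_control": 0.15,
    "therapy_status_accuracy": 0.15,
    "template_compliance": 0.15,
    "terminology_precision": 0.10,
    "language_style": 0.10,
    "section_placement": 0.05,
    "supporting_criteria": 0.10,  # three criteria, combined
}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted 0-10 overall from per-criterion 1-10 scores."""
    return sum(WEIGHTS[name] * score for name, score in criterion_scores.items())

def verdict(score: float) -> str:
    """Map the overall score onto the PASS / NEEDS_REVIEW / FAIL thresholds."""
    if score >= 8.5:
        return "PASS"
    if score >= 6.0:
        return "NEEDS_REVIEW"
    return "FAIL"
```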

Architectures tested

1. Single LLM (production baseline before this study)

Transcript → Generator (GPT-4.1) → Final Note

One LLM call. No second pair of eyes. Hallucinations go uncaught.

2. Two-LLM with weak judge prompt

Transcript → Generator → Note → Judge ("check for errors") → Corrected Note

The judge gets a generic instruction: "Check the note against the transcript for hallucinated facts, dropped facts, wrong medications, and inaccurate diagnoses. If you find errors, output a corrected version." No template, no terminology dictionary, no specific evaluation criteria.

3. Two-LLM with strong judge prompt (production candidate)

Transcript → Generator → Note → Judge (5-step rubric) → Corrected Note

The judge gets:

  • A 5-step quality framework: Hallucination check → Completeness check → Section assignment → Template compliance → Terminology check
  • The full template instructions (so it knows what correct looks like)
  • The terminology dictionary (so it knows the right form for each drug)
  • Explicit "do not change" rules — therapy statuses, dosages, section assignments, JSON structure, original [unklar] flags
  • A "minimum-changes" constraint so it doesn't rewrite for stylistic preference

4. Three-LLM extractor → structurer → judge

Transcript → Extractor (SOAP) → Structurer (template format) → Judge → Final Note

In theory a separation-of-concerns approach: one model just extracts the facts into a generic SOAP note, the next formats them into the user's template, and the last checks the result.

Results

Quality scores per architecture

Overall quality — average across 3 consultations
Configuration · Score (0–10)
Two-LLM, strong judge, Claude Sonnet 4.5 (production candidate · ~22 s end-to-end) · 8.95
Two-LLM, strong judge, GPT-4.1 (~14 s end-to-end) · 8.40
Two-LLM, weak judge prompt (~13 s end-to-end) · 7.95
Three-LLM, Extractor → Structurer → Judge (~32 s end-to-end) · 7.90
Single LLM, GPT-4.1 baseline (~10.8 s end-to-end) · 7.80

The strong-prompt judge using Claude Sonnet 4.5 is our production recommendation for the optional Deep Review mode.

Hallucination control + section placement — strong-judge gains
Configuration · Criterion · Score (0–10)
Single LLM · Hallucination control · 6.0
Two-LLM, strong judge · Hallucination control · 9.0
Single LLM · Section placement · 7.0
Two-LLM, strong judge · Section placement · 9.0

The strong judge nearly closes the gap to a 'no errors' score on the two highest-stakes criteria for a clinical note.

End-to-end latency — added cost of the second LLM
Configuration · Seconds
Single LLM (GPT-4.1) · 10.8 s
Two-LLM, weak judge · 13.0 s
Two-LLM, strong judge (GPT-4.1) · 14.0 s
Two-LLM, strong judge (Claude Sonnet 4.5) · 22.0 s
Three-LLM (extractor → structurer → judge) · 32.0 s

The judge adds 6–12 s end-to-end. Three-LLM pipelines triple latency without a quality benefit.

What the strong-prompt judge actually fixed

The judge's value shows most clearly by contrast: the real examples below are cases where the weak-prompt judge changed the note incorrectly and where the strong-prompt rules prevent the same mistake.

Where the weak-prompt judge failed

Two failure modes appeared with the weak-prompt judge that the strong-prompt version corrects:

  1. Confident rewriting of garbled STT. The weak judge saw "2 Wochen im Krankenschwester" (a garbled phrase) and confidently rewrote it as "Zwei Wochen Arbeitsunfähigkeit" — adding a fact (sick leave) that wasn't established. The strong judge has an explicit rule: if STT is unclear, flag [unklar], don't guess.

  2. Upgrading uncertainty to certainty. The weak judge took the physician's "kann gut sein" (could be) and rewrote it as "Es besteht der Verdacht auf" (formal clinical suspicion). The strong judge has an explicit rule: preserve the physician's level of certainty.

These are the same failure modes the generator has. Without explicit constraints, an LLM judge defaults to the same generative behavior — making things sound more clinical and certain than they actually are.

Why three-LLM did worse than two-LLM

The Extractor → Structurer → Judge architecture failed for one structural reason: the Structurer never sees the original transcript.

In our complex Leflunomid test case, the Extractor read the STT-garbled drug name "Level mit" and produced a SOAP note that referenced "Methotrexat" — a different drug entirely (the patient was on Leflunomid). The Structurer then faithfully wrote the note using "Methotrexat" because that's what the SOAP note said. The Judge tried to verify against the transcript but the entire note was already built around the wrong drug.

The two-LLM design avoids this because both the Generator and the Judge see the original transcript. Either model has a chance to catch the STT error. With three LLMs, only the first and last see the source, and an error in the middle propagates.

This is a generalizable lesson: adding more LLMs only helps if they each retain access to the source of truth.
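
The contrast shows up directly in the data flow. A sketch of the three-LLM variant, using the same hypothetical call_llm as the earlier pipeline sketch (model names are placeholders):

```python
# Three-LLM data flow: the Structurer never receives the transcript,
# so an Extractor error ("Level mit" -> "Methotrexat") propagates.

def three_llm_pipeline(transcript: str, template: str) -> str:
    soap = call_llm(
        model="extractor",
        system="Extract the consultation facts into a generic SOAP note.",
        user=transcript,
    )
    # Structural flaw: only the SOAP note is passed on, not the transcript.
    note = call_llm(
        model="structurer",
        system=f"Format this SOAP note into the template:\n{template}",
        user=soap,
    )
    # The judge sees the transcript again, but by now the whole note is
    # built around the Extractor's error; fixing it means a full rewrite.
    return call_llm(
        model="judge",
        system="Review the note against the transcript (5-step rubric).",
        user=f"TRANSCRIPT:\n{transcript}\n\nNOTE TO REVIEW:\n{note}",
    )
```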

What works in a strong judge prompt

After multiple iterations we landed on these design rules. Each one is the result of a specific failure we observed:

Five-step rubric in fixed order

  1. Hallucination check — every fact must trace to the transcript (with explicit allowance for STT correction not being hallucination)
  2. Completeness check — every fact in the transcript must appear in the note
  3. Section assignment — facts in the right template section based on the section's instruction text
  4. Template compliance — formatting, prose vs bullets, "Beginne mit:" handled correctly
  5. Terminology check — exact spellings and capitalizations from the dictionary

Explicit "do not change" constraints

Without these, the judge over-edits:

  • Don't rephrase sentences that are already correct
  • Don't change therapy statuses, dosages, or section assignments
  • Don't resolve [unklar] flags from the original — preserve the generator's uncertainty
  • Make the minimum changes needed; every change must be justified by one of the five steps

Explicit STT-handling rule

If something in the transcript is garbled or unclear, flag with [unklar]. Do not confidently rewrite it. Only rewrite if you are more than 80% sure of the intended meaning.

Explicit certainty-preservation rule

If the physician expressed uncertainty ("vielleicht", "kann sein"), the note must preserve that uncertainty level. Do not upgrade to "Verdacht auf" (formal clinical suspicion).
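
Assembled, the pieces of a strong judge prompt fit together roughly like this. A condensed sketch in our own wording, not Nixi's production prompt:

```python
# Condensed sketch of the strong judge prompt's structure.
# Wording is illustrative, not the production prompt.

JUDGE_RULES = """\
Review the note against the transcript in this fixed order:
1. Hallucination check: every fact must trace to the transcript.
   Correcting an obvious STT error does not count as hallucination.
2. Completeness check: every fact in the transcript must appear in the note.
3. Section assignment: each fact goes in the section whose instruction
   text covers it.
4. Template compliance: formatting, prose vs. bullets, "Beginne mit:".
5. Terminology check: exact spellings and capitalizations from the dictionary.

Do NOT:
- rephrase sentences that are already correct
- change therapy statuses, dosages, or section assignments
- resolve [unklar] flags from the original note

If a transcript passage is garbled, flag it [unklar] instead of guessing;
rewrite only if you are more than 80% sure of the intended meaning.
Preserve the physician's level of certainty ("vielleicht", "kann sein");
never upgrade it to "Verdacht auf".

Make the minimum changes needed; every change must be justified by one
of the five steps above.
"""

def build_judge_system_prompt(template: str, terminology: str) -> str:
    # The judge also gets the full template instructions and the
    # terminology dictionary, so it knows what "correct" looks like.
    return f"{JUDGE_RULES}\nTEMPLATE:\n{template}\nTERMINOLOGY:\n{terminology}"
```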

Cost & latency considerations

The judge effectively doubles the per-note cost and adds 6–12 seconds of end-to-end latency. For a Nixi-scale practice (~50 notes / physician / day):

Per-physician monthly cost — judge vs. no judge
Mode · Per note · Per day · Per month
Single LLM · $0.05, ~15 s · $2.50, 12 min · ~$50
Two-LLM with judge · $0.10, ~25 s · $5.00, 21 min · ~$100

A judge increases cost by roughly $50 / physician / month. The expected reduction in physician edit time (saving 1–2 minutes per note from a more accurate first draft) more than pays for it.
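
The monthly figures in the table roll up from the per-note numbers. A back-of-envelope sketch, assuming ~20 working days per month (our assumption behind the monthly totals):

```python
# Back-of-envelope cost model behind the table above.
NOTES_PER_DAY = 50
WORKING_DAYS_PER_MONTH = 20   # assumption behind the ~$50 / ~$100 figures

def monthly_cost(cost_per_note: float) -> float:
    return cost_per_note * NOTES_PER_DAY * WORKING_DAYS_PER_MONTH

single_llm = monthly_cost(0.05)          # ~$50
with_judge = monthly_cost(0.10)          # ~$100
judge_premium = with_judge - single_llm  # ~$50 / physician / month

# Saving even 1 minute of physician editing per note is ~50 min/day of
# clinical time, which dwarfs the extra $2.50/day the judge costs.
```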

But we don't always need this — for routine simple notes, the single-LLM baseline is already good enough. The judge earns its keep on the complex consultations where hallucinations and dropped negatives are most damaging.

Limitations

  • 3 baseline cases — representative across complexity bands but not statistically large. We're extending the validation set to 12 cases across 4 specialties before publishing further scores.
  • All testing in German on Rheumatology cases. English-language deployment requires re-validation.
  • We tested Claude Sonnet 4.5 and GPT-4.1 as judges. Newer models (the GPT-5.x family) may close the gap further, but our companion model-selection study tested them only on note generation, not as judges.
  • The "minimum-changes" instruction reduces but doesn't eliminate over-correction. ~3% of judge edits in our sample changed something that was already correct.

Bottom line

Adding a properly prompted judge LLM to medical note generation is the largest single quality improvement we measured: bigger than any change we made to the generator prompt or to the underlying model.

But the prompt design matters as much as the architecture. A judge with a generic "find errors" prompt adds almost nothing (7.80 → 7.95 in our tests). A judge with an explicit step-by-step rubric, explicit constraints on what not to change, and explicit STT- and certainty-handling rules adds more than a full point of quality on a 10-point scale (7.80 → 8.95) and cuts the hallucination rate roughly in half.

For health-tech teams building clinical documentation: the second LLM is worth the cost, but only if you treat the judge prompt with the same rigor as the generator prompt. Generic "double-check this" instructions don't work. Specific, constraint-laden, transcript-grounded instructions do.


What this means for you

Translating findings into your practice.

If you're a solo practitioner

Less time on documentation means more capacity for the patients you already see. Start with Basic; the trial uses the same engine that powers the studies on this site.

If you work in a practice team

The findings here generalize across multi-clinician practices. Practice Pro adds shared templates, central admin, and the optional PVS-integration add-on for automatic sync.

If you make decisions for a clinic or MVZ

Standardized documentation, measurable time savings, and patients who welcome the technology: three KPIs your leadership wants to see. Enterprise includes direct HIS / KIS integration.

Built on

  • DSGVO & § 203 StGB compliant
  • Clinically reviewed before publication
  • EULAR-validated approach
