Multi-model benchmark · April 2026
We benchmarked 14 configurations of leading large language models across the three dimensions that matter for clinical documentation: medical fact extraction quality, hallucination rate, and end-to-end latency. We evaluated 60+ generated notes against a 10-criterion physician-feedback rubric on 6 representative German consultations spanning Cardiology and Rheumatology.
Why this study matters
Medical documentation is the highest-stakes use case for LLMs. A hallucinated medication, a dropped negative finding, or a misattributed therapy decision is not a quality issue — it is a patient-safety and medico-legal issue.
We needed to know which of today's frontier models can reliably:
- Extract every clinical fact from a physician–patient conversation
- Avoid inventing anything that wasn't said
- Decode speech-to-text errors (e.g., "Rinbok" → Rinvoq / Upadacitinib)
- Document treatment decisions correctly, including patient refusals and compromises
- Do all of the above fast enough that physicians actually use it
Methodology
Test corpus
- 6 German clinical transcripts drawn from anonymized real consultations
- Spanning complexity bands: 2 simple (469–501 words), 2 medium (639–757 words), 2 complex (1,141–2,204 words)
- Two specialties: Cardiology and Rheumatology
- Real STT garble preserved (e.g., "Mira" → Humira, "Rinbok" → Rinvoq, "Drittensulum" → Prednisolon) so we test the model's STT-correction ability under production conditions
Evaluation criteria (10 dimensions, weighted)
Derived from real physician feedback collected over 9 months at a German rheumatology practice (Immunologikum Hamburg, July 2025 – March 2026).
| Dimension | Weight |
|---|---|
| Factual completeness | 20% |
| Hallucination control | 10% |
| STT correction | 10% |
| Medical terminology precision | 10% |
| Therapy status accuracy | 15% |
| Section placement | 10% |
| Template-instruction compliance | 10% |
| Gender-neutral / impersonal style | 10% |
| Variable summary openings | 10% |
| Standardized activity terms | 5% |
Each criterion is scored 1–10 per note; the weighted average of the ten criteria gives an overall score out of 10. Verdict thresholds: PASS ≥8.5 · NEEDS_REVIEW ≥6.0 · FAIL <6.0.
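For concreteness, here is a minimal sketch of that arithmetic. The weights and verdict thresholds come from the rubric above; the dictionary keys and function names are illustrative shorthand, not part of our tooling.

```python
# Rubric weights from the table above (keys are shortened criterion names).
WEIGHTS = {
    "factual_completeness": 0.20,
    "hallucination_control": 0.10,
    "stt_correction": 0.10,
    "terminology_precision": 0.10,
    "therapy_status_accuracy": 0.15,
    "section_placement": 0.10,
    "template_compliance": 0.10,
    "impersonal_style": 0.10,
    "variable_openings": 0.10,
    "standardized_activity_terms": 0.05,
}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each 1-10), normalized by the
    total weight so the result stays on the same 0-10 scale."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * criterion_scores[c] for c in WEIGHTS) / total_weight

def verdict(score: float) -> str:
    """PASS >= 8.5, NEEDS_REVIEW >= 6.0, FAIL below that."""
    if score >= 8.5:
        return "PASS"
    if score >= 6.0:
        return "NEEDS_REVIEW"
    return "FAIL"
```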
Test design
Each model received the same system prompt, the same user template, and the same transcript. Each API call was issued in isolation so latency reflects a single round-trip — no batching, no aggregation effects. Calls were made directly to the provider's hosted endpoint (Azure OpenAI Responses API for the GPT-5 family; Azure AI Foundry for Mistral). Refusal events (where the model declined to generate) were retried up to twice; persistent refusals were recorded.
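A minimal sketch of that per-call flow, with a caller-supplied `call_model` standing in for the provider round-trip (the retry limit follows the design above; everything else, including the return shape, is illustrative):

```python
import time
from typing import Callable, Optional

MAX_RETRIES = 2  # refusals retried up to twice; persistent refusals recorded as such

def run_single_case(call_model: Callable[[str, str, str], Optional[str]],
                    system_prompt: str, user_template: str,
                    transcript: str) -> dict:
    """One isolated round-trip per transcript (same prompts for every model),
    so the measured latency reflects a single request/response cycle."""
    attempts = 0
    while True:
        start = time.perf_counter()
        # call_model wraps the hosted endpoint and returns None on a refusal.
        note = call_model(system_prompt, user_template, transcript)
        latency_s = time.perf_counter() - start
        if note is not None:
            return {"note": note, "latency_s": latency_s, "refusals": attempts}
        attempts += 1
        if attempts > MAX_RETRIES:
            return {"note": None, "latency_s": latency_s, "refusals": attempts}
```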
The models tested
| Model | Provider | Context window |
|---|---|---|
| GPT-5.1 | OpenAI | 200K tokens |
| GPT-5.4 | OpenAI | 200K tokens |
| GPT-5 | OpenAI | 200K tokens |
| GPT-5-mini | OpenAI | 200K tokens |
| GPT-5-nano | OpenAI | 200K tokens |
| GPT-4.1 (non-reasoning) | OpenAI | 1M tokens |
| Mistral Large 3 | Mistral AI | 131K tokens |
We tested several reasoning / verbosity configurations of the GPT-5 family because settings materially affect both quality and latency.
Results
Final ranking · quality score by configuration
| Configuration | Quality (0–10) | Latency | Reasoning tokens |
|---|---|---|---|
| GPT-5.1 (low reasoning, medium verbosity) | 8.77 | 23.7 s | ~1,300 |
| GPT-5.4 (no reasoning, medium verbosity) | 8.67 | 15.4 s | 0 |
| GPT-5.4 (no reasoning, high verbosity) | 8.63 | 17.2 s | 0 |
| GPT-5.1 (no reasoning, medium verbosity) | 8.60 | 12.0 s | 0 |
| GPT-5.4 (no reasoning, low verbosity) | 8.57 | 13.0 s | 0 |
| GPT-5.4 (low reasoning, low verbosity) | 8.55 | 36.6 s | 700 |
| GPT-5.4 (low reasoning, medium verbosity) | 8.53 | 32.5 s | – |
| GPT-5.1 (no reasoning, low verbosity) | 8.40 | 12.4 s | – |
| GPT-5.1 (low reasoning, low verbosity) | 8.38 | 27.5 s | – |
| GPT-5.4 (medium reasoning, medium verbosity) | 8.30 | 85.6 s | ~7,000 |
| GPT-5-mini (medium reasoning, medium verbosity) | 8.00 | 111.6 s | ~5,200 |
| GPT-4.1 (non-reasoning) | 7.80 | 10.8 s | 0 |
| Mistral Large 3 | 7.70 | 15.3 s | – |
| GPT-5-nano (medium reasoning, medium verbosity) | 6.30 | 112.3 s | ~12,500 |
Higher is better. The two configurations we recommend for production, GPT-5.4 (no reasoning, medium verbosity) and GPT-5.1 (low reasoning, medium verbosity), are covered under Practical recommendations below.
Latency ranking (same configurations, fastest first)
| Configuration | Latency |
|---|---|
| GPT-4.1 (non-reasoning) | 10.8 s |
| GPT-5.1 (no reasoning, medium verbosity) | 12.0 s |
| GPT-5.1 (no reasoning, low verbosity) | 12.4 s |
| GPT-5.4 (no reasoning, low verbosity) | 13.0 s |
| Mistral Large 3 | 15.3 s |
| GPT-5.4 (no reasoning, medium verbosity) | 15.4 s |
| GPT-5.4 (no reasoning, high verbosity) | 17.2 s |
| GPT-5.1 (low reasoning, medium verbosity) | 23.7 s |
| GPT-5.1 (low reasoning, low verbosity) | 27.5 s |
| GPT-5.4 (low reasoning, medium verbosity) | 32.5 s |
| GPT-5.4 (low reasoning, low verbosity) | 36.6 s |
| GPT-5.4 (medium reasoning, medium verbosity) | 85.6 s |
| GPT-5-mini (medium reasoning, medium verbosity) | 111.6 s |
| GPT-5-nano (medium reasoning, medium verbosity) | 112.3 s |
Lower is better. Reasoning modes multiply latency by 5–10×.
What we learned
1. Reasoning tokens are mostly wasted on this task
The standout finding: enabling reasoning rarely improves quality and always hurts latency.
| GPT-5.4 configuration | Quality | Latency | Reasoning tokens |
|---|---|---|---|
| reasoning = none | 8.67 | 15.4 s | 0 |
| reasoning = low | 8.55 | 36.6 s | 700 |
| reasoning = medium | 8.30 | 85.6 s | 7,000 |
More reasoning produced worse scores at 5.5× the latency. For document-extraction tasks, where the answer is already in the input rather than something to be inferred, reasoning is the wrong tool.
2. Verbosity has a small but real effect
Going from verbosity=low to verbosity=medium adds roughly 5–10% latency but consistently improves completeness scores by 0.05–0.15 points. verbosity=high does not improve quality further.
3. Newer ≠ better
GPT-4.1 (the older non-reasoning model, with the largest context window at 1M tokens) scored 7.8, nearly a full point below GPT-5.4 and GPT-5.1. The GPT-5 family is meaningfully better at clinical fact extraction, and GPT-4.1's wider context window is irrelevant when transcripts fit comfortably in 6K tokens.
4. Smaller is much worse
GPT-5-nano scored 6.3 with massive output duplication. Mini and nano variants of reasoning models cannot be substituted for the full models on this task.
5. Mistral Large 3 has a section-routing weakness
Mistral followed section titles instead of section instructions: when a section was titled "Aktuelle Beschwerden" but the instruction asked for comorbidities and vaccinations, it placed current symptoms there anyway. The GPT-5.x models followed the instructions correctly. This is a real behavioral difference between the model families.
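To make the failure mode concrete, here is the shape of a template section that triggers it; the field names and German wording are illustrative, not our actual template schema:

```python
# A section whose title and instruction intentionally diverge. A title-follower
# fills it with current symptoms; an instruction-follower documents
# comorbidities and vaccination status as asked.
section = {
    "title": "Aktuelle Beschwerden",                    # "current complaints"
    "instruction": "Komorbiditäten und Impfstatus dokumentieren",  # comorbidities + vaccinations
}
```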
6. Refusal rates differ between models
GPT-5.4 declined to generate on roughly 5% of first attempts, especially with heavier reasoning; every refusal succeeded on the second attempt. GPT-5.1 had zero refusals across our tests.
7. STT-correction ability is the biggest quality differentiator
The complex transcripts contained 7+ STT-garbled drug names. The top models decoded all of them; older and smaller models guessed wrong or invented drug names rather than flagging the ambiguity.
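As an illustration of how this criterion can be pre-checked mechanically before rubric scoring, a sketch using the garble-to-canonical pairs listed in the corpus description (the mapping structure and function are illustrative):

```python
import re

# Garbled STT forms from the corpus and the canonical drug names a correct
# note should contain instead. This check is an illustrative pre-filter; the
# rubric score still comes from physician-style review.
EXPECTED_CORRECTIONS = {
    "Rinbok": "Rinvoq",           # Upadacitinib
    "Mira": "Humira",             # Adalimumab
    "Drittensulum": "Prednisolon",
}

def stt_correction_hits(note_text: str) -> dict[str, bool]:
    """True per garble if the canonical name appears as a whole word and the
    raw garble does not (word boundaries so 'Mira' inside 'Humira' is ignored)."""
    def present(term: str) -> bool:
        return re.search(rf"\b{re.escape(term)}\b", note_text, re.IGNORECASE) is not None
    return {
        garbled: present(canonical) and not present(garbled)
        for garbled, canonical in EXPECTED_CORRECTIONS.items()
    }
```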
Cost considerations
| Model class | Cost / note | Recommendation |
|---|---|---|
| GPT-5.4 (no reasoning) | $0.04–0.06 | Production default |
| GPT-5.1 (low reasoning) | $0.06–0.10 | Higher-quality mode |
| GPT-5.1 (no reasoning) | $0.04–0.05 | Cost-optimized |
| GPT-4.1 | $0.03–0.05 | Lower quality |
| GPT-5-mini / nano | $0.01–0.02 | Not recommended for medical tasks |
| Mistral Large 3 | $0.04–0.06 | Quality regression |
Estimated per note for an average 6,000-token prompt + 1,500-token completion. Excludes infrastructure overhead.
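The per-note figures are straightforward token arithmetic; a sketch with placeholder prices (the rates below are illustrative, not actual provider pricing):

```python
# Illustrative cost arithmetic for one note. The per-million-token rates are
# placeholders, not actual provider pricing; token counts match the averages
# assumed above (6,000 prompt + 1,500 completion tokens).
PROMPT_TOKENS = 6_000
COMPLETION_TOKENS = 1_500

def cost_per_note(price_in_per_mtok: float, price_out_per_mtok: float,
                  reasoning_tokens: int = 0) -> float:
    """Reasoning tokens are typically billed as output tokens, which is why
    reasoning modes raise cost as well as latency."""
    output_tokens = COMPLETION_TOKENS + reasoning_tokens
    return (PROMPT_TOKENS * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# With hypothetical rates of $1.25 / $10 per million input / output tokens:
#   cost_per_note(1.25, 10.0)                         -> ~$0.023
#   cost_per_note(1.25, 10.0, reasoning_tokens=7_000) -> ~$0.093
```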
The cost difference between the best and worst-performing model in our test set is about 4× — meaningful but small relative to the quality gap (8.77 vs 6.30). For medical note generation, it rarely makes economic sense to optimize for the lowest-cost model. The reduction in physician edit time from a higher-quality first draft pays for the model multiple times over.
Practical recommendations
Pick a single model — don't expose model selection to clinical users
We strongly recommend a single default model, plus at most one quality toggle. Exposing a list of LLM names to physicians causes decision fatigue and inconsistent output across a practice.
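In code, that policy is a few lines; the configuration values below mirror the two recommendations that follow, and the class and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoteGenConfig:
    model: str
    reasoning: str   # "none" | "low"
    verbosity: str   # "low" | "medium" | "high"

# One production default plus a single optional quality toggle. Clinical users
# never see raw model names, only an optional "higher quality (slower)" switch.
DEFAULT = NoteGenConfig(model="gpt-5.4", reasoning="none", verbosity="medium")
HIGH_QUALITY = NoteGenConfig(model="gpt-5.1", reasoning="low", verbosity="medium")

def select_config(high_quality: bool = False) -> NoteGenConfig:
    return HIGH_QUALITY if high_quality else DEFAULT
```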
For most teams: GPT-5.4 (reasoning = none, verbosity = medium)
- 8.67 / 10 average quality
- ~15 seconds model round-trip (≈24 seconds end-to-end with backend overhead)
- Zero reasoning tokens — predictable cost
- Excellent STT correction
- Excellent refusal documentation (the medico-legally critical detail)
For teams that prioritize maximum quality: GPT-5.1 (reasoning = low, verbosity = medium)
- 8.77 / 10
- More conservative [unklar] flagging on STT-ambiguous content
- ~25 seconds model round-trip (~30 s end-to-end with backend overhead)
- More physician-context preservation (named consultants, supply quantities — the kind of detail that matters in long-running care)
Models we recommend AGAINST for medical tasks
- GPT-5.4 with medium reasoning — heavy reasoning on extraction tasks degrades output and triples latency
- GPT-5-mini and GPT-5-nano — section-routing failures, content duplication, dropped facts
- Mistral Large 3 — follows section titles instead of instructions; not safe for billing-critical exam findings
- GPT-4.1 — a generation behind on STT correction and clinical reasoning depth
Limitations
- Sample size is 6 transcripts — representative across complexity bands but not statistically large. Long-term validation comes from monitoring physician edit rates in production.
- Two specialties tested (Cardiology and Rheumatology). Other specialties may yield different rankings.
- All testing in German (de-DE). Findings may not transfer directly to English-language deployments without re-validation.
- Models tested April 2026; provider-side updates may shift scores. We re-validate before any model migration.
Bottom line
The frontier of LLM quality in medical fact extraction is currently held by GPT-5.1 and GPT-5.4 at medium verbosity with little or no reasoning. Counterintuitively, more reasoning hurts performance on this task: extraction is bounded by what the transcript contains, not by what the model can infer.
For health-tech teams choosing an LLM today: don't pay for reasoning you don't need, don't trust smaller variants for medical text, and always validate on real STT-garbled content rather than clean prompts.