Multi-model benchmark · April 2026
We benchmarked 14 configurations of leading large language models across the three dimensions that matter for clinical documentation: medical fact extraction quality, hallucination rate, and end-to-end latency. We evaluated 60+ generated notes against a 10-criterion physician-feedback rubric on 6 representative German consultations spanning Cardiology and Rheumatology.
Why this study matters
Medical documentation is the highest-stakes use case for LLMs. A hallucinated medication, a dropped negative finding, or a misattributed therapy decision is not a quality issue — it is a patient-safety and medico-legal issue.
We needed to know which of today's frontier models can reliably:
- Extract every clinical fact from a physician–patient conversation
- Avoid inventing anything that wasn't said
- Decode speech-to-text errors (e.g., "Rinbok" → Rinvoq / Upadacitinib)
- Document treatment decisions correctly, including patient refusals and compromises
- Do all of the above fast enough that physicians actually use it
Methodology
Test corpus
- 6 German clinical transcripts drawn from anonymized real consultations
- Spanning complexity bands: 2 simple (469–501 words), 2 medium (639–757 words), 2 complex (1,141–2,204 words)
- Two specialties: Cardiology and Rheumatology
- Real STT garble preserved (e.g., "Mira" → Humira, "Rinbok" → Rinvoq, "Drittensulum" → Prednisolon) so we test the model's STT-correction ability under production conditions
Evaluation criteria (10 dimensions, weighted)
Derived from real physician feedback collected over 9 months at a German rheumatology practice (Immunologikum Hamburg, July 2025 – March 2026).
| Dimension | Weight |
|---|---|
| Factual completeness | 20% |
| Hallucination control | 10% |
| STT correction | 10% |
| Medical terminology precision | 10% |
| Therapy status accuracy | 15% |
| Section placement | 10% |
| Template-instruction compliance | 10% |
| Gender-neutral / impersonal style | 10% |
| Variable summary openings | 10% |
| Standardized activity terms | 5% |
Each criterion is scored 1–10 per note; the weighted average of the ten criteria gives an overall score out of 10. Verdict thresholds: PASS ≥8.5 · NEEDS_REVIEW ≥6.0 · FAIL <6.0.
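For concreteness, here is a minimal sketch of that arithmetic. The weights and verdict thresholds come from the rubric above; the dictionary keys and function names are illustrative shorthand, not part of our tooling.

```python
# Rubric weights from the table above (keys are shortened criterion names).
WEIGHTS = {
    "factual_completeness": 0.20,
    "hallucination_control": 0.10,
    "stt_correction": 0.10,
    "terminology_precision": 0.10,
    "therapy_status_accuracy": 0.15,
    "section_placement": 0.10,
    "template_compliance": 0.10,
    "impersonal_style": 0.10,
    "variable_openings": 0.10,
    "standardized_activity_terms": 0.05,
}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each 1-10), normalized by the
    total weight so the result stays on the same 0-10 scale."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * criterion_scores[c] for c in WEIGHTS) / total_weight

def verdict(score: float) -> str:
    """PASS >= 8.5, NEEDS_REVIEW >= 6.0, FAIL below that."""
    if score >= 8.5:
        return "PASS"
    if score >= 6.0:
        return "NEEDS_REVIEW"
    return "FAIL"
```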
Test design
Each model received the same system prompt, the same user template, and the same transcript. Each API call was issued in isolation so latency reflects a single round-trip — no batching, no aggregation effects. Calls were made directly to the provider's hosted endpoint (Azure OpenAI Responses API for the GPT-5 family; Azure AI Foundry for Mistral). Refusal events (where the model declined to generate) were retried up to twice; persistent refusals were recorded.
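A minimal sketch of that per-call flow, with a caller-supplied `call_model` standing in for the provider round-trip (the retry limit follows the design above; everything else, including the return shape, is illustrative):

```python
import time
from typing import Callable, Optional

MAX_RETRIES = 2  # refusals retried up to twice; persistent refusals recorded as such

def run_single_case(call_model: Callable[[str, str, str], Optional[str]],
                    system_prompt: str, user_template: str,
                    transcript: str) -> dict:
    """One isolated round-trip per transcript (same prompts for every model),
    so the measured latency reflects a single request/response cycle."""
    attempts = 0
    while True:
        start = time.perf_counter()
        # call_model wraps the hosted endpoint and returns None on a refusal.
        note = call_model(system_prompt, user_template, transcript)
        latency_s = time.perf_counter() - start
        if note is not None:
            return {"note": note, "latency_s": latency_s, "refusals": attempts}
        attempts += 1
        if attempts > MAX_RETRIES:
            return {"note": None, "latency_s": latency_s, "refusals": attempts}
```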
The models tested
| Model | Provider | Context window |
|---|---|---|
| GPT-5.1 | OpenAI | 200K tokens |
| GPT-5.4 | OpenAI | 200K tokens |
| GPT-5 | OpenAI | 200K tokens |
| GPT-5-mini | OpenAI | 200K tokens |
| GPT-5-nano | OpenAI | 200K tokens |
| GPT-4.1 (non-reasoning) | OpenAI | 1M tokens |
| Mistral Large 3 | Mistral AI | 131K tokens |
We tested several reasoning / verbosity configurations of the GPT-5 family because settings materially affect both quality and latency.
Results
Final ranking · quality score by configuration
| Configuration | Quality (0–10) | Latency | Reasoning tokens |
|---|---|---|---|
| GPT-5.1 (low reasoning, medium verbosity) | 8.77 | 23.7 s | ~1,300 |
| GPT-5.4 (no reasoning, medium verbosity) | 8.67 | 15.4 s | 0 |
| GPT-5.4 (no reasoning, high verbosity) | 8.63 | 17.2 s | 0 |
| GPT-5.1 (no reasoning, medium verbosity) | 8.60 | 12.0 s | 0 |
| GPT-5.4 (no reasoning, low verbosity) | 8.57 | 13.0 s | 0 |
| GPT-5.4 (low reasoning, low verbosity) | 8.55 | 36.6 s | 700 |
| GPT-5.4 (low reasoning, medium verbosity) | 8.53 | 32.5 s | – |
| GPT-5.1 (no reasoning, low verbosity) | 8.40 | 12.4 s | – |
| GPT-5.1 (low reasoning, low verbosity) | 8.38 | 27.5 s | – |
| GPT-5.4 (medium reasoning, medium verbosity) | 8.30 | 85.6 s | ~7,000 |
| GPT-5-mini (medium reasoning, medium verbosity) | 8.00 | 111.6 s | ~5,200 |
| GPT-4.1 (non-reasoning) | 7.80 | 10.8 s | 0 |
| Mistral Large 3 | 7.70 | 15.3 s | – |
| GPT-5-nano (medium reasoning, medium verbosity) | 6.30 | 112.3 s | ~12,500 |
Higher is better. The two configurations we recommend for production, GPT-5.4 (no reasoning, medium verbosity) and GPT-5.1 (low reasoning, medium verbosity), are covered under Practical recommendations below.
Latency ranking (same configurations, fastest first)
| Configuration | Latency |
|---|---|
| GPT-4.1 (non-reasoning) | 10.8 s |
| GPT-5.1 (no reasoning, medium verbosity) | 12.0 s |
| GPT-5.1 (no reasoning, low verbosity) | 12.4 s |
| GPT-5.4 (no reasoning, low verbosity) | 13.0 s |
| Mistral Large 3 | 15.3 s |
| GPT-5.4 (no reasoning, medium verbosity) | 15.4 s |
| GPT-5.4 (no reasoning, high verbosity) | 17.2 s |
| GPT-5.1 (low reasoning, medium verbosity) | 23.7 s |
| GPT-5.1 (low reasoning, low verbosity) | 27.5 s |
| GPT-5.4 (low reasoning, medium verbosity) | 32.5 s |
| GPT-5.4 (low reasoning, low verbosity) | 36.6 s |
| GPT-5.4 (medium reasoning, medium verbosity) | 85.6 s |
| GPT-5-mini (medium reasoning, medium verbosity) | 111.6 s |
| GPT-5-nano (medium reasoning, medium verbosity) | 112.3 s |
Lower is better. Reasoning modes multiply latency by 5–10×.
What we learned
1. Reasoning tokens are mostly wasted on this task
The standout finding: enabling reasoning rarely improves quality and always hurts latency.
| GPT-5.4 configuration | Quality | Latency | Reasoning tokens |
|---|---|---|---|
| reasoning = none | 8.67 | 15.4 s | 0 |
| reasoning = low | 8.55 | 36.6 s | 700 |
| reasoning = medium | 8.30 | 85.6 s | 7,000 |
More reasoning produced worse scores at 5.5× the latency. For document-extraction tasks, where the answer is already in the input rather than something to be inferred, reasoning is the wrong tool.
2. Verbosity has a small but real effect
Going from verbosity=low to verbosity=medium adds roughly 5–10% latency but consistently improves completeness scores by 0.05–0.15 points. verbosity=high does not improve quality further.
3. Newer ≠ better
GPT-4.1 (the older non-reasoning model, with the largest context window at 1M tokens) scored 7.8, nearly a full point below GPT-5.4 and GPT-5.1. The GPT-5 family is meaningfully better at clinical fact extraction, and GPT-4.1's wider context window is irrelevant when transcripts fit comfortably in 6K tokens.
4. Smaller is much worse
GPT-5-nano scored 6.3 with massive output duplication. Mini and nano variants of reasoning models cannot be substituted for the full models on this task.
5. Mistral Large 3 has a section-routing weakness
Mistral followed section titles instead of section instructions: when a section was titled "Aktuelle Beschwerden" but the instruction asked for comorbidities and vaccinations, it placed current symptoms there anyway. The GPT-5.x models followed the instructions correctly. This is a real behavioral difference between the model families.
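To make the failure mode concrete, here is the shape of a template section that triggers it; the field names and German wording are illustrative, not our actual template schema:

```python
# A section whose title and instruction intentionally diverge. A title-follower
# fills it with current symptoms; an instruction-follower documents
# comorbidities and vaccination status as asked.
section = {
    "title": "Aktuelle Beschwerden",                    # "current complaints"
    "instruction": "Komorbiditäten und Impfstatus dokumentieren",  # comorbidities + vaccinations
}
```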
6. Refusal rates differ between models
GPT-5.4 declined to generate on roughly 5% of first attempts, especially with heavier reasoning; every refusal succeeded on the second attempt. GPT-5.1 had zero refusals across our tests.
7. STT-correction ability is the biggest quality differentiator
The complex transcripts contained 7+ STT-garbled drug names. The top models decoded all of them; older and smaller models guessed wrong or invented drug names rather than flagging the ambiguity.
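As an illustration of how this criterion can be pre-checked mechanically before rubric scoring, a sketch using the garble-to-canonical pairs listed in the corpus description (the mapping structure and function are illustrative):

```python
import re

# Garbled STT forms from the corpus and the canonical drug names a correct
# note should contain instead. This check is an illustrative pre-filter; the
# rubric score still comes from physician-style review.
EXPECTED_CORRECTIONS = {
    "Rinbok": "Rinvoq",           # Upadacitinib
    "Mira": "Humira",             # Adalimumab
    "Drittensulum": "Prednisolon",
}

def stt_correction_hits(note_text: str) -> dict[str, bool]:
    """True per garble if the canonical name appears as a whole word and the
    raw garble does not (word boundaries so 'Mira' inside 'Humira' is ignored)."""
    def present(term: str) -> bool:
        return re.search(rf"\b{re.escape(term)}\b", note_text, re.IGNORECASE) is not None
    return {
        garbled: present(canonical) and not present(garbled)
        for garbled, canonical in EXPECTED_CORRECTIONS.items()
    }
```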
Cost considerations
| Model class | Cost / note | Recommendation |
|---|---|---|
| GPT-5.4 (no reasoning) | $0.04–0.06 | Production default |
| GPT-5.1 (low reasoning) | $0.06–0.10 | Higher-quality mode |
| GPT-5.1 (no reasoning) | $0.04–0.05 | Cost-optimized |
| GPT-4.1 | $0.03–0.05 | Lower quality |
| GPT-5-mini / nano | $0.01–0.02 | Not recommended for medical tasks |
| Mistral Large 3 | $0.04–0.06 | Quality regression |
Estimated per note for an average 6,000-token prompt + 1,500-token completion. Excludes infrastructure overhead.
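The per-note figures are straightforward token arithmetic; a sketch with placeholder prices (the rates below are illustrative, not actual provider pricing):

```python
# Illustrative cost arithmetic for one note. The per-million-token rates are
# placeholders, not actual provider pricing; token counts match the averages
# assumed above (6,000 prompt + 1,500 completion tokens).
PROMPT_TOKENS = 6_000
COMPLETION_TOKENS = 1_500

def cost_per_note(price_in_per_mtok: float, price_out_per_mtok: float,
                  reasoning_tokens: int = 0) -> float:
    """Reasoning tokens are typically billed as output tokens, which is why
    reasoning modes raise cost as well as latency."""
    output_tokens = COMPLETION_TOKENS + reasoning_tokens
    return (PROMPT_TOKENS * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# With hypothetical rates of $1.25 / $10 per million input / output tokens:
#   cost_per_note(1.25, 10.0)                         -> ~$0.023
#   cost_per_note(1.25, 10.0, reasoning_tokens=7_000) -> ~$0.093
```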
The cost difference between the best and worst-performing model in our test set is about 4× — meaningful but small relative to the quality gap (8.77 vs 6.30). For medical note generation, it rarely makes economic sense to optimize for the lowest-cost model. The reduction in physician edit time from a higher-quality first draft pays for the model multiple times over.
Practical recommendations
Pick a single model — don't expose model selection to clinical users
We strongly recommend a single default model, plus at most one quality toggle. Exposing a list of LLM names to physicians causes decision fatigue and inconsistent output across a practice.
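In code, that policy is a few lines; the configuration values below mirror the two recommendations that follow, and the class and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoteGenConfig:
    model: str
    reasoning: str   # "none" | "low"
    verbosity: str   # "low" | "medium" | "high"

# One production default plus a single optional quality toggle. Clinical users
# never see raw model names, only an optional "higher quality (slower)" switch.
DEFAULT = NoteGenConfig(model="gpt-5.4", reasoning="none", verbosity="medium")
HIGH_QUALITY = NoteGenConfig(model="gpt-5.1", reasoning="low", verbosity="medium")

def select_config(high_quality: bool = False) -> NoteGenConfig:
    return HIGH_QUALITY if high_quality else DEFAULT
```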
For most teams: GPT-5.4 (reasoning = none, verbosity = medium)
- 8.67 / 10 average quality
- ~15 seconds model round-trip (≈24 seconds end-to-end with backend overhead)
- Zero reasoning tokens — predictable cost
- Excellent STT correction
- Excellent refusal documentation (the medico-legally critical detail)
For teams that prioritize maximum quality: GPT-5.1 (reasoning = low, verbosity = medium)
- 8.77 / 10
- More conservative [unklar] flagging on STT-ambiguous content
- ~25 seconds model round-trip (~30 s end-to-end with backend overhead)
- More physician-context preservation (named consultants, supply quantities — the kind of detail that matters in long-running care)
Models we recommend AGAINST for medical tasks
- GPT-5.4 with medium reasoning — heavy reasoning on extraction tasks degrades output and triples latency
- GPT-5-mini and GPT-5-nano — section-routing failures, content duplication, dropped facts
- Mistral Large 3 — follows section titles instead of instructions; not safe for billing-critical exam findings
- GPT-4.1 — a generation behind on STT correction and clinical reasoning depth
Limitations
- Sample size is 6 transcripts — representative across complexity bands but not statistically large. Long-term validation comes from monitoring physician edit rates in production.
- Two specialties tested (Cardiology and Rheumatology). Other specialties may yield different rankings.
- All testing in German (de-DE). Findings may not transfer directly to English-language deployments without re-validation.
- Models tested April 2026; provider-side updates may shift scores. We re-validate before any model migration.
Bottom line
The frontier of LLM quality in medical fact extraction is currently held by GPT-5.1 and GPT-5.4 at medium verbosity with little or no reasoning. Counterintuitively, more reasoning hurts performance on this task: extraction is bounded by what the transcript contains, not by what the model can infer.
For health-tech teams choosing an LLM today: don't pay for reasoning you don't need, don't trust smaller variants for medical text, and always validate on real STT-garbled content rather than clean prompts.