Nixi AI

PART I

Choosing the Right LLM for Medical Note Generation: A Multi-Model Benchmark (April 2026)

Nixi AI benchmarked 12 LLMs on German consultation transcripts: medical fact extraction, hallucination rate, latency. Results, trade-offs, and model recommendations for clinical documentation.

Mahsa Yarahmadi · CEO & Co-Founder · Nixi AI

Published 25 April 2026

Multi-model benchmark · April 2026

We benchmarked 12 leading large language models across the three dimensions that matter for clinical documentation: medical fact extraction quality, hallucination rate, and end-to-end latency. We evaluated 60+ generated notes against a 10-criterion physician-feedback rubric on 6 representative German consultations spanning Cardiology and Rheumatology.

Why this study matters

Medical documentation is the highest-stakes use case for LLMs. A hallucinated medication, a dropped negative finding, or a misattributed therapy decision is not a quality issue — it is a patient-safety and medico-legal issue.

We needed to know which of today's frontier models can reliably:

  1. Extract every clinical fact from a physician–patient conversation
  2. Avoid inventing anything that wasn't said
  3. Decode speech-to-text errors (e.g., "Rinbok" → Rinvoq / Upadacitinib)
  4. Document treatment decisions correctly, including patient refusals and compromises
  5. Do all of the above fast enough that physicians actually use it

Methodology

Test corpus

  • 6 German clinical transcripts drawn from anonymized real consultations
  • Spanning complexity bands: 2 simple (469–501 words), 2 medium (639–757 words), 2 complex (1,141–2,204 words)
  • Two specialties: Cardiology and Rheumatology
  • Real STT garble preserved (e.g., "Mira" → Humira, "Rinbok" → Rinvoq, "Drittensulum" → Prednisolon) so we test the model's STT-correction ability under production conditions

Evaluation criteria (10 dimensions, weighted)

Derived from real physician feedback collected over 9 months at a German rheumatology practice (Immunologikum Hamburg, July 2025 – March 2026).

Evaluation dimensions with weights
Dimension · Weight
Factual completeness · 20%
Hallucination control · 10%
STT correction · 10%
Medical terminology precision · 10%
Therapy status accuracy · 15%
Section placement · 10%
Template-instruction compliance · 10%
Gender-neutral / impersonal style · 10%
Variable summary openings · 10%
Standardized activity terms · 5%

Each criterion is scored 1–10 per note; the weighted combination of criteria yields the overall score used in the rankings below. Verdict thresholds: PASS ≥8.5 · NEEDS_REVIEW ≥6.0 · FAIL <6.0.
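The scoring scheme can be sketched as follows. The weights mirror the published table (note they sum to slightly more than 1, so this sketch divides by the total weight); the dictionary keys are our own shorthand:

```python
# Weights as published in the evaluation table above.
WEIGHTS = {
    "factual_completeness": 0.20,
    "hallucination_control": 0.10,
    "stt_correction": 0.10,
    "terminology_precision": 0.10,
    "therapy_status_accuracy": 0.15,
    "section_placement": 0.10,
    "template_compliance": 0.10,
    "impersonal_style": 0.10,
    "variable_openings": 0.10,
    "standardized_terms": 0.05,
}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each 1-10).

    Divides by the total weight so the result stays on the 1-10 scale."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * criterion_scores[k] for k in WEIGHTS) / total_weight

def verdict(score: float) -> str:
    """Map an overall score to the published verdict thresholds."""
    if score >= 8.5:
        return "PASS"
    if score >= 6.0:
        return "NEEDS_REVIEW"
    return "FAIL"
```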

Test design

Each model received the same system prompt, the same user template, and the same transcript. Each API call was issued in isolation so latency reflects a single round-trip — no batching, no aggregation effects. Calls were made directly to the provider's hosted endpoint (Azure OpenAI Responses API for the GPT-5 family; Azure AI Foundry for Mistral). Refusal events (where the model declined to generate) were retried up to twice; persistent refusals were recorded.
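The retry policy can be sketched as harness logic. `generate_note` and `RefusalError` are hypothetical stand-ins for the provider call and its refusal signal, not the actual Nixi AI harness:

```python
class RefusalError(Exception):
    """Hypothetical: raised when the model declines to generate."""

MAX_ATTEMPTS = 3  # one initial call plus up to two retries

def call_with_retries(generate_note, transcript: str):
    """Issue a single isolated call; retry refusals up to twice.

    Returns (note or None, attempts_used). A third refusal is recorded
    as a persistent refusal rather than retried further."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return generate_note(transcript), attempt
        except RefusalError:
            if attempt == MAX_ATTEMPTS:
                return None, attempt  # persistent refusal, recorded
```

Because each attempt is an isolated round-trip, retries inflate only wall-clock time for that note, not the per-call latency figures reported below.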

The 12 models tested

Model overview
Model · Provider · Context
GPT-5.1 · OpenAI · 200K tokens
GPT-5.4 · OpenAI · 200K tokens
GPT-5 · OpenAI · 200K tokens
GPT-5-mini · OpenAI · 200K tokens
GPT-5-nano · OpenAI · 200K tokens
GPT-4.1 · OpenAI · 1M tokens (non-reasoning)
Mistral Large 3 · Mistral AI · 131K tokens

We tested several reasoning / verbosity configurations of the GPT-5 family because settings materially affect both quality and latency.

Results

Final ranking — best configuration per model

Quality — best configuration per model
Configuration · Latency · Score (0–10)
  • GPT-5.1 (low reasoning, medium verbosity) · 23.7 s · ~1,300 reasoning tokens · 8.77
  • GPT-5.4 (no reasoning, medium verbosity) · 15.4 s · 0 reasoning tokens · 8.67
  • GPT-5.4 (no reasoning, high verbosity) · 17.2 s · 8.63
  • GPT-5.1 (no reasoning, medium verbosity) · 12.0 s · 8.60
  • GPT-5.4 (no reasoning, low verbosity) · 13.0 s · 8.57
  • GPT-5.4 (low reasoning, low verbosity) · 36.6 s · 8.55
  • GPT-5.4 (low reasoning, medium verbosity) · 32.5 s · 8.53
  • GPT-5.1 (no reasoning, low verbosity) · 12.4 s · 8.40
  • GPT-5.1 (low reasoning, low verbosity) · 27.5 s · 8.38
  • GPT-5.4 (medium reasoning, medium verbosity) · 85.6 s · ~7,000 reasoning tokens · 8.30
  • GPT-5-mini (medium / medium) · 111.6 s · ~5,200 reasoning tokens · 8.00
  • GPT-4.1 (non-reasoning) · 10.8 s · 7.80
  • Mistral Large 3 · 15.3 s · 7.70
  • GPT-5-nano (medium / medium) · 112.3 s · ~12,500 reasoning tokens · 6.30

Higher is better. GPT-5.4 (no reasoning, medium verbosity) and GPT-5.1 (low reasoning, medium verbosity) are our two production recommendations.

Latency — end-to-end per note
Configuration · Seconds
  • GPT-4.1 (non-reasoning) · 10.8 s
  • GPT-5.1 (no reasoning, medium verbosity) · 12.0 s
  • GPT-5.1 (no reasoning, low verbosity) · 12.4 s
  • GPT-5.4 (no reasoning, low verbosity) · 13.0 s
  • Mistral Large 3 · 15.3 s
  • GPT-5.4 (no reasoning, medium verbosity) · 15.4 s
  • GPT-5.4 (no reasoning, high verbosity) · 17.2 s
  • GPT-5.1 (low reasoning, medium verbosity) · 23.7 s
  • GPT-5.1 (low reasoning, low verbosity) · 27.5 s
  • GPT-5.4 (low reasoning, medium verbosity) · 32.5 s
  • GPT-5.4 (low reasoning, low verbosity) · 36.6 s
  • GPT-5.4 (medium reasoning, medium verbosity) · 85.6 s
  • GPT-5-mini (medium / medium) · 111.6 s
  • GPT-5-nano (medium / medium) · 112.3 s

Lower is better. Reasoning modes multiply latency by 5–10×.

What we learned

1. Reasoning tokens are mostly wasted on this task

The standout finding: enabling reasoning rarely improves quality and always hurts latency.

GPT-5.4 — Reasoning trade-off
Configuration · Quality · Latency · Reasoning tokens
reasoning = none · 8.67 · 15.4 s · 0
reasoning = low · 8.55 · 36.6 s · 700
reasoning = medium · 8.30 · 85.6 s · 7,000

More reasoning → worse scores, 5.5× the latency. For document extraction tasks (where the answer is in the input, not in inferred reasoning), reasoning is the wrong tool.

2. Verbosity has a small but real effect

Going from verbosity=low to verbosity=medium adds ~5–10% latency but consistently improves completeness scores by 0.05–0.15 points. verbosity=high does not improve further.

3. Newer ≠ better

GPT-4.1 (the older non-reasoning model, with the largest context window at 1M tokens) scored 7.8, roughly a full point below GPT-5.4 and GPT-5.1. The GPT-5 family is meaningfully better at clinical fact extraction, and GPT-4.1's wider context window is irrelevant when transcripts fit in 6K tokens.

4. Smaller is much worse

GPT-5-nano scored 6.3 with massive output duplication. Mini and nano variants of reasoning models cannot be substituted for the full models on this task.

5. Mistral Large 3 has a section-routing weakness

Mistral followed section titles instead of section instructions: when a section was titled "Aktuelle Beschwerden" ("current symptoms") but the instruction said "comorbidities and vaccinations," it placed current symptoms there anyway. GPT-5.x models followed the instructions correctly. This is a substantive instruction-following difference, not a formatting quirk.
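The failure mode in a nutshell: a section's title and its instruction can disagree, and the note generator must route content by the instruction. The template fragment below is illustrative, not our production schema:

```python
# Illustrative template section where title and instruction diverge.
SECTION = {
    "title": "Aktuelle Beschwerden",  # "Current symptoms"
    "instruction": "List comorbidities and vaccinations only.",
}
# GPT-5.x routed content by SECTION["instruction"]; Mistral Large 3 routed
# by SECTION["title"], placing current symptoms here regardless.
```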

6. Refusal rates differ between models

GPT-5.4 declined to generate on ~5% of first attempts (especially with heavier reasoning); every refusal succeeded on the second attempt. GPT-5.1 had zero refusals across our tests.

7. STT-correction ability is the biggest quality differentiator

The complex transcripts contained 7+ STT-garbled drug names. The top models decoded all of them; older and smaller models guessed wrong or invented plausible-sounding drug names rather than flagging the ambiguity.

Cost considerations

Approximate cost per medical note
Model class · Cost / note · Recommendation
GPT-5.4 (no reasoning) · $0.04–0.06 · Production default
GPT-5.1 (low reasoning) · $0.06–0.10 · Higher-quality mode
GPT-5.1 (no reasoning) · $0.04–0.05 · Cost-optimized
GPT-4.1 · $0.03–0.05 · Lower quality
GPT-5-mini / nano · $0.01–0.02 · Not recommended for medical tasks
Mistral Large 3 · $0.04–0.06 · Quality regression

Estimated per note for an average 6,000-token prompt + 1,500-token completion. Excludes infrastructure overhead.

The cost difference between the best and worst-performing model in our test set is about 4× — meaningful but small relative to the quality gap (8.77 vs 6.30). For medical note generation, it rarely makes economic sense to optimize for the lowest-cost model. The reduction in physician edit time from a higher-quality first draft pays for the model multiple times over.
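The per-note arithmetic is straightforward. The prices below are hypothetical placeholders (USD per million tokens), not published rates; the token counts are the averages stated above:

```python
# Average note size from the benchmark: 6,000 prompt + 1,500 completion tokens.
PROMPT_TOKENS = 6_000
COMPLETION_TOKENS = 1_500

def cost_per_note(input_price_per_m: float, output_price_per_m: float,
                  reasoning_tokens: int = 0) -> float:
    """Per-note cost in USD. Reasoning tokens are typically billed as
    output tokens, which is why reasoning modes inflate cost as well
    as latency."""
    input_cost = PROMPT_TOKENS / 1e6 * input_price_per_m
    output_cost = (COMPLETION_TOKENS + reasoning_tokens) / 1e6 * output_price_per_m
    return input_cost + output_cost
```

With hypothetical prices of $2.50/M input and $10/M output, a no-reasoning note costs about $0.03; add the ~7,000 reasoning tokens of a medium-reasoning run and the same note costs about $0.10, before any quality benefit materializes.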

Practical recommendations

Pick a single model — don't expose model selection to clinical users

We strongly recommend a single default model plus, at most, one quality toggle. Exposing a list of LLM names to physicians invites decision fatigue and produces inconsistent output across a practice.
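In practice this means the clinical UI resolves at most one boolean, never a model picker. The configuration names below are ours, mapped to the two recommendations that follow:

```python
# One default, one optional toggle: all a clinical UI should ever expose.
DEFAULT_CONFIG = "gpt-5.4 / no reasoning / medium verbosity"
HIGH_QUALITY_CONFIG = "gpt-5.1 / low reasoning / medium verbosity"

def resolve_model(high_quality: bool = False) -> str:
    """Physicians see one checkbox at most, never a list of model names."""
    return HIGH_QUALITY_CONFIG if high_quality else DEFAULT_CONFIG
```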

For most teams: GPT-5.4 (reasoning = none, verbosity = medium)

  • 8.67 / 10 average quality
  • ~15 seconds end-to-end (≈24 seconds with backend overhead)
  • Zero reasoning tokens — predictable cost
  • Excellent STT correction
  • Excellent refusal documentation (the medico-legally critical detail)

For teams that prioritize maximum quality: GPT-5.1 (reasoning = low, verbosity = medium)

  • 8.77 / 10
  • More conservative [unklar] flagging on STT-ambiguous content
  • ~25 seconds (~30s with backend overhead)
  • More physician-context preservation (named consultants, supply quantities — the kind of detail that matters in long-running care)

Models we recommend AGAINST for medical tasks

  • GPT-5.4 with medium reasoning: heavy reasoning on extraction tasks degrades output and multiplies latency by more than 5×
  • GPT-5-mini and GPT-5-nano: section-routing failures, content duplication, dropped facts
  • Mistral Large 3: follows section titles instead of instructions; not safe for billing-critical exam findings
  • GPT-4.1: a generation behind on STT correction and clinical reasoning depth

Limitations

  • Sample size is 6 transcripts — representative across complexity bands but not statistically large. Long-term validation comes from monitoring physician edit rates in production.
  • Two specialties tested (Cardiology and Rheumatology). Other specialties may yield different rankings.
  • All testing in German (de-DE). Findings may not transfer directly to English-language deployments without re-validation.
  • Models tested April 2026; provider-side updates may shift scores. We re-validate before any model migration.

Bottom line

The frontier of LLM quality in medical fact extraction is currently held by GPT-5.1 and GPT-5.4 with moderate verbosity and no reasoning. Counterintuitively, more reasoning hurts performance on this task — extraction is bounded by what the transcript contains, not by what the model can infer.

For health-tech teams choosing an LLM today: don't pay for reasoning you don't need, don't trust smaller variants for medical text, and always validate on real STT-garbled content rather than clean prompts.


What this means for you

Translating findings into your practice.

If you're a solo practitioner

Less time on documentation means more capacity for the patients you already see. Start with Basic; the trial uses the same engine that powers the studies on this site.

If you work in a practice team

The findings here generalise across multi-clinician practices. Practice Pro adds shared templates, central admin, and the optional PVS-integration add-on for automatic sync.

If you make decisions for a clinic or MVZ

Standardised documentation, measurable time savings, and patients who welcome the technology: three KPIs your leadership wants to see. Enterprise includes direct HIS / KIS integration.

Built on

  • DSGVO & § 203 StGB compliant
  • Clinically reviewed before publication
  • EULAR-validated approach

Your colleagues are already saving an hour a day

Start your free trial and see why.