21 May 2026

Why we stopped fine-tuning and bet on RAG instead

After three fine-tuning runs and a Gemma rebase that regressed quality, we ratified ADR-007: the moat was never the weights. Here's what the benchmark data said and what we built instead.

Tags: engineering, clinical-ai, rag, architecture

Six months in, KAI had a problem. The v0.5 LoRA checkpoint — Mistral 7B, fine-tuned on 2,400 synthetic SOAP notes — was producing reasonable output. But v0.8, which rebased onto Gemma 3, regressed by 10 percentage points on our structured rubric. The fine-tuning loop was producing diminishing returns and we couldn't explain why.

We ran the numbers and ratified a new architecture decision. This is the story of that decision.

What the data said

Our Phase 3 evaluation set had grown to 57 cases across five languages (DE, EN, FR, IT, ES). The benchmark showed:

| Provider | Pass rate | Notes |
| --- | --- | --- |
| mistral-local (v0.5 LoRA) | 51 / 57 (89%) | Failures concentrated in FR/ES |
| kai-local (Gemma rebase) | 46 / 57 (81%) | Regression vs v0.5 |
| mistral-local (RAG v2) | 57 / 57 (100%) | After ADR-007 pivot |
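
For readers who want to check the percentages, here is a minimal sketch that recomputes them from the pass counts in the table. The dictionary and function names are illustrative and not part of KAI.

```python
# Pass counts copied from the benchmark table; names here are illustrative.
RESULTS = {
    "mistral-local (v0.5 LoRA)": (51, 57),
    "kai-local (Gemma rebase)": (46, 57),
    "mistral-local (RAG v2)": (57, 57),
}


def pass_rate_percent(passed: int, total: int) -> int:
    """Return the pass rate rounded to a whole-number percentage."""
    return round(100 * passed / total)


for provider, (passed, total) in RESULTS.items():
    print(f"{provider}: {passed}/{total} = {pass_rate_percent(passed, total)}%")
```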

The failure mode for kai-local was not safety: it passed 10 of 11 adversarial species-blacklist cases. The failures were in multilingual coverage and citation quality. The model kept hallucinating source references.

We measured a 26% phantom citation rate on kai-local, counting as phantom any claim that could not be traced back to a retrieved source chunk. In a clinical context, that's not a model quality problem; it's a trust problem.

The diagnosis

Fine-tuning was doing two jobs at once: teaching the model clinical knowledge, and teaching it the SOAP format. The format work was succeeding. The knowledge work wasn't — because knowledge baked into weights is static, unauditable, and expensive to update.

The moment a drug interaction table changes, or a new WSAVA guideline ships, the weights are stale. And you can't tell which claim came from training data vs. which came from the input.

RAG solves the audit problem structurally. If every clinical claim must reference a retrieved chunk, you can trace every assertion back to a source document with a version and a date. On our benchmark the phantom citation rate dropped from 26% to 0.2%, not because the model got smarter, but because the architecture now enforces provenance.
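
To make "trace every assertion back to a source document" concrete, here is a minimal sketch of what such a lookup could look like. The `Chunk` and `Claim` types, their field names, and the sample data are all assumptions for illustration; the actual KAI data model is not described in this post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage with provenance metadata (hypothetical fields)."""
    chunk_id: str
    source: str
    version: str
    published: date
    section: str


@dataclass(frozen=True)
class Claim:
    """A clinical claim plus the chunk IDs it cites (hypothetical shape)."""
    text: str
    cite: tuple[str, ...]


def trace(claim: Claim, index: dict[str, Chunk]) -> list[Chunk]:
    """Return the chunks backing a claim, or raise if it is ungrounded."""
    if not claim.cite:
        raise LookupError(f"claim has no citations: {claim.text!r}")
    missing = [cid for cid in claim.cite if cid not in index]
    if missing:
        raise LookupError(f"claim cites unknown chunks: {missing}")
    return [index[cid] for cid in claim.cite]


# Made-up sample data, only to show the call shape.
index = {
    "chunk-001": Chunk("chunk-001", "Example Drug Monograph", "2026-01",
                       date(2026, 1, 15), "4.2"),
}
claim = Claim("Placeholder clinical claim", ("chunk-001",))
for chunk in trace(claim, index):
    print(chunk.source, chunk.version, chunk.published.isoformat(), chunk.section)
```

The point of the sketch is that an ungrounded claim fails loudly at lookup time instead of passing through as plausible text.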

What we built (ADR-007)

The new stack:

  • Embeddings: BGE-M3 (multilingual, 1024-dim) — runs locally on M5 Max
  • Vector store: LanceDB hybrid retrieval (dense + sparse BM25)
  • Reranker: MLX bge-reranker-v2-m3 — local inference, 30ms latency
  • Citation schema: ADR-008 v3 — every plan.medications and plan.diagnostics item carries a cite[] array of chunk IDs. The Layer-2 classifier rejects responses where Assessment or Plan sections make claims without chunk references.
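
Hybrid retrieval needs some way to merge the dense and sparse result lists before reranking. This post does not say which fusion method the stack uses, so the sketch below shows reciprocal rank fusion (RRF), a common generic technique, purely as an illustration. It is plain Python and does not call the LanceDB API; the function name, the `k = 60` constant, and the chunk IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs using reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, with rank
    starting at 1. Chunks found by several retrievers accumulate score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Made-up result lists standing in for dense (embedding) and sparse (BM25) hits.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that both retrievers agree on (`c1`, `c3`) rise to the top, and the fused list is what a reranker would then reorder.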

The corpus covers Merck Veterinary Manual, PubMed OA (vet-filtered), FDA Animal Drugs, Swissmedic, and AAHA/WSAVA clinical guidelines — all with source metadata, version dates, and section identifiers.

The moat reframed

We spent the first four months thinking the moat was the weights: a fine-tuned model that knew veterinary medicine better than the base model. That turned out to be the wrong frame.

The actual moat is the corpus and the trust infrastructure:

  1. Corpus quality — curated, versioned, Swiss-and-EU-oriented clinical literature. When Swissmedic updates a drug monograph, we update the index. The model doesn't need to be retrained.
  2. Citation enforcement — structural, not probabilistic. A response without grounded citations cannot pass the Layer-2 check and is rejected before it reaches the vet's screen.
  3. Multilingual parity — BGE-M3 embeds across all five languages natively. The retrieval quality in FR and IT is now comparable to DE/EN, which was impossible with a weight-baked approach.
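
The citation enforcement in point 2 can be pictured as a structural check over the response. The sketch below is an assumption-laden illustration, not the Layer-2 classifier itself: it only covers the `plan.medications` and `plan.diagnostics` fields with their `cite[]` arrays, since those are the only parts of the schema described here. The function names and the `name` field are hypothetical.

```python
def ungrounded_items(response: dict, retrieved_ids: set[str]) -> list[str]:
    """List plan items whose cite[] is empty or points at unretrieved chunks."""
    problems: list[str] = []
    plan = response.get("plan", {})
    for field in ("medications", "diagnostics"):
        for i, item in enumerate(plan.get(field, [])):
            path = f"plan.{field}[{i}]"
            cites = item.get("cite", [])
            if not cites:
                problems.append(f"{path}: no citations")
            for chunk_id in cites:
                if chunk_id not in retrieved_ids:
                    problems.append(f"{path}: unknown chunk {chunk_id}")
    return problems


def passes_layer2(response: dict, retrieved_ids: set[str]) -> bool:
    """A response passes only if every plan item is grounded.

    Note: a response with no plan items passes trivially in this sketch.
    """
    return not ungrounded_items(response, retrieved_ids)


# Made-up response: the second item cites a chunk that was never retrieved.
response = {
    "plan": {
        "medications": [{"name": "drug-a", "cite": ["c1"]}],
        "diagnostics": [{"name": "test-b", "cite": ["c9"]}],
    }
}
print(passes_layer2(response, {"c1", "c2"}))
print(ungrounded_items(response, {"c1", "c2"}))
```

Because the check is a set-membership test rather than a model judgment, the outcome is the same every time for the same response and retrieval set.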

The v0.5 LoRA still ships as kai-local — it's the sovereignty option for clinics that cannot send data to any cloud provider, even Anthropic EU. But it's no longer the primary quality lever.

What this means for the vet surface

For the clinician using Loki, the visible change is citation chips. Every SOAP Assessment and Plan item that references a clinical source shows a small chip with the source name and section. Clicking it opens the retrieved passage in a side panel.

This does two things: it gives the vet a way to verify any AI claim in under 10 seconds, and it gives us an audit trail that satisfies the clinical governance requirements we're working through with Swissmedic.

The 99.8% citation grounding rate (1 phantom per 492 grounded claims, in a 57-case benchmark) is the number we're taking into the Vetsuisse pilot conversation. It's not a product claim — it's a measurable property of the architecture.
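
The arithmetic behind the headline number is short enough to check by hand. The sketch below assumes "1 phantom per 492 grounded claims" means 493 claims in total; reading it as 492 in total gives the same figures after rounding. The function name is illustrative.

```python
def grounding_rates(grounded: int, phantom: int) -> tuple[float, float]:
    """Return (grounding rate, phantom rate) as fractions of all claims."""
    total = grounded + phantom
    return grounded / total, phantom / total


grounding, phantom = grounding_rates(grounded=492, phantom=1)
print(f"grounding rate: {grounding:.1%}")  # grounding rate: 99.8%
print(f"phantom rate:   {phantom:.1%}")    # phantom rate:   0.2%
```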


KAI is the clinical AI engine that powers Loki's vet surface. It runs locally on Swiss compute. Source: github.com/thoughtful-toby/kai
