Case Study I: Medical AI
When the Doctor Doesn't Know What It Doesn't Know
A hospital system in Chennai deploys an LLM assistant for drug-interaction lookup. The model is GPT-class, fine-tuned on medical literature. Within three weeks it produces a near-miss: the model confidently recommends a drug combination that is contraindicated for a rare liver enzyme variant, one covered in only 0.3% of the training corpus. The model's raw confidence was 0.91, and a human reviewer caught the error only by luck.
The PHANTASM intervention
PHANTASM wraps every inference call.
HGT traces gradients on drug-interaction queries. For common pairs (aspirin + ibuprofen), gradient norms are flat, which indicates that the answer is uniformly grounded. For the rare enzyme variant, gradient norms spike at three tokens: the enzyme variant name, the drug compound, and the dosage figure. These are flagged as knowledge_gaps with severity "high".
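The spike detection described above can be illustrated with a minimal sketch. It assumes per-token gradient norms have already been computed and flags any token whose norm exceeds a multiple of the median. The function name `flag_knowledge_gaps`, the `ratio` threshold, the placeholder tokens, and the numeric values are all hypothetical and chosen for illustration; the source does not specify how HGT defines a spike.

```python
from statistics import median


def flag_knowledge_gaps(tokens, grad_norms, ratio=3.0):
    """Flag tokens whose gradient norm exceeds `ratio` times the median norm.

    Assumption: a "spike" is a norm well above the sequence median.
    The real HGT criterion is not given in the source.
    """
    if len(tokens) != len(grad_norms):
        raise ValueError("tokens and grad_norms must have the same length")
    if not grad_norms:
        return []
    baseline = median(grad_norms)
    gaps = []
    for index, (token, norm) in enumerate(zip(tokens, grad_norms)):
        if baseline > 0 and norm > ratio * baseline:
            gaps.append(
                {"index": index, "token": token, "grad_norm": norm, "severity": "high"}
            )
    return gaps


# Illustrative values only: a flat profile and a profile with three spikes.
common_tokens = ["aspirin", "with", "ibuprofen", "interaction"]
common_norms = [0.10, 0.11, 0.09, 0.10]

rare_tokens = ["patient", "with", "<enzyme_variant>", "taking", "<drug>", "at", "<dose>"]
rare_norms = [0.11, 0.09, 0.92, 0.10, 0.85, 0.08, 0.88]

common_gaps = flag_knowledge_gaps(common_tokens, common_norms)
rare_gaps = flag_knowledge_gaps(rare_tokens, rare_norms)
```

With these made-up inputs the flat profile yields no flags, while the second profile flags the three placeholder tokens. A median baseline is used here because a few large spikes would inflate a mean and hide themselves.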
UC crystallizes the uncertainty. Raw confidence: 0.91. After MC-Dropout (N=30) and temperature scaling, calibrated confidence: 0.43. Tier: fluid. Conformal interval: (0.28, 0.58). The system routes all fluid and vapor responses to human specialists automatically.
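A rough sketch of the calibrate-then-route step follows. It assumes the confidence is a sigmoid over a single logit, that MC-Dropout supplies N stochastic logits for the same query, and that tiers are cut at fixed thresholds. The tier name "solid", the threshold values, the temperature, and the routing labels are assumptions made for this example; the source names only the fluid and vapor tiers and does not give their boundaries. The conformal interval is not covered here.

```python
import math
from statistics import mean


def temperature_scale(logit, temperature):
    """Sigmoid of the logit divided by a temperature (T > 1 softens confidence)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def calibrated_confidence(mc_logits, temperature):
    """Average the temperature-scaled confidence over MC-Dropout passes."""
    return mean(temperature_scale(z, temperature) for z in mc_logits)


def assign_tier(confidence, solid_min=0.75, fluid_min=0.35):
    """Map a calibrated confidence to a tier.

    Thresholds and the "solid" label are hypothetical.
    """
    if confidence >= solid_min:
        return "solid"
    if confidence >= fluid_min:
        return "fluid"
    return "vapor"


def route(tier):
    """Send fluid and vapor responses to a human specialist, as the text describes."""
    return "human_specialist" if tier in {"fluid", "vapor"} else "auto_release"


# Using the calibrated value reported in the case study.
tier = assign_tier(0.43)
destination = route(tier)
```

Under these assumed thresholds, the reported calibrated confidence of 0.43 lands in the fluid tier and is routed to a specialist, whereas the raw 0.91 would have passed straight through.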
CMN mines the confabulation space and surfaces a plausible hypothesis: the enzyme variant may potentiate hepatotoxic effects. The hypothesis is flagged for pharmacologist review and is confirmed in the literature three weeks later.
Outcome
- Near-miss events of this class: zero in the following six months
- CMN hypotheses led to two literature searches and one ongoing research collaboration
- The model's hallucination was not a failure; it was the most precise diagnostic signal in the system