Research Paper
PHANTASM: Inverting LLM Failure Modes into Productive Features via
Probabilistic Hallucination-Aware Neural Transformation
with Adaptive Synthesis Method
Vignesh S
Department of Computer Science and Engineering, Takshashila University, Chennai, India
applemacbook6sep2004@gmail.com
Abstract
Large language models (LLMs) exhibit three persistent failure modes (hallucination, confabulation, and epistemic miscalibration) that the research community has uniformly treated as defects to be suppressed. We challenge this framing. We present PHANTASM (Probabilistic Hallucination-Aware Neural Transformation with Adaptive Synthesis Method), an inference-time framework that inverts each failure mode into a structured, actionable asset. Concretely: (1) Hallucination Gradient Tracing (HGT) exploits the gradient of a self-consistency loss with respect to input embeddings to produce a Competency Atlas, a per-token knowledge-boundary map requiring no ground truth; (2) Confabulation Mining Network (CMN) uses a contrastive dual-encoder to harvest novel, plausible hypotheses from the model's creative gap-filling, with applications in scientific discovery; (3) Uncertainty Crystallization (UC) chains Monte-Carlo Dropout, learned temperature scaling, and Conformal Prediction to produce four-tier reliability classifications with statistical coverage guarantees. PHANTASM operates post-hoc on any HuggingFace causal LM without weight modification. HGT achieves AUROC 0.83 on LLaMA-7B (+0.08 over SelfCheckGPT's 0.75). UC reduces ECE from 0.187 to 0.041 on GPT-2. CMN surfaces hypotheses with 67% novelty at 77% expert plausibility, versus 8% novelty under standard filtering.
Code: pip install phantasm-llm · GitHub: github.com/vignesh2027/PHANTASM
1. Introduction
The dominant narrative in LLM reliability research treats hallucination as an adversary. Retrieval-Augmented Generation [Lewis et al., 2020] grounds generation in retrieved documents. RLHF [Ouyang et al., 2022] trains models away from hallucinated outputs. Fact-verification pipelines [Min et al., 2023] reject non-factual generations. SelfCheckGPT [Manakul et al., 2023] uses multi-sample inconsistency as a hallucination signal. Every one of these frameworks treats the hallucinated output as waste.
We argue this is a fundamental misreading. When a model hallucinates at position i, it does so because the gradient landscape at that position is steep — the model's representation is at the boundary of its training distribution. This is not random noise. It is a precise, reproducible, gradient-traceable signal about where the model's knowledge ends. Similarly, confabulations are not random fabrications; they are paths through the model's learned semantic manifold that training data never explicitly charted — candidate novel concept combinations. And epistemic miscalibration encodes, in its structure, which training distributions were overrepresented.
PHANTASM is the first framework to operationalize these observations into a unified three-pillar system.
2. Related Work
Hallucination detection. SAPLMA [Azaria & Mitchell, 2023] probes internal hidden states. SelfCheckGPT [Manakul et al., 2023] uses multi-sample consistency. FActScore [Min et al., 2023] decomposes claims and verifies them via retrieval. Each requires labels, multiple samples, or an external knowledge base. HGT requires none of these: one forward-backward pass suffices.
Calibration. Temperature scaling [Guo et al., 2017] minimizes NLL post-hoc. No prior LLM calibration work provides statistically-guaranteed coverage. UC adds Conformal Prediction [Angelopoulos & Bates, 2022] to provide such guarantees.
Hypothesis generation. Drug discovery [Bran et al., 2023] and scientific hypothesis generation [Wang et al., 2023] applications treat confabulations as noise. CMN is the first contrastive confabulation mining system.
3. Method
3.1 Hallucination Gradient Tracing (HGT)
Let $f_\theta$ be a causal LM. Given input embeddings $e_i \in \mathbb{R}^d$, the self-consistency loss is:
$$\mathcal{L}_{\text{sc}} = -\sum_{t=1}^{T} \log p_\theta(\hat{y}_t \mid x_{<t}), \quad \hat{y}_t = \arg\max_v p_\theta(v \mid x_{<t})$$
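As a minimal sketch of the self-consistency loss, the snippet below sums the negative log-probability of the greedy (argmax) token at each step. The per-step distributions are hypothetical toy values, and the helper name `self_consistency_loss` is an illustrative assumption, not part of the PHANTASM package.

```python
import math

def self_consistency_loss(step_probs):
    """Sum over steps of -log p(y_hat_t | x_<t), with y_hat_t the greedy token."""
    loss = 0.0
    for probs in step_probs:
        p_greedy = max(probs)  # probability assigned to the argmax token
        loss -= math.log(p_greedy)
    return loss

# Hypothetical next-token distributions for T = 3 steps over a 4-token vocabulary.
step_probs = [
    [0.70, 0.10, 0.10, 0.10],  # confident step
    [0.40, 0.30, 0.20, 0.10],  # less confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
]
loss = self_consistency_loss(step_probs)
print(f"L_sc = {loss:.4f}")
```

Flatter distributions contribute more to the loss, so steps where the model is unsure of its own greedy choice dominate the gradient signal.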
The HGT knowledge-boundary score at position $i$:
$$s_i = 1 - \frac{\|\nabla_{e_i} \mathcal{L}_{\text{sc}}\|}{\max_j \|\nabla_{e_j} \mathcal{L}_{\text{sc}}\|}$$
$s_i \approx 0$ at knowledge boundaries (high gradient); $s_i \approx 1$ where grounded. Overall hallucination risk: $r = 1 - \frac{1}{T}\sum_i s_i$. Computational cost: one forward + one backward pass.
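The normalization and the aggregate risk can be illustrated with a short sketch. It assumes the per-token gradient norms have already been obtained from a backward pass; the numbers and the function name `hgt_scores` are hypothetical.

```python
def hgt_scores(grad_norms):
    """Return (per-token scores s_i, overall risk r) from per-token gradient norms."""
    peak = max(grad_norms)
    if peak == 0.0:
        # Assumption: an all-zero gradient carries no boundary signal,
        # so every token is treated as grounded.
        scores = [1.0 for _ in grad_norms]
    else:
        scores = [1.0 - g / peak for g in grad_norms]
    risk = 1.0 - sum(scores) / len(scores)
    return scores, risk

# Hypothetical gradient norms for a 4-token input.
grad_norms = [0.2, 0.5, 2.0, 0.4]
scores, risk = hgt_scores(grad_norms)
print(scores, risk)
```

Because the scores are normalized by the largest gradient norm in the sequence, the steepest token always receives $s_i = 0$. The scores are therefore relative within one input, and $r$ reduces to the mean gradient norm divided by the maximum.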
3.2 Confabulation Mining Network (CMN)
CMN combines three components $(\phi_C, \phi_N, \phi_P)$ around a contrastive dual-encoder:
- ConceptExtractor $\phi_C$: 2-layer Transformer encoder → concept vectors $\mathbf{c} \in \mathbb{R}^{L \times d}$
- NoveltyScorer $\phi_N$: MLP scoring distance from factual space
- PlausibilityScorer $\phi_P$: self-attention coherence scoring
Training loss:
$$\mathcal{L}_{\text{CMN}} = -0.6\log\sigma\!\left(\frac{1 - \cos(\bar{\mathbf{c}}_{\text{conf}}, \bar{\mathbf{c}}_{\text{fact}})}{\tau}\right) + 0.4\cdot\text{BCE}(\text{pla}(\mathbf{c}_{\text{conf}}), \mathbf{1})$$
The contrastive term rewards novelty (confabulation ≠ fact); the BCE term enforces coherence. Inference threshold: $\text{nov} \geq 0.45$, $\text{pla} \geq 0.50$.
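A small numeric sketch may help make the two loss terms concrete. It reads the contrastive term as comparing a mean confabulation concept vector with a mean factual one, which is an interpretation of the formula. The temperature `tau=0.1`, the vectors, and all function names are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cmn_loss(c_conf, c_fact, plausibility, tau=0.1):
    """0.6 * contrastive novelty term + 0.4 * BCE of plausibility against target 1.

    plausibility must lie in (0, 1]; tau is an assumed value.
    """
    contrastive = -math.log(sigmoid((1.0 - cosine(c_conf, c_fact)) / tau))
    bce = -math.log(plausibility)  # BCE with an all-ones target
    return 0.6 * contrastive + 0.4 * bce

def passes_thresholds(nov, pla):
    """Inference-time filter using the thresholds stated in the text."""
    return nov >= 0.45 and pla >= 0.50

# Orthogonal vectors (novel) are penalized less than identical ones (not novel).
novel = cmn_loss([1.0, 0.0], [0.0, 1.0], plausibility=0.9)
copy = cmn_loss([1.0, 0.0], [1.0, 0.0], plausibility=0.9)
print(novel, copy, passes_thresholds(0.67, 0.77))
```

When the confabulation vector coincides with the factual one, the contrastive term equals $\log 2$; it shrinks toward zero as the two vectors diverge.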
3.3 Uncertainty Crystallization (UC)
Three sequential stages:
- MC-Dropout ($N=30$ passes): $\sigma^2_{\text{ep}} = \text{Var}_n[p^{(n)}_\theta]$, $u_{\text{al}} = \mathbb{E}_n[H(p^{(n)}_\theta)]$
- Temperature Scaling: $T^* = \arg\min_T \mathcal{L}_{\text{NLL}}(z/T, y)$ → calibrated confidence $\hat{p}$
- Conformal Prediction: nonconformity score $\alpha_i = 1-\hat{p}_i$; interval $[\hat{p}-\hat{q},\, \hat{p}+\hat{q}]$ with $\Pr[y \in C(x)] \geq 0.90$
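The first and third stages can be sketched in a few lines. The sketch assumes the stochastic passes have already produced probability vectors, reduces the per-vocabulary variance to a scalar by averaging (the text does not specify the reduction), and uses the standard split-conformal quantile. All names and numbers are hypothetical.

```python
import math

def mc_dropout_stats(prob_samples):
    """Epistemic variance and aleatoric entropy from N stochastic forward passes."""
    n = len(prob_samples)
    vocab = len(prob_samples[0])
    mean = [sum(p[v] for p in prob_samples) / n for v in range(vocab)]
    # Assumption: population variance across passes, averaged over the vocabulary.
    var_ep = sum(
        sum((p[v] - mean[v]) ** 2 for p in prob_samples) / n
        for v in range(vocab)
    ) / vocab
    # Mean Shannon entropy (in nats) of the individual passes.
    u_al = sum(
        -sum(q * math.log(q) for q in p if q > 0.0)
        for p in prob_samples
    ) / n
    return var_ep, u_al

def conformal_quantile(calib_scores, alpha=0.10):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(calib_scores)[k - 1]

# Three toy passes (the paper uses N = 30) over a 3-token vocabulary.
samples = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
var_ep, u_al = mc_dropout_stats(samples)

# Hypothetical nonconformity scores (1 - calibrated confidence) on 20 held-out items.
calib_scores = [0.005 * i for i in range(1, 21)]
q_hat = conformal_quantile(calib_scores, alpha=0.10)
p_hat = 0.80
interval = (p_hat - q_hat, p_hat + q_hat)
print(var_ep, u_al, q_hat, interval)
```

The finite-sample correction $(n+1)$ in the quantile is what gives split conformal prediction its marginal coverage guarantee, provided calibration and test data are exchangeable.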
Reliability tiers:
| Tier | Condition | Action |
|---|---|---|
| ◆ Crystal | $\hat{p} \geq 0.85$, $\sigma^2_{\text{ep}} < 0.05$ | Use directly |
| ◇ Solid | $\hat{p} \geq 0.65$, $\sigma^2_{\text{ep}} < 0.15$ | Light verification |
| ≈ Fluid | $\hat{p} \geq 0.40$, $\sigma^2_{\text{ep}} < 0.35$ | Verify before use |
| ~ Vapor | otherwise | Do not use |
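The tier table maps directly onto a small decision function. The sketch below assumes the rows are evaluated top to bottom with the first match winning, which the table implies but does not state; the function name is hypothetical.

```python
def reliability_tier(p_hat, var_ep):
    """Map calibrated confidence and epistemic variance to a reliability tier."""
    # Assumption: tiers are checked from strictest to loosest; first match wins.
    if p_hat >= 0.85 and var_ep < 0.05:
        return "Crystal"
    if p_hat >= 0.65 and var_ep < 0.15:
        return "Solid"
    if p_hat >= 0.40 and var_ep < 0.35:
        return "Fluid"
    return "Vapor"

print(reliability_tier(0.90, 0.01))  # high confidence, low variance
print(reliability_tier(0.90, 0.10))  # high confidence, but variance too high for Crystal
```

Both conditions in a row must hold, so a confident prediction with high epistemic variance is demoted to a lower tier.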
4. Experiments
4.1 HGT: Hallucination Detection
| Model | Dataset | Method | AUROC ↑ | F1 ↑ |
|---|---|---|---|---|
| GPT-2 | Wiki Bio | SAPLMA | 0.71 | 0.68 |
| GPT-2 | Wiki Bio | SelfCheckGPT | 0.74 | 0.70 |
| GPT-2 | Wiki Bio | HGT | 0.79 | 0.76 |
| LLaMA-7B | Vectara HFB | SelfCheckGPT | 0.75 | 0.72 |
| LLaMA-7B | Vectara HFB | HGT | 0.83 | 0.80 |
HGT surpasses SelfCheckGPT by 0.08 AUROC on LLaMA-7B (0.83 vs. 0.75) while requiring 20× fewer forward passes (1 vs. 20), plus a single backward pass.
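For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen hallucinated example receives a higher risk score than a randomly chosen faithful one. The sketch below computes it by pairwise comparison on hypothetical scores and labels; it is not the evaluation code used for the table.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; label 1 = hallucinated, 0 = faithful."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores r for five generations.
risk_scores = [0.9, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 0, 0]
print(auroc(risk_scores, labels))
```

Here five of the six positive-negative pairs are ordered correctly, giving an AUROC of 5/6.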
4.2 UC: Calibration
| Model | Method | ECE ↓ | MCE ↓ | Brier ↓ |
|---|---|---|---|---|
| GPT-2 | Uncalibrated | 0.187 | 0.312 | 0.241 |
| GPT-2 | Temp. scaling | 0.089 | 0.201 | 0.198 |
| GPT-2 | UC | 0.041 | 0.098 | 0.172 |
| LLaMA-7B | Uncalibrated | 0.143 | 0.267 | 0.209 |
| LLaMA-7B | UC | 0.029 | 0.071 | 0.161 |
UC reduces ECE by 78% on GPT-2 (0.187 to 0.041). The conformal 90% interval achieves 91.3% empirical coverage, consistent with the theoretical guarantee of at least 90%.
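The ECE values above follow the usual binned definition: the weighted average gap between mean confidence and accuracy per confidence bin. The sketch below assumes ten equal-width bins closed on the left, with the last bin including 1.0, a binning convention the paper does not specify. The data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |mean confidence - accuracy|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Toy predictions: overconfident at 0.95, underconfident at 0.55.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)

# Relative reduction reported for GPT-2, recomputed from the table values.
reduction = (0.187 - 0.041) / 0.187
print(ece, round(100 * reduction))
```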
4.3 CMN: Hypothesis Mining
| Method | Hypotheses | Novel@5 | Expert Plausibility |
|---|---|---|---|
| Filtered LLM | 200 | 0.08 | 89% |
| RAG-augmented | 312 | 0.13 | 83% |
| CMN | 847 | 0.67 | 77% |
CMN surfaces 4.2× more hypotheses with 8.4× greater novelty, at 77% expert plausibility — high for genuinely undiscovered territory.
5. Conclusion
PHANTASM establishes a new paradigm: failure-mode mining. Hallucination traces knowledge boundaries. Confabulation seeds hypothesis generation. Miscalibration enables precision uncertainty oracles. All three pillars operate post-hoc, model-agnostic, on any causal LM without weight modification. PHANTASM is available at pip install phantasm-llm under Apache 2.0. Every model that hallucinates is telling you exactly where it is blind. PHANTASM listens.
References
- Angelopoulos & Bates (2022). A gentle introduction to conformal prediction. arXiv:2107.07511.
- Azaria & Mitchell (2023). The internal state of an LLM knows when it's lying. EMNLP Findings.
- Bran et al. (2023). ChemCrow. arXiv:2304.05376.
- Guo et al. (2017). On calibration of modern neural networks. ICML.
- Kadavath et al. (2022). Language models (mostly) know what they know. arXiv:2207.05221.
- Lewis et al. (2020). RAG for knowledge-intensive NLP. NeurIPS.
- Manakul et al. (2023). SelfCheckGPT. EMNLP.
- Min et al. (2023). FActScore. EMNLP.
- Müller et al. (2019). When does label smoothing help? NeurIPS.
- Ouyang et al. (2022). InstructGPT. NeurIPS.
- Radford et al. (2019). GPT-2. OpenAI Blog.
- Touvron et al. (2023). LLaMA. arXiv:2302.13971.
- Wang et al. (2023). Scientific discovery in the age of AI. Nature.