Results

HGT — Hallucination Detection

| Model | Dataset | Method | AUROC | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| GPT-2 | Wiki Bio | SAPLMA | 0.71 | 0.68 | 0.72 | 0.65 |
| GPT-2 | Wiki Bio | SelfCheckGPT | 0.74 | 0.70 | 0.75 | 0.66 |
| GPT-2 | Wiki Bio | PHANTASM HGT | 0.79 | 0.76 | 0.80 | 0.72 |
| LLaMA-7B | Vectara HFB | Fact-checker | 0.76 | 0.73 | 0.78 | 0.69 |
| LLaMA-7B | Vectara HFB | PHANTASM HGT | 0.83 | 0.80 | 0.85 | 0.76 |
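
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below is illustrative only and not part of the PHANTASM codebase; the helper name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, precision, recall, reported F1) copied from the HGT table.
rows = [
    ("GPT-2 / SAPLMA", 0.72, 0.65, 0.68),
    ("GPT-2 / SelfCheckGPT", 0.75, 0.66, 0.70),
    ("GPT-2 / PHANTASM HGT", 0.80, 0.72, 0.76),
    ("LLaMA-7B / Fact-checker", 0.78, 0.69, 0.73),
    ("LLaMA-7B / PHANTASM HGT", 0.85, 0.76, 0.80),
]

for name, precision, recall, reported in rows:
    recomputed = round(f1_from_pr(precision, recall), 2)
    print(f"{name}: recomputed F1 = {recomputed:.2f}, reported = {reported:.2f}")
```

Each recomputed value agrees with the reported F1 to two decimal places.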

UC — Calibration

| Model | Method | ECE ↓ | MCE ↓ |
|---|---|---|---|
| GPT-2 | Baseline | 0.187 | 0.312 |
| GPT-2 | Temperature scaling | 0.089 | 0.201 |
| GPT-2 | PHANTASM UC | 0.041 | 0.098 |
| LLaMA-7B | Baseline | 0.143 | 0.267 |
| LLaMA-7B | PHANTASM UC | 0.029 | 0.071 |
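
For readers unfamiliar with the two metrics, here is a minimal sketch of how expected calibration error (ECE) and maximum calibration error (MCE) are commonly computed. It assumes equal-width confidence bins and a default of 10 bins; the binning used to produce the numbers above is not stated here, and the function name `calibration_errors` is hypothetical.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) over equal-width confidence bins.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap  # gap weighted by bin size
        mce = max(mce, gap)            # worst single-bin gap
    return ece, mce


# Toy data (not from the experiments): four confident correct answers
# and two less confident answers, one of which is wrong.
ece, mce = calibration_errors(
    [0.95, 0.95, 0.95, 0.95, 0.65, 0.65],
    [True, True, True, True, True, False],
)
print(f"ECE = {ece:.4f}, MCE = {mce:.4f}")
```

ECE averages the per-bin gap between accuracy and confidence, weighted by bin size, while MCE reports the largest gap in any single bin, which is why MCE is never smaller than ECE in the table.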

CMN — Hypothesis Quality

| Domain | Method | Novelty@5 | Expert Acceptance Rate |
|---|---|---|---|
| Drug discovery | Filtered LLM | 0.08 | 7% |
| Drug discovery | PHANTASM CMN | 0.67 | 54% |
| Material science | Filtered LLM | 0.11 | 9% |
| Material science | PHANTASM CMN | 0.61 | 48% |

PHANTASM Score

The composite PHANTASM score combines all three pillars:

PHANTASM_score = 0.4 * HGT_F1 + 0.3 * CMN_Novelty@5 + 0.3 * (1 - UC_ECE)

| Model | PHANTASM Score |
|---|---|
| GPT-2 (baseline) | 0.421 |
| GPT-2 + PHANTASM | 0.701 |
| LLaMA-7B + PHANTASM | 0.774 |
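
The composite formula above can be sketched as a small function. This is an illustrative sketch, not the project's implementation: the function name `phantasm_score` is hypothetical, and because Novelty@5 is reported per domain rather than per model, the per-model novelty value is simply taken as an argument. The inputs in the example are placeholders, not values from the tables.

```python
def phantasm_score(hgt_f1: float, cmn_novelty_at_5: float, uc_ece: float) -> float:
    """Weighted composite of the three pillars.

    The weights (0.4, 0.3, 0.3) come from the formula above. ECE is
    inverted as (1 - ECE) so that better calibration raises the score.
    """
    return 0.4 * hgt_f1 + 0.3 * cmn_novelty_at_5 + 0.3 * (1.0 - uc_ece)


# Placeholder inputs chosen only to show the arithmetic:
# 0.4 * 0.5 + 0.3 * 0.5 + 0.3 * (1 - 0.1) = 0.20 + 0.15 + 0.27
print(round(phantasm_score(hgt_f1=0.5, cmn_novelty_at_5=0.5, uc_ece=0.1), 3))  # 0.62
```

Since the weights sum to 1 and every input lies in [0, 1], the score also lies in [0, 1], with 1.0 reached only at perfect F1, perfect novelty, and zero calibration error.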