Pillar I — Hallucination Gradient Tracing (HGT)

Core idea

HGT rests on the premise that hallucinations tend to arise near the edges of the model's training distribution. At those positions, where the model is uncertain, the norm of the loss gradient with respect to the input embeddings is expected to spike. HGT measures this spike per token and per layer and converts it into a Competency Atlas.
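
To make the core idea concrete, here is a minimal sketch of the quantity involved. It uses a toy linear next-token predictor rather than a real transformer. The helper `embedding_grad_norm` and the weight matrix `W` are illustrative assumptions and are not part of phantasm; the actual HGT implementation may compute and aggregate gradients differently.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def embedding_grad_norm(W, embedding, target):
    """L2 norm of d(loss)/d(embedding) for logits = W @ embedding
    under a cross-entropy loss against the observed token `target`."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in W]
    probs = softmax(logits)
    # Gradient of cross-entropy w.r.t. the logits: probs - onehot(target)
    dlogits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Chain rule back to the input embedding: W^T @ dlogits
    grad = [sum(W[i][j] * dlogits[i] for i in range(len(W)))
            for j in range(len(embedding))]
    return math.sqrt(sum(g * g for g in grad))

# Toy setup (assumed): three tokens, three embedding dimensions.
W = [[4.0, 0.0, 0.0],
     [0.0, 4.0, 0.0],
     [0.0, 0.0, 4.0]]
embedding = [1.0, 0.0, 0.0]  # this input makes the toy model strongly predict token 0

expected = embedding_grad_norm(W, embedding, target=0)    # observed token matches the prediction
surprising = embedding_grad_norm(W, embedding, target=1)  # observed token contradicts it
print(f"expected: {expected:.3f}, surprising: {surprising:.3f}")
```

In this toy case the gradient norm is much larger when the observed token contradicts what the model predicts, which is the kind of spike HGT is described as looking for.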

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from phantasm.core.hgt import HallucinationGradientTracer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokens whose confidence falls below `threshold` are reported as boundary tokens.
tracer = HallucinationGradientTracer(model, tokenizer, threshold=0.35)
atlas = tracer.trace("The Eiffel Tower was built in 1492 by Napoleon.")

print(atlas.overall_hallucination_risk)   # float in [0, 1]
print(atlas.boundary_tokens)              # e.g. ['1492', 'Napoleon']
print(atlas.knowledge_gaps)               # e.g. [{'span': '1492', 'confidence': 0.09, ...}]

CompetencyAtlas fields

| Field | Type | Description |
| --- | --- | --- |
| token_scores | Tensor (seq_len,) | Per-token confidence. 1 = grounded, 0 = boundary. |
| layer_gradients | Tensor (n_layers, seq_len) | Raw gradient norms per layer. |
| boundary_tokens | list[str] | Tokens below the confidence threshold. |
| knowledge_gaps | list[dict] | Structured gap info with severity. |
| overall_hallucination_risk | float | Aggregate risk for the full text. |
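
The fields above could relate to one another roughly as in the sketch below. It is a guess at the data flow, not phantasm's implementation: `AtlasSketch` and `build_atlas_sketch` are hypothetical names, plain lists stand in for tensors, and both the min-max normalisation and the mean-based aggregate are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AtlasSketch:
    """Hypothetical stand-in for CompetencyAtlas, using lists instead of tensors."""
    token_scores: list
    boundary_tokens: list
    overall_hallucination_risk: float

def build_atlas_sketch(tokens, layer_gradients, threshold=0.35):
    # Average the per-layer gradient norms to get one value per token.
    n_layers = len(layer_gradients)
    per_token = [sum(layer[i] for layer in layer_gradients) / n_layers
                 for i in range(len(tokens))]
    # Assumed mapping: min-max normalise, then invert so that
    # large gradients give low confidence (1 = grounded, 0 = boundary).
    lo, hi = min(per_token), max(per_token)
    span = (hi - lo) or 1.0
    scores = [1.0 - (g - lo) / span for g in per_token]
    # Boundary tokens are those whose confidence falls below the threshold.
    boundary = [t for t, s in zip(tokens, scores) if s < threshold]
    # Assumed aggregate: mean lack of confidence over the sequence.
    risk = 1.0 - sum(scores) / len(scores)
    return AtlasSketch(scores, boundary, risk)

tokens = ["The", "tower", "was", "built", "in", "1492"]
layer_gradients = [[0.1, 0.2, 0.1, 0.2, 0.1, 0.9],   # made-up norms, layer 0
                   [0.3, 0.2, 0.1, 0.2, 0.3, 1.1]]   # made-up norms, layer 1
atlas_sketch = build_atlas_sketch(tokens, layer_gradients)
print(atlas_sketch.boundary_tokens, round(atlas_sketch.overall_hallucination_risk, 3))
```

Min-max scaling always pushes at least one token to a score of 0 whenever the gradients differ, so a real implementation would more likely normalise against a calibrated scale.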

Offline risk scoring (no model required)

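from phantasm.core.hgt import score_hallucination_risk

# Example inputs: the text to score and a trusted reference to compare it against
generation = "The Eiffel Tower was built in 1492 by Napoleon."
ref = "The Eiffel Tower was built between 1887 and 1889."

# Overlap-based (needs reference)
risk = score_hallucination_risk(generation, reference=ref, method="overlap")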

# Entropy-based (no reference needed)
risk = score_hallucination_risk(generation, method="entropy")
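
For intuition, an overlap-based score could be computed along the following lines. This is only a sketch of one plausible definition (the share of generated tokens that never appear in the reference); `overlap_risk` is a hypothetical helper, and what `score_hallucination_risk` actually computes may differ. The entropy method is not sketched, since the text above does not say which distribution the entropy is taken over.

```python
import re

def _tokens(text):
    # Lowercased word tokens; punctuation is dropped.
    return re.findall(r"\w+", text.lower())

def overlap_risk(generation, reference):
    """Fraction of generated tokens that never appear in the reference."""
    gen = _tokens(generation)
    if not gen:
        return 0.0
    ref_vocab = set(_tokens(reference))
    unsupported = sum(1 for t in gen if t not in ref_vocab)
    return unsupported / len(gen)

sample_generation = "The Eiffel Tower was built in 1492 by Napoleon."
sample_reference = ("The Eiffel Tower was built between 1887 and 1889 "
                    "by Gustave Eiffel's company.")
sample_risk = overlap_risk(sample_generation, sample_reference)
print(round(sample_risk, 3))
```

Here the tokens "in", "1492" and "napoleon" have no support in the reference, so three of the nine generated tokens count towards the risk.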