TemporalMesh Transformer (TMT) is the first transformer to combine dynamic graph topology, token-level adaptive compute, and temporal semantic decay in a single unified model, built from scratch in PyTorch.
A standard transformer makes three assumptions: all tokens are equally important, the sequence is flat, and every token deserves the same compute budget. TMT breaks all three at once: the graph topology is recomputed on every forward pass, tokens exit at different depths, and semantically distant tokens fade. No prior architecture does all three together.
Each column below is a capability that existing architectures lack or only partly provide, and that TMT adds.
| Architecture | Dynamic Graph | Per-Token Depth | Temporal Decay | Persistent Memory | Dual-Stream FFN |
|---|---|---|---|---|---|
| Standard Transformer (GPT) | ✗ No | ✗ No | ✗ No | ✗ No | ✗ No |
| Graph Transformer | ✗ Fixed only | ✗ No | ✗ No | ✗ No | ✗ No |
| Early Exit Transformer | ✗ No | ✓ Partial | ✗ No | ✗ No | ✗ No |
| RoPE / ALiBi | ✗ No | ✗ No | ✗ Position only | ✗ No | ✗ No |
| Mixture of Experts | ✗ No | ✗ Layer-level only | ✗ No | ✗ No | ✗ No |
| TemporalMesh Transformer | ✓ Dynamic | ✓ Per-token | ✓ Semantic | ✓ EMA anchors | ✓ Yes |
Each innovation solves a different fundamental limitation of the standard transformer.
Tokens are nodes in a graph. Edges are recomputed on every forward pass from cosine similarity, and each token is connected only to its top-k most similar neighbours, so attention costs O(S·k) instead of O(S²) for sequence length S. Because token representations change from layer to layer, the graph topology changes with them.
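The neighbour selection described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's implementation: the function names `topk_cosine_neighbours` and `mesh_attention` are hypothetical, the sketch is single-head with no learned projections, and it treats each token as its own neighbour. It also builds the full S×S similarity matrix to pick neighbours, so only the attention step itself is O(S·k) here.

```python
import torch
import torch.nn.functional as F


def topk_cosine_neighbours(x: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices (S, k) of each token's k most cosine-similar tokens.

    x has shape (S, d). Self-edges are allowed (an assumption of this sketch).
    """
    x_unit = F.normalize(x, dim=-1)       # unit-length rows
    sim = x_unit @ x_unit.T               # (S, S) cosine similarities
    return sim.topk(k, dim=-1).indices    # (S, k)


def mesh_attention(x: torch.Tensor, k: int) -> torch.Tensor:
    """Single-head attention restricted to each token's top-k neighbours."""
    S, d = x.shape
    idx = topk_cosine_neighbours(x, k)    # (S, k)
    neigh = x[idx]                        # (S, k, d) gathered neighbour states
    # Scaled dot product between each token and its k neighbours only.
    scores = (neigh @ x.unsqueeze(-1)).squeeze(-1) / d ** 0.5   # (S, k)
    weights = scores.softmax(dim=-1)      # normalise over the k neighbours
    return (weights.unsqueeze(-1) * neigh).sum(dim=1)           # (S, d)
```

Calling `mesh_attention(x, k)` with `k` equal to the sequence length reduces to ordinary dense attention over the same states, which gives a simple sanity check for the sparse path.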
Each token carries a learned decay scalar. Semantically distant tokens are attenuated before they reach the attention layer. The decay is multiplied into attention weights — not added as a positional bias.
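One way to picture a multiplicative decay is the sketch below. The exact formula is not given here, so everything specific is an assumption: the hypothetical function `decayed_attention_weights` uses cosine distance as the measure of semantic distance, an exponential decay `exp(-rate * distance)`, and a final row renormalisation. In a trained model the per-token rate would be a learned parameter; here it is passed in as a plain tensor.

```python
import torch
import torch.nn.functional as F


def decayed_attention_weights(x: torch.Tensor, decay_rate: torch.Tensor) -> torch.Tensor:
    """Attention weights scaled by a per-token semantic decay.

    x: (S, d) token states.
    decay_rate: (S,) non-negative scalars, one per token (assumed learned).
    """
    S, d = x.shape
    weights = (x @ x.T / d ** 0.5).softmax(dim=-1)   # (S, S) plain attention
    x_unit = F.normalize(x, dim=-1)
    distance = 1.0 - x_unit @ x_unit.T               # cosine distance, in [0, 2]
    # Key token j is attenuated according to its own rate, decay_rate[j].
    decay = torch.exp(-decay_rate.unsqueeze(0) * distance)
    weights = weights * decay                        # multiplied in, not added as a bias
    return weights / weights.sum(dim=-1, keepdim=True)   # rows sum to 1 again
```

With all rates set to zero the decay factor is 1 everywhere and the function returns ordinary softmax attention, so the decay can be ablated without changing the rest of the layer.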
After each layer, a single linear layer followed by a sigmoid gives each token a confidence score. If the confidence exceeds 0.85, that token freezes and skips all remaining layers. Easy tokens can exit after as few as 2 layers, while hard tokens use all 12, saving roughly 50% of compute.
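The exit rule above can be sketched as a small module. This is an illustration under stated assumptions rather than the repository's code: the class name `ExitGatedStack` is hypothetical, and each "layer" is a position-wise stand-in (linear plus GELU) so that frozen tokens can be skipped outright. In a real attention layer, frozen tokens would presumably still be visible to the others as keys and values.

```python
import torch
from torch import nn


class ExitGatedStack(nn.Module):
    """Layer stack in which each token stops updating once its gate is confident."""

    def __init__(self, d_model: int, n_layers: int, threshold: float = 0.85):
        super().__init__()
        # Stand-in layers (assumption): position-wise, so tokens are independent.
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_layers)]
        )
        # One linear -> sigmoid confidence gate per layer.
        self.gates = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_layers)])
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        """x: (S, d_model). Returns final states and the depth each token reached."""
        S = x.shape[0]
        active = torch.ones(S, dtype=torch.bool, device=x.device)
        exit_depth = torch.full((S,), len(self.layers), dtype=torch.long, device=x.device)
        for depth, (layer, gate) in enumerate(zip(self.layers, self.gates), start=1):
            if not active.any():
                break                                  # every token has exited
            x = x.clone()
            x[active] = layer(x[active])               # only active tokens are computed
            confidence = torch.sigmoid(gate(x)).squeeze(-1)
            exiting = active & (confidence > self.threshold)
            exit_depth[exiting] = depth                # record where the token froze
            active = active & ~exiting
        return x, exit_depth
```

The returned `exit_depth` tensor makes the compute saving measurable: its mean divided by the number of layers is the fraction of layer evaluations actually performed.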
Every forward pass returns a TMTOutput dataclass — logits plus full diagnostic tensors.
Four steps from zero to a running model.
git clone https://github.com/vignesh2027/TemporalMesh-Transformer.git
cd TemporalMesh-Transformer
Create and activate a virtual environment. This step is required on macOS with Homebrew Python.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v
from tmt.model.config import TMTConfig
from tmt.training.trainer import TMTTrainer, TrainConfig
from tmt.data.dataset import load_text_dataset

# Small model configuration for a quick first run
cfg = TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4)
# Data loaders keyed by split name
loaders = load_text_dataset('wikitext-2', seq_len=128, batch_size=8)
# Short 500-step training run with validation
trainer = TMTTrainer(cfg, TrainConfig(total_steps=500),
                     loaders['train'], loaders['validation'])
trainer.train()
Run these four notebooks in order to reproduce the ablation study and measure each innovation's contribution.
Standard GPT-style transformer. Same parameter budget. Control group for comparison.
Mesh attention active, decay and exit gate disabled. Isolates Innovation 1.
All three innovations active. The main experimental result.
Perplexity table + bar chart across all three configurations.
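For reference, perplexity is the exponential of the mean per-token cross-entropy (in nats). The helper below is a generic sketch of that definition; the name `perplexity` and its signature are hypothetical and are not taken from the notebooks.

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity of next-token predictions.

    logits: (N, vocab) unnormalised scores; targets: (N,) token ids.
    """
    loss = F.cross_entropy(logits, targets)   # mean negative log-likelihood
    return math.exp(loss.item())
```

A model that assigns uniform probability over a vocabulary of size V has perplexity V, which gives a useful upper reference point when comparing the three configurations.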
@misc{tmt2026,
title = {TemporalMesh Transformer: Dynamic Graph Attention with
Temporal Decay and Adaptive Depth Routing},
author = {Vignesh},
year = {2026},
url = {https://github.com/vignesh2027/TemporalMesh-Transformer}
}