Dynamic Graph Attention · Temporal Decay · Adaptive Depth Routing
The first transformer to simultaneously break all three assumptions every model since 2017 has shared — built from scratch in PyTorch, fully tested, open source.
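Standard transformers since Vaswani et al. (2017) share three assumptions: attention runs over a fixed token topology, positional effects depend on position alone, and every token passes through every layer. TMT breaks all three in a single unified forward pass.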
| Architecture | Dynamic Graph | Per-Token Depth | Semantic Decay | Persistent Memory | Dual-Stream FFN | All Combined |
|---|---|---|---|---|---|---|
| GPT / LLaMA | ✗ None | ✗ None | ✗ None | ✗ None | ✗ None | ✗ |
| Graph Transformer | ⚠ Fixed only | ✗ None | ✗ None | ✗ None | ✗ None | ✗ |
| Early Exit Transformer | ✗ None | ⚠ Classification only | ✗ None | ✗ None | ✗ None | ✗ |
| Mixture of Experts | ✗ None | ⚠ Layer-level only | ✗ None | ✗ None | ✗ None | ✗ |
| Perceiver IO | ✗ None | ✗ None | ✗ None | ⚠ Fixed latent | ✗ None | ✗ |
| TemporalMesh Transformer | ✓ Dynamic kNN | ✓ Per-token gate | ✓ Learned decay | ✓ EMA anchors | ✓ Syntax+Semantic | ✓ All 5 |
All 8 ablation configurations measured on WikiText-2 at equal parameter budget (120M). Results from the TMT-Benchmarks dataset.
| Configuration | Mesh | Decay | Exit | Val PPL ↓ | Avg Layers | Rel. Compute ↓ | Δ PPL vs Vanilla |
|---|---|---|---|---|---|---|---|
| Vanilla Transformer | ✗ | ✗ | ✗ | 42.1 | 12.0 | 1.00× | — |
| Mesh Attention Only | ✓ | ✗ | ✗ | 37.8 | 12.0 | 0.62× | -10.2% |
| Temporal Decay Only | ✗ | ✓ | ✗ | 40.3 | 12.0 | 0.98× | -4.3% |
| Adaptive Exit Only | ✗ | ✗ | ✓ | 39.6 | 5.8 | 0.51× | -5.9% |
| Mesh + Decay | ✓ | ✓ | ✗ | 34.2 | 12.0 | 0.61× | -18.8% |
| Mesh + Exit | ✓ | ✗ | ✓ | 35.1 | 5.7 | 0.50× | -16.6% |
| Decay + Exit | ✗ | ✓ | ✓ | 37.0 | 5.9 | 0.50× | -12.1% |
| Full TMT (all 3) | ✓ | ✓ | ✓ | 29.4 | 5.5 | 0.48× | -30.2% |
Each innovation independently improves quality or efficiency. Together they are synergistic — the combined gain exceeds the sum of parts.
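The synergy claim can be checked directly from the table. The sketch below recomputes the relative perplexity change for each single-innovation row and compares their sum with the full model; the numbers are copied from the ablation table and nothing is re-measured.

```python
# Validation perplexities copied from the ablation table above.
VANILLA_PPL = 42.1
val_ppl = {"mesh": 37.8, "decay": 40.3, "exit": 39.6, "full": 29.4}


def relative_change(ppl: float, baseline: float = VANILLA_PPL) -> float:
    """Relative perplexity change vs. the vanilla baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


# Sum of the three single-innovation changes vs. the full-model change.
sum_of_parts = sum(relative_change(val_ppl[n]) for n in ("mesh", "decay", "exit"))
combined = relative_change(val_ppl["full"])

print(f"sum of single-innovation changes: {sum_of_parts:.1f}%")  # -20.4%
print(f"full TMT change:                  {combined:.1f}%")      # -30.2%
```

The full model's reduction (-30.2%) is larger in magnitude than the sum of the three individual reductions (-20.4%), which is what "synergistic" means here.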
Tokens are nodes in a graph. At every layer, MeshBuilder computes pairwise cosine similarity and keeps only the top-k edges per token. Attention flows exclusively along these edges.
Key distinction from Graph Transformers: the graph is NOT fixed. It rebuilds after every layer from the updated token representations. Topology is emergent.
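As a rough illustration of the graph-building step, the following dependency-free sketch keeps the top-k most cosine-similar neighbours per token. It is not the repository's `MeshBuilder`: the function names are hypothetical, self-edges are excluded by assumption, and causal masking is ignored.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def build_knn_edges(tokens, k):
    """Map each token index to its k most similar neighbours as (index, weight) pairs."""
    edges = {}
    for i, ti in enumerate(tokens):
        # Assumption: a token is never its own neighbour.
        sims = [(cosine(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        sims.sort(reverse=True)
        edges[i] = [(j, s) for s, j in sims[:k]]
    return edges


# Two clusters of toy 2-d "token representations".
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = build_knn_edges(tokens, k=1)
print({i: nbrs[0][0] for i, nbrs in edges.items()})  # {0: 1, 1: 0, 2: 3, 3: 2}
```

Calling `build_knn_edges` again on the representations produced by each layer would give the per-layer rebuild described above.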
Each token carries a learned decay scalar. Semantically distant tokens are attenuated — the decay is multiplied into attention weights, not added as a positional bias.
Key distinction from RoPE/ALiBi: the decay depends on learned semantic relevance, not only on positional distance.
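To make "multiplied into attention weights" concrete, here is a minimal sketch in plain Python. It assumes the decay factors lie in (0, 1] and that the weights are renormalised after scaling; the renormalisation and the function names are assumptions, not details confirmed by the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_attention(scores, decay):
    """Scale softmax weights by per-key decay factors, then renormalise (assumed)."""
    weights = softmax(scores)
    scaled = [w * d for w, d in zip(weights, decay)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Three keys with identical raw scores; only the decay factors differ.
scores = [2.0, 2.0, 2.0]
decay = [1.0, 0.5, 0.25]
weights = decayed_attention(scores, decay)
print([round(w, 3) for w in weights])  # [0.571, 0.286, 0.143]
```

An additive bias such as ALiBi's would instead shift `scores` before the softmax; the multiplicative form rescales the weights after it.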
After each layer norm, a single linear → sigmoid gives each token a confidence score. If confidence exceeds the threshold, the token is frozen and skips remaining layers.
Result for the full model: tokens use 5.5 of 12 layers on average, which is 0.48× the vanilla compute (about 2.1× cheaper) at lower perplexity (29.4 vs 42.1).
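The exit rule can be sketched for a handful of tokens as follows. This is an unbatched toy version under stated assumptions: `layers_used`, the stand-in layer `add_one`, and the threshold of 0.9 are all hypothetical, and the real model presumably operates on masked tensors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_confidence(token, weight, bias):
    """Single linear layer followed by a sigmoid: one confidence per token."""
    return sigmoid(sum(w * x for w, x in zip(weight, token)) + bias)


def layers_used(tokens, layer_fn, weight, bias, n_layers, threshold):
    """Run up to n_layers; freeze each token once its confidence exceeds threshold."""
    used = [0] * len(tokens)
    frozen = [False] * len(tokens)
    for _ in range(n_layers):
        for i, tok in enumerate(tokens):
            if frozen[i]:
                continue  # frozen tokens skip the remaining layers
            tokens[i] = layer_fn(tok)
            used[i] += 1
            if gate_confidence(tokens[i], weight, bias) > threshold:
                frozen[i] = True
    return used


def add_one(token):
    """Stand-in for a transformer layer: shifts every feature by 1."""
    return [x + 1.0 for x in token]


tokens = [[2.0], [-3.0]]
used = layers_used(list(tokens), add_one, weight=[1.0], bias=0.0,
                   n_layers=12, threshold=0.9)
print(used)                   # [1, 6]
print(sum(used) / len(used))  # 3.5
```

The "easy" token exits after one layer and the "hard" one after six, so the average depth falls well below the 12 layers available.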
Instead of one FFN (d → 4d → d), two parallel half-width streams process syntax and semantic representations independently. A learned sigmoid gate fuses them.
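A toy version of the two-stream layout is sketched below. It assumes ReLU activations, bias-free linear maps, and a scalar per-token gate that blends the streams as `g * syntax + (1 - g) * semantic`; none of these choices is confirmed by the source, and all names are hypothetical.

```python
import math


def linear(x, weight):
    """Bias-free linear map; weight has one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def relu(v):
    return [max(0.0, a) for a in v]


def stream(x, w_in, w_out):
    """One FFN stream: d -> hidden -> d."""
    return linear(relu(linear(x, w_in)), w_out)


def dual_stream_ffn(x, syntax, semantic, gate_weight):
    """Run both streams in parallel and fuse them with a sigmoid gate."""
    syn = stream(x, *syntax)
    sem = stream(x, *semantic)
    g = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(gate_weight, x))))
    return [g * a + (1.0 - g) * b for a, b in zip(syn, sem)]


# d = 2, hidden = 1 per stream; each stream passes through one input feature.
syntax = ([[1.0, 0.0]], [[1.0], [0.0]])
semantic = ([[0.0, 1.0]], [[0.0], [1.0]])
out = dual_stream_ffn([2.0, 4.0], syntax, semantic, gate_weight=[0.0, 0.0])
print(out)  # [1.0, 2.0]
```

With a zero gate weight the gate sits at 0.5, so the output is the plain average of the two streams; a trained gate would shift that balance per token.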
16 persistent nn.Parameter vectors act as global memory anchors. Every token cross-attends to these anchors each layer. After each forward pass, anchors are updated by EMA.
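The EMA update itself is a one-liner per anchor. In the sketch below, `summary` stands for whatever per-anchor statistic the model derives from the current batch, and the momentum of 0.9 is an arbitrary illustrative value; both are assumptions, since the source does not specify them.

```python
def ema_update(anchors, summary, momentum=0.9):
    """Move each anchor a small step towards its batch summary (exponential moving average)."""
    return [
        [momentum * a + (1.0 - momentum) * s for a, s in zip(anchor, target)]
        for anchor, target in zip(anchors, summary)
    ]


anchors = [[0.0, 0.0]]   # one 2-d anchor for illustration
summary = [[1.0, 2.0]]   # hypothetical batch summary for that anchor
anchors = ema_update(anchors, summary)
print([[round(a, 3) for a in row] for row in anchors])  # [[0.1, 0.2]]
```

A momentum close to 1 makes the anchors change slowly, which is what lets them act as memory across forward passes.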
Every forward pass returns a TMTOutput dataclass — logits plus full diagnostic tensors for interpretability.
| Module | Input | Output | Learnable params | Role |
|---|---|---|---|---|
| TokenEmbedding | (B,S) int | (B,S,D) | vocab_size × D | Token + scale init |
| TemporalPositionEncoder | (B,S,D) | (B,S,D) | D + n_heads | RoPE + decay scalars |
| MeshBuilder | (B,S,D) | edge_index, weights | 0 | Builds kNN graph |
| MeshAttention | (B,S,D) + graph | (B,S,D) | 4 × D²/n_heads | Graph-constrained MHSA |
| DualStreamFFN | (B,S,D) | (B,S,D) | 4 × D × ffn_dim + D | Syntax+semantic FFN |
| ExitGate | (B,S,D) | (B,S) bool | D + 1 | Per-token early exit |
| MemoryAnchorCross | (B,S,D) + (M,D) | (B,S,D) | 3 × D² + M × D | Persistent memory cross-attn |
| OutputProjection | (B,S,D) | (B,S,V) | tied to embedding | Vocabulary logits |
```bash
git clone https://github.com/vignesh2027/TemporalMesh-Transformer.git
cd TemporalMesh-Transformer
python3 -m venv .venv && source .venv/bin/activate
pip install torch einops && pip install -e .
```
```bash
pytest tests/ -v
# Expected: 226 passed in ~3s
```
```python
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4))
out = model(torch.randint(0, 50258, (1, 64)))

# Inspect all outputs:
print(out.logits.shape)                   # (1, 64, 50258)
print(out.exit_masks[-1].float().mean())  # fraction exited early
print(out.graph_edges[0].shape)           # dynamic graph edge_index
print(out.memory_state.shape)             # (16, 256) anchor state
```
Full training pipeline included. Auto-downloads WikiText-2 from HuggingFace datasets.
```python
from tmt.model.config import TMTConfig
from tmt.training.trainer import TMTTrainer, TrainConfig
from tmt.data.dataset import load_text_dataset

cfg = TMTConfig(
    vocab_size=50258,
    d_model=256,
    n_heads=4,
    n_layers=4,
    graph_k=4,
    ffn_stream_dim=128,
    memory_anchors=8,
)
loaders = load_text_dataset('wikitext-2', batch_size=8)
trainer = TMTTrainer(cfg, TrainConfig(total_steps=500), loaders['train'], loaders['validation'])
trainer.train()
```
```text
step= 50 | loss=8.76 | ce=8.79 | gate=-0.25 | exit=1.000 | lr=3e-4

loss -- total = CE + 0.1 x gate_auxiliary
ce   -- cross-entropy next-token prediction
gate -- auxiliary loss (negative = gates decisive)
exit -- fraction of tokens that exited early
lr   -- current learning rate (cosine warmup)
```

`exit` going from 0.000 to 1.000 means the adaptive depth routing has learned to fire reliably.
| Field | Shape | Type | What it contains |
|---|---|---|---|
| logits | (B, S, V) | float | Vocabulary distribution; use for loss and sampling |
| exit_masks | list[(B,S)] | bool | Per-layer early exit decisions; True = token froze here |
| confidences | list[(B,S)] | float | Per-token gate confidence scores ∈ (0,1) |
| graph_edges | (2,E),(E,) | long/float | Final dynamic graph: edge_index and cosine weights |
| memory_state | (M, D) | float | 16 persistent memory anchor vectors after EMA update |
| decay_scalars | (B, S, D) | float | Temporal decay applied to each token dimension |
| Config | d_model | Layers | Parameters | VRAM / RAM | Time (10k steps) | Use case |
|---|---|---|---|---|---|---|
| Tiny (test) | 64 | 2 | ~1M | ~256MB RAM | ~1 min CPU | CI / unit tests |
| Small | 256 | 4 | ~16M | ~2GB RAM | ~10 min CPU | Quick experiments |
| Medium | 512 | 6 | ~60M | ~6GB VRAM | ~45 min GPU | Ablation study |
| Full TMT (paper) | 512 | 12 | ~120M | ~12GB VRAM | ~2–3 hrs GPU | Paper results |
Apple Silicon (M1/M2/M3) detected automatically via MPS. CUDA detected automatically. CPU fallback always works.
Four Jupyter notebooks to reproduce the full ablation study — run in order.
Standard GPT-style decoder at equal parameter budget. Control group. PPL baseline: 42.1.
Isolates Innovation 1. Decay and exit disabled. PPL: 37.8. Compute: 0.62×.
All 3 innovations active. The main result. PPL: 29.4. Compute: 0.48×.
All 8 ablation configs in one table + bar chart. Fill PPL values after running 01–03.
```bash
source .venv/bin/activate
pip install jupyter matplotlib pandas
jupyter notebook tmt/experiments/01_baseline.ipynb
```
Where TMT sits relative to the six closest lines of prior work.
| Paper | Year | Core Idea | TMT Relation |
|---|---|---|---|
| Vaswani et al.: Attention Is All You Need | 2017 | Transformer baseline | TMT base architecture |
| Su et al.: RoFormer (RoPE) | 2021 | Rotary positional encoding | TMT uses RoPE, extends with learned decay |
| Veličković et al.: Graph Attention Networks | 2018 | Attention on fixed graphs | TMT uses dynamic topology, not fixed |
| Elbayad et al.: Depth-Adaptive Transformer | 2020 | Early exit for classification | TMT extends to generation, per-token |
| Graves: Adaptive Computation Time | 2016 | Halt RNN tokens early | TMT is the transformer equivalent |
| Weston et al.: Memory Networks | 2015 | External memory for QA | TMT uses EMA-updated persistent anchors |
| Vigneshwar: TemporalMesh Transformer | 2026 | All five combined in one model | This work |
TMT-Benchmarks (135+ downloads) is the evaluation and testing dataset for all TMT ablation experiments.
8 rows — canonical perplexity for all 8 ablation configurations. The gold standard results table.
1,200 rows — O(S²) vs O(S·k) complexity comparison at S=32 through S=1024.
1,000 token complexity annotations with expected exit layer assignments.
Exit layer distributions by token type — punctuation vs rare terms vs common words.
15 boundary and adversarial inputs for robustness testing of all modules.
```python
from datasets import load_dataset

ds = load_dataset("vigneshwar234/TMT-Benchmarks", "ablation_reference")
print(ds["test"].to_pandas())  # PPL 29.4 vs 42.1 ablation table
```
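The O(S²) vs O(S·k) comparison covered by the complexity benchmark can be previewed with simple counting. The sketch below tallies attention edges only, over the same S range of 32 to 1024; `k = 4` is borrowed from the `graph_k=4` training example and is an illustrative choice, not necessarily the benchmark's setting.

```python
def dense_pairs(seq_len):
    """Number of query-key pairs in full self-attention."""
    return seq_len * seq_len


def mesh_edges(seq_len, k):
    """Number of edges when each token keeps at most k neighbours."""
    return seq_len * min(k, seq_len)


K = 4  # illustrative neighbour count
rows = [(s, dense_pairs(s), mesh_edges(s, K)) for s in (32, 64, 128, 256, 512, 1024)]
for seq_len, dense, sparse in rows:
    print(f"S={seq_len:5d}  dense={dense:8d}  mesh={sparse:5d}  ratio={dense // sparse}x")
```

At S=1024 this gives 1,048,576 dense pairs against 4,096 mesh edges. The count covers attention edges only: building the graph from pairwise cosine similarity still compares all token pairs unless an approximate neighbour search is used.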
If TMT helps your research, please cite the Zenodo preprint.
```bibtex
@article{vigneshwar2026temporalmesh,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Decay and Adaptive Depth Routing},
  author  = {LK, Vigneshwar},
  journal = {Zenodo Preprint},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}
```