⭐ Novel Architecture · May 2026 · Open Source

TemporalMesh
Transformer

Dynamic Graph Attention · Temporal Decay · Adaptive Depth Routing

The first transformer to simultaneously break all three assumptions every model since 2017 has shared — built from scratch in PyTorch, fully tested, open source.

📄
Zenodo Preprint · Open Access
TemporalMesh Transformer: Dynamic Graph Attention with Temporal Decay and Adaptive Depth Routing
Vigneshwar LK · DOI: 10.5281/zenodo.20287197 · May 2026
📄 Read the Paper ⭐ GitHub 🤗 HuggingFace 🚀 Live Demo
29.4
Full TMT Perplexity
(WikiText-2)
42.1
Vanilla Transformer
Perplexity (baseline)
30.2%
Perplexity Reduction
vs Vanilla
0.48×
Relative Compute
(2.1× efficiency gain)
226
Tests Passing
CI Verified
3
Core Innovations
Unified

vs every other transformer

The Difference

Every transformer since Vaswani et al. (2017) makes the same three assumptions. TMT breaks all three simultaneously — in a single unified forward pass.

Assumption 1: Every token should attend to every other token equally. TMT: tokens connect only to the k most similar neighbours — and the graph rebuilds itself every layer.
Assumption 2: The attention graph is flat and fixed. TMT: the topology changes every forward pass based on current token representations.
Assumption 3: Every token deserves the same compute. TMT: easy tokens exit after ~5.5 layers on average; hard tokens use all 12.
Architecture Dynamic Graph Per-Token Depth Semantic Decay Persistent Memory Dual-Stream FFN All Combined
GPT / LLaMA ✗ None ✗ None ✗ None ✗ None ✗ None
Graph Transformer ⚠ Fixed only ✗ None ✗ None ✗ None ✗ None
Early Exit Transformer ✗ None ⚠ Classification only ✗ None ✗ None ✗ None
Mixture of Experts ✗ None ⚠ Layer-level only ✗ None ✗ None ✗ None
Perceiver IO ✗ None ✗ None ✗ None ⚠ Fixed latent ✗ None
TemporalMesh Transformer ✓ Dynamic kNN ✓ Per-token gate ✓ Learned decay ✓ EMA anchors ✓ Syntax+Semantic ✓ All 5

ablation study · WikiText-2 · 120M params

Benchmark Results

All 8 ablation configurations measured on WikiText-2 at equal parameter budget (120M). Results from the TMT-Benchmarks dataset.

Full TMT achieves PPL 29.4 vs Vanilla 42.1 — a 30.2% perplexity reduction — while using only 48% of compute (2.1× efficiency gain). Average exit layer drops from 12.0 to 5.5, meaning tokens use fewer than half the layers on average.
Configuration Mesh Decay Exit Val PPL ↓ Avg Layers Rel. Compute ↓ Δ PPL vs Vanilla
Vanilla Transformer 42.112.01.00×
Mesh Attention Only 37.812.00.62×-10.2%
Temporal Decay Only 40.312.00.98×-4.3%
Adaptive Exit Only 39.65.80.51×-5.9%
Mesh + Decay 34.212.00.61×-18.8%
Mesh + Exit 35.15.70.50×-16.6%
Decay + Exit 37.05.90.50×-12.1%
Full TMT (all 3) 29.4 5.5 0.48× -30.2%

core innovations

Three Ideas. One Model.

Each innovation independently improves quality or efficiency. Together they are synergistic — the combined gain exceeds the sum of parts.

Innovation 1 — Mesh Attention

Tokens are nodes in a graph. At every layer, MeshBuilder computes pairwise cosine similarity and keeps only the top-k edges per token. Attention flows exclusively along these edges.

edge_index = topk(cosine_sim(X), k=8) attn flows only along edges cost: O(S·k) vs O(S²)

Key distinction from Graph Transformers: the graph is NOT fixed. It rebuilds after every layer from the updated token representations. Topology is emergent.

Innovation 2 — Temporal Decay

Each token carries a learned decay scalar. Semantically distant tokens are attenuated — the decay is multiplied into attention weights, not added as a positional bias.

attn = softmax(QKᵀ/√d) × sigmoid(W_decay · t) t = normalised position ∈ [0,1] W_decay = learned per-head scalar

Key distinction from RoPE/ALiBi: decays based on learned semantic relevance, not just absolute position.

Innovation 3 — Adaptive Depth Routing

After each layer norm, a single linear → sigmoid gives each token a confidence score. If confidence exceeds the threshold, the token is frozen and skips remaining layers.

conf = sigmoid(W_gate · x) # per token if conf > 0.85: token.freeze() # skip remaining layers else: continue to next layer

Result: avg 5.5 layers used vs 12. That is 2.1× compute efficiency with lower perplexity.

🔀

Dual-Stream FFN

Instead of one FFN (d → 4d → d), two parallel half-width streams process syntax and semantic representations independently. A learned sigmoid gate fuses them.

syntax = down(act(up_s(x))) # d→256→d semantic = down(act(up_m(x))) # d→256→d gate = sigmoid(W_gate(x)) # ∈ (0,1) out = gate·syntax + (1-gate)·semantic
🧠

Memory Anchor Cross-Attention

16 persistent nn.Parameter vectors act as global memory anchors. Every token cross-attends to these anchors each layer. After each forward pass, anchors are updated by EMA.

Q = token_proj(x) # queries from tokens K,V = mem_proj(anchors) # keys/values from memory out = cross_attention(Q, K, V) # EMA memory update (no gradient) anchors = 0.9·anchors + 0.1·mean(x)

forward pass

Full Architecture

Every forward pass returns a TMTOutput dataclass — logits plus full diagnostic tensors for interpretability.

input_ids (B, S) -- integer token ids | v TokenEmbedding -> (B, S, D) learned + scaled by sqrt(d) | TemporalPositionEncoder -> (B, S, D) RoPE + learned decay scalars decay_scalars: (B, S, D) in (0,1) | MeshBuilder -> edge_index (2, E) kNN graph per batch item edge_weight (E,) cosine similarity weights | TMTLayer x 12 (each layer rebuilds the graph): |-- MeshAttention multi-head attention over graph edges only | temporal decay x attention weights | + residual |-- DualStreamFFN syntax stream (d=256) + semantic stream (d=256) | fused by learned sigmoid gate | + residual |-- ExitGate confidence = sigmoid(W*x) per token | freeze token if confidence > 0.85 | +-- MemoryAnchorCross cross-attend 16 persistent EMA parameter nodes + residual | Rebuild graph from updated token representations --> repeat | LayerNorm -> OutputProjection -> (B, S, vocab_size) weights tied to TokenEmbedding Returns TMTOutput dataclass: logits (B, S, V) <-- use for loss and generation exit_masks list[(B,S) bool] <-- per-layer exit decisions confidences list[(B,S) float] <-- per-token confidence scores graph_edges (edge_index, w) <-- final dynamic graph memory_state (16, D) <-- current memory anchor state decay_scalars (B, S, D) <-- temporal decay weights applied
ModuleInputOutputLearnable paramsRole
TokenEmbedding(B,S) int(B,S,D)vocab_size × DToken + scale init
TemporalPositionEncoder(B,S,D)(B,S,D)D + n_headsRoPE + decay scalars
MeshBuilder(B,S,D)edge_index, weights0Builds kNN graph
MeshAttention(B,S,D) + graph(B,S,D)4 × D²/n_headsGraph-constrained MHSA
DualStreamFFN(B,S,D)(B,S,D)4 × D × ffn_dim + DSyntax+semantic FFN
ExitGate(B,S,D)(B,S) boolD + 1Per-token early exit
MemoryAnchorCross(B,S,D) + (M,D)(B,S,D)3 × D² + M × DPersistent memory cross-attn
OutputProjection(B,S,D)(B,S,V)tied to embeddingVocabulary logits

getting started

Install in 3 Steps

1

Clone & install

git clone https://github.com/vignesh2027/TemporalMesh-Transformer.git
cd TemporalMesh-Transformer
python3 -m venv .venv && source .venv/bin/activate
pip install torch einops && pip install -e .
2

Run all 226 tests

pytest tests/ -v
# Expected: 226 passed in ~3s
3

Forward pass in 5 lines

from tmt.model.config import TMTConfig
from tmt.model.model  import TMTModel
import torch

model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4))
out   = model(torch.randint(0, 50258, (1, 64)))

# Inspect all outputs:
print(out.logits.shape)                       # (1, 64, 50258)
print(out.exit_masks[-1].float().mean())     # fraction exited early
print(out.graph_edges[0].shape)             # dynamic graph edge_index
print(out.memory_state.shape)                 # (16, 256) anchor state

training

Train on WikiText-2

Full training pipeline included. Auto-downloads WikiText-2 from HuggingFace datasets.

Small config — runs on any laptop (~10 min CPU)
from tmt.model.config     import TMTConfig
from tmt.training.trainer import TMTTrainer, TrainConfig
from tmt.data.dataset     import load_text_dataset

cfg = TMTConfig(
    vocab_size=50258, d_model=256,
    n_heads=4, n_layers=4, graph_k=4,
    ffn_stream_dim=128, memory_anchors=8,
)
loaders = load_text_dataset('wikitext-2', batch_size=8)
trainer = TMTTrainer(cfg, TrainConfig(total_steps=500),
                     loaders['train'], loaders['validation'])
trainer.train()
Training output — what each field means
step=  50 | loss=8.76 | ce=8.79 | gate=-0.25 | exit=1.000 | lr=3e-4

loss  -- total = CE + 0.1 x gate_auxiliary
ce    -- cross-entropy next-token prediction
gate  -- auxiliary loss (negative = gates decisive)
exit  -- fraction of tokens that exited early
lr    -- current learning rate (cosine warmup)

exit going 0.000 to 1.000 means the adaptive
depth routing has learned to fire reliably.
TMTOutput fields reference
FieldShapeTypeWhat it contains
logits(B, S, V)floatVocabulary distribution — use for loss and sampling
exit_maskslist[(B,S)]boolPer-layer early exit decisions — True = token froze here
confidenceslist[(B,S)]floatPer-token gate confidence scores ∈ (0,1)
graph_edges(2,E),(E,)long/floatFinal dynamic graph: edge_index and cosine weights
memory_state(M, D)float16 persistent memory anchor vectors after EMA update
decay_scalars(B, S, D)floatTemporal decay applied to each token dimension
Hardware requirements
Configd_modelLayersParametersVRAM / RAMTime (10k steps)Use case
Tiny (test)642~1M~256MB RAM~1 min CPUCI / unit tests
Small2564~16M~2GB RAM~10 min CPUQuick experiments
Medium5126~60M~6GB VRAM~45 min GPUAblation study
Full TMT (paper)51212~120M~12GB VRAM~2–3 hrs GPUPaper results

Apple Silicon (M1/M2/M3) detected automatically via MPS. CUDA detected automatically. CPU fallback always works.


experiments

Ablation Notebooks

Four Jupyter notebooks to reproduce the full ablation study — run in order.

Notebook 01

Vanilla Baseline

Standard GPT-style decoder at equal parameter budget. Control group. PPL baseline: 42.1.

Notebook 02

Mesh Attention Only

Isolates Innovation 1. Decay and exit disabled. PPL: 37.8. Compute: 0.62×.

Notebook 03

Full TMT

All 3 innovations active. The main result. PPL: 29.4. Compute: 0.48×.

Notebook 04

Comparison

All 8 ablation configs in one table + bar chart. Fill PPL values after running 01–03.

source .venv/bin/activate
pip install jupyter matplotlib pandas
jupyter notebook tmt/experiments/01_baseline.ipynb

related work

Literature Context

Where TMT sits relative to the five closest lines of prior work.

PaperYearCore IdeaTMT Relation
Vaswani et al. — Attention is All You Need2017Transformer baselineTMT base architecture
Su et al. — RoFormer (RoPE)2021Rotary positional encodingTMT uses RoPE, extends with learned decay
Velickovic et al. — Graph Attention Networks2018Attention on fixed graphsTMT uses dynamic topology, not fixed
Elbayad et al. — Depth-Adaptive Transformer2020Early exit for classificationTMT extends to generation, per-token
Graves — Adaptive Computation Time2016Halt RNN tokens earlyTMT is the transformer equivalent
Weston et al. — Memory Networks2015External memory for QATMT uses EMA-updated persistent anchors
Vigneshwar — TemporalMesh Transformer2026All five combined in one modelThis work

datasets

TMT-Benchmarks Dataset

135+ downloads. Evaluation and testing dataset for all TMT ablation experiments.

📊

ablation_reference

8 rows — canonical perplexity for all 8 ablation configurations. The gold standard results table.

📏

length_scaling

1,200 rows — O(S²) vs O(S·k) complexity comparison at S=32 through S=1024.

🧠

complexity_test

1,000 token complexity annotations with expected exit layer assignments.

🚪

exit_gate_reference

Exit layer distributions by token type — punctuation vs rare terms vs common words.

edge_case_inputs

15 boundary and adversarial inputs for robustness testing of all modules.

from datasets import load_dataset
ds = load_dataset("vigneshwar234/TMT-Benchmarks", "ablation_reference")
print(ds["test"].to_pandas())   # PPL 29.4 vs 42.1 ablation table

citation

Cite This Work

If TMT helps your research, please cite the Zenodo preprint.

@article{vigneshwar2026temporalmesh,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Decay and Adaptive Depth Routing},
  author  = {LK, Vigneshwar},
  journal = {Zenodo Preprint},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}
📄 DOI Link Zenodo Record ⭐ Star on GitHub 🤗 HuggingFace