⭐ Novel Architecture · May 2026 · Open Source

TemporalMesh Transformer

Dynamic Graph Attention · Temporal Decay · Adaptive Depth Routing

The first transformer to simultaneously break all three assumptions every model since 2017 has shared — built from scratch in PyTorch, fully tested, open source.

📄

Zenodo Preprint · Open Access

TemporalMesh Transformer: Dynamic Graph Attention with Temporal Decay and Adaptive Depth Routing

Vigneshwar LK · DOI: 10.5281/zenodo.20287197 · May 2026

vs every other transformer

The Difference

Every transformer since Vaswani et al. (2017) makes the same three assumptions. TMT breaks all three simultaneously — in a single unified forward pass.

Assumption 1: Every token attends to every other token. TMT: each token connects only to its k most similar neighbours, and the graph is rebuilt at every layer.
Assumption 2: The attention graph is flat and fixed. TMT: the topology changes on every forward pass, driven by the current token representations.
Assumption 3: Every token deserves the same compute. TMT: tokens exit after 5.5 of 12 layers on average; hard tokens use all 12.

Architecture	Dynamic Graph	Per-Token Depth	Semantic Decay	Persistent Memory	Dual-Stream FFN	All Combined
GPT / LLaMA	✗ None	✗ None	✗ None	✗ None	✗ None	✗
Graph Transformer	⚠ Fixed only	✗ None	✗ None	✗ None	✗ None	✗
Early Exit Transformer	✗ None	⚠ Classification only	✗ None	✗ None	✗ None	✗
Mixture of Experts	✗ None	⚠ Layer-level only	✗ None	✗ None	✗ None	✗
Perceiver IO	✗ None	✗ None	✗ None	⚠ Fixed latent	✗ None	✗
TemporalMesh Transformer	✓ Dynamic kNN	✓ Per-token gate	✓ Learned decay	✓ EMA anchors	✓ Syntax+Semantic	✓ All 5

ablation study · WikiText-2 · 120M params

Benchmark Results

All 8 ablation configurations measured on WikiText-2 at equal parameter budget (120M). Results from the TMT-Benchmarks dataset.

Full TMT achieves PPL 29.4 vs Vanilla 42.1 — a 30.2% perplexity reduction — while using only 48% of compute (2.1× efficiency gain). Average exit layer drops from 12.0 to 5.5, meaning tokens use fewer than half the layers on average.

Configuration	Mesh	Decay	Exit	Val PPL ↓	Avg Layers	Rel. Compute ↓	Δ PPL vs Vanilla
Vanilla Transformer	✗	✗	✗	42.1	12.0	1.00×	—
Mesh Attention Only	✓	✗	✗	37.8	12.0	0.62×	-10.2%
Temporal Decay Only	✗	✓	✗	40.3	12.0	0.98×	-4.3%
Adaptive Exit Only	✗	✗	✓	39.6	5.8	0.51×	-5.9%
Mesh + Decay	✓	✓	✗	34.2	12.0	0.61×	-18.8%
Mesh + Exit	✓	✗	✓	35.1	5.7	0.50×	-16.6%
Decay + Exit	✗	✓	✓	37.0	5.9	0.50×	-12.1%
Full TMT (all 3)	✓	✓	✓	29.4	5.5	0.48×	-30.2%
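
The Δ PPL and efficiency figures can be re-derived from the Val PPL and Rel. Compute columns. The sketch below is plain Python; the dictionary only restates the table above, and `ppl_change_pct` is a helper name introduced here for illustration, not part of the repository.

```python
# Validation perplexity and relative compute, copied from the ablation table.
results = {
    "Vanilla Transformer": (42.1, 1.00),
    "Mesh Attention Only": (37.8, 0.62),
    "Temporal Decay Only": (40.3, 0.98),
    "Adaptive Exit Only": (39.6, 0.51),
    "Mesh + Decay": (34.2, 0.61),
    "Mesh + Exit": (35.1, 0.50),
    "Decay + Exit": (37.0, 0.50),
    "Full TMT (all 3)": (29.4, 0.48),
}


def ppl_change_pct(ppl, baseline):
    """Relative perplexity change versus the baseline, in percent."""
    return 100.0 * (ppl - baseline) / baseline


baseline_ppl, _ = results["Vanilla Transformer"]
for name, (ppl, compute) in results.items():
    delta = ppl_change_pct(ppl, baseline_ppl)
    efficiency = 1.0 / compute  # inverse of relative compute
    print(f"{name:22s} dPPL={delta:+6.1f}%  efficiency={efficiency:.2f}x")
```

Rounded to one decimal place, each row matches the Δ PPL column, and 1 / 0.48 gives the quoted 2.1× figure for the full model.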

core innovations

Three Ideas. One Model.

Each innovation independently improves quality or efficiency. Together they are synergistic: the full model's 30.2% perplexity reduction exceeds the 20.4% sum of the three single-component reductions (10.2% + 4.3% + 5.9%).
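
Percentage reductions are not strictly additive, so one hedged way to check the synergy claim is to compare gains in log-perplexity (per-token cross-entropy, assuming natural logarithms). The perplexities below are taken from the ablation table; `nats_saved` is a name introduced here for illustration.

```python
import math

# Validation perplexities from the ablation table.
VANILLA = 42.1
SINGLE = {"mesh": 37.8, "decay": 40.3, "exit": 39.6}
FULL = 29.4


def nats_saved(ppl, baseline=VANILLA):
    """Drop in per-token cross-entropy (nats) implied by a perplexity change."""
    return math.log(baseline) - math.log(ppl)


sum_of_parts = sum(nats_saved(p) for p in SINGLE.values())
combined = nats_saved(FULL)
print(f"sum of single-component gains: {sum_of_parts:.3f} nats")
print(f"full TMT gain:                 {combined:.3f} nats")
```

If the three components contributed independently, the two numbers would be roughly equal; a larger combined gain is what the synergy claim predicts.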

⬡

Innovation 1 — Mesh Attention

Tokens are nodes in a graph. At every layer, MeshBuilder computes pairwise cosine similarity and keeps only the top-k edges per token. Attention flows exclusively along these edges.

edge_index = topk(cosine_sim(X), k=8)
attn flows only along edges
cost: O(S·k) vs O(S²)

Key distinction from Graph Transformers: the graph is NOT fixed. It rebuilds after every layer from the updated token representations. Topology is emergent.
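
As a rough illustration of the top-k step, the sketch below builds a kNN edge list from cosine similarity in plain Python. It is not the repository's MeshBuilder: `cosine_sim` and `build_knn_edges` are hypothetical helpers, self-edges are excluded by assumption, and no causal constraint is applied.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def build_knn_edges(tokens, k):
    """Return (src, dst, weight) edges; each token keeps its k most similar neighbours."""
    edges = []
    for i, x in enumerate(tokens):
        # ASSUMPTION: a token is not its own neighbour.
        sims = [(cosine_sim(x, y), j) for j, y in enumerate(tokens) if j != i]
        sims.sort(reverse=True)  # highest similarity first
        for weight, j in sims[:k]:
            edges.append((i, j, weight))
    return edges


# Four toy token vectors forming two obvious pairs.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = build_knn_edges(tokens, k=1)
print([(src, dst) for src, dst, _ in edges])
```

Calling `build_knn_edges` again on the updated token vectors after each layer gives the per-layer rebuild described above. This sketch scores every pair to find the neighbours, so graph construction is quadratic here; the O(S·k) figure counts the edges that attention then runs over.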

⏱

Innovation 2 — Temporal Decay

Each token carries a learned decay scalar. Semantically distant tokens are attenuated — the decay is multiplied into attention weights, not added as a positional bias.

attn = softmax(QKᵀ/√d) × sigmoid(W_decay · t)
t = normalised position ∈ [0,1]
W_decay = learned per-head scalar

Key distinction from RoPE/ALiBi: decays based on learned semantic relevance, not just absolute position.
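
A minimal sketch of the multiplicative form for one query and one head, in plain Python. Several details are assumptions made here, not confirmed by the source: `t` is taken to be the key's normalised position, `w_decay` is a single scalar, and the result is not renormalised.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decayed_attention(scores, w_decay):
    """scores: QK^T / sqrt(d) for one query over S keys. Returns decayed weights."""
    seq_len = len(scores)
    weights = softmax(scores)
    # ASSUMPTION: t is the key's normalised position in [0, 1].
    t = [j / (seq_len - 1) if seq_len > 1 else 0.0 for j in range(seq_len)]
    gates = [sigmoid(w_decay * tj) for tj in t]
    # The decay gate multiplies the attention weights; it is not added to the logits.
    return [w * g for w, g in zip(weights, gates)]


uniform = decayed_attention([0.0, 0.0, 0.0, 0.0], w_decay=0.0)
tilted = decayed_attention([0.0, 0.0, 0.0, 0.0], w_decay=2.0)
print(uniform)
print(tilted)
```

Because the gate is applied after the softmax, the decayed weights sum to less than 1 unless they are renormalised afterwards.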

⚡

Innovation 3 — Adaptive Depth Routing

After each layer norm, a single linear → sigmoid gives each token a confidence score. If confidence exceeds the threshold, the token is frozen and skips remaining layers.

conf = sigmoid(W_gate · x)      # per token
if conf > 0.85:
    token.freeze()              # skip remaining layers
else:
    continue to next layer

Result: avg 5.5 layers used vs 12. That is 2.1× compute efficiency with lower perplexity.
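
The routing loop can be sketched in plain Python with toy one-dimensional tokens. `route_tokens`, the toy layer, and the gate weights are all illustrative assumptions; only the sigmoid gate and the 0.85 threshold come from the description above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_tokens(hidden, layers, gate_w, gate_b=0.0, threshold=0.85):
    """Run tokens through layers, freezing each one once its gate confidence
    exceeds the threshold. Returns (final_hidden, layers_used_per_token)."""
    hidden = [list(h) for h in hidden]
    frozen = [False] * len(hidden)
    layers_used = [0] * len(hidden)
    for layer in layers:
        for i in range(len(hidden)):
            if frozen[i]:
                continue  # frozen tokens skip the remaining layers
            hidden[i] = layer(hidden[i])
            layers_used[i] += 1
            logit = sum(w * x for w, x in zip(gate_w, hidden[i])) + gate_b
            if sigmoid(logit) > threshold:
                frozen[i] = True
    return hidden, layers_used


# Toy setup: every layer adds 1.0 to the single feature, and the gate reads it directly.
toy_layers = [lambda h: [x + 1.0 for x in h]] * 12
_, layers_used = route_tokens([[1.0], [-2.0], [-20.0]], toy_layers, gate_w=[1.0])
print(layers_used, sum(layers_used) / len(layers_used))
```

In this toy run the first token exits after one layer, the second after four, and the third never becomes confident and uses all twelve.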

🔀

Dual-Stream FFN

Instead of one FFN (d → 4d → d), two parallel half-width streams process syntax and semantic representations independently. A learned sigmoid gate fuses them.

syntax   = down_s(act(up_s(x)))   # d→256→d
semantic = down_m(act(up_m(x)))   # d→256→d
gate     = sigmoid(W_gate(x))     # ∈ (0,1)
out      = gate·syntax + (1-gate)·semantic
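
A small plain-Python sketch of the gated fusion for a single token. It assumes ReLU for the unspecified activation, a separate down-projection per stream, and a scalar gate per token; `matvec` and `dual_stream_ffn` are hypothetical helpers, and the weight matrices are toy values.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dual_stream_ffn(x, up_s, down_s, up_m, down_m, w_gate):
    """Two parallel FFN streams fused by a learned sigmoid gate."""
    syntax = matvec(down_s, relu(matvec(up_s, x)))
    semantic = matvec(down_m, relu(matvec(up_m, x)))
    # ASSUMPTION: one scalar gate per token, computed from the token itself.
    gate = sigmoid(sum(w * xi for w, xi in zip(w_gate, x)))
    return [gate * s + (1.0 - gate) * m for s, m in zip(syntax, semantic)]


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
# Zero gate weights give gate = 0.5, so the output is the average of both streams.
out = dual_stream_ffn([1.0, -1.0], identity, identity, identity, doubled, w_gate=[0.0, 0.0])
print(out)
```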

🧠

Memory Anchor Cross-Attention

16 persistent nn.Parameter vectors act as global memory anchors. Every token cross-attends to these anchors each layer. After each forward pass, anchors are updated by EMA.

Q    = token_proj(x)              # queries from tokens
K, V = mem_proj(anchors)          # keys/values from memory
out  = cross_attention(Q, K, V)
# EMA memory update (no gradient)
anchors = 0.9·anchors + 0.1·mean(x)
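
The EMA step on its own can be sketched in plain Python. How `mean(x)` is broadcast across the anchors is not specified above; this sketch assumes the mean is taken over all tokens and applied identically to every anchor, and `ema_update` is a name introduced here.

```python
def ema_update(anchors, tokens, momentum=0.9):
    """anchors: M vectors, tokens: S vectors, all of width D. Returns new anchors."""
    width = len(tokens[0])
    # Mean token representation across the sequence.
    mean_x = [sum(t[j] for t in tokens) / len(tokens) for j in range(width)]
    # ASSUMPTION: every anchor is pulled toward the same mean.
    return [
        [momentum * a[j] + (1.0 - momentum) * mean_x[j] for j in range(width)]
        for a in anchors
    ]


anchors = [[0.0, 0.0], [1.0, 1.0]]
tokens = [[1.0, 3.0], [3.0, 5.0]]  # mean is [2.0, 4.0]
new_anchors = ema_update(anchors, tokens)
print(new_anchors)
```

Under this reading, the gap between any two anchors shrinks by the momentum factor on each update, so anchor diversity has to come from somewhere else, for example gradient updates to the parameters.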

forward pass

Full Architecture

Every forward pass returns a TMTOutput dataclass — logits plus full diagnostic tensors for interpretability.

input_ids (B, S)  -- integer token ids
    |
    v
TokenEmbedding           -> (B, S, D)   learned + scaled by sqrt(d)
    |
TemporalPositionEncoder  -> (B, S, D)   RoPE + learned decay scalars
                            decay_scalars: (B, S, D) in (0,1)
    |
MeshBuilder              -> edge_index (2, E)   kNN graph per batch item
                            edge_weight (E,)    cosine similarity weights
    |
TMTLayer x 12  (each layer rebuilds the graph):
    |-- MeshAttention      multi-head attention over graph edges only
    |                      temporal decay x attention weights
    |                      + residual
    |-- DualStreamFFN      syntax stream (d=256) + semantic stream (d=256)
    |                      fused by learned sigmoid gate
    |                      + residual
    |-- ExitGate           confidence = sigmoid(W*x) per token
    |                      freeze token if confidence > 0.85
    |
    +-- MemoryAnchorCross  cross-attend 16 persistent EMA parameter nodes
                           + residual
    |
    Rebuild graph from updated token representations --> repeat
    |
LayerNorm -> OutputProjection -> (B, S, vocab_size)   weights tied to TokenEmbedding

Returns TMTOutput dataclass:
    logits         (B, S, V)          <-- use for loss and generation
    exit_masks     list[(B,S) bool]   <-- per-layer exit decisions
    confidences    list[(B,S) float]  <-- per-token confidence scores
    graph_edges    (edge_index, w)    <-- final dynamic graph
    memory_state   (16, D)            <-- current memory anchor state
    decay_scalars  (B, S, D)          <-- temporal decay weights applied

Module	Input	Output	Learnable params	Role
TokenEmbedding	(B,S) int	(B,S,D)	vocab_size × D	Token + scale init
TemporalPositionEncoder	(B,S,D)	(B,S,D)	D + n_heads	RoPE + decay scalars
MeshBuilder	(B,S,D)	edge_index, weights	0	Builds kNN graph
MeshAttention	(B,S,D) + graph	(B,S,D)	4 × D²/n_heads	Graph-constrained MHSA
DualStreamFFN	(B,S,D)	(B,S,D)	4 × D × ffn_dim + D	Syntax+semantic FFN
ExitGate	(B,S,D)	(B,S) bool	D + 1	Per-token early exit
MemoryAnchorCross	(B,S,D) + (M,D)	(B,S,D)	3 × D² + M × D	Persistent memory cross-attn
OutputProjection	(B,S,D)	(B,S,V)	tied to embedding	Vocabulary logits

getting started

Install in 3 Steps

Clone & install

git clone https://github.com/vignesh2027/TemporalMesh-Transformer.git
cd TemporalMesh-Transformer
python3 -m venv .venv && source .venv/bin/activate
pip install torch einops && pip install -e .

Run all 226 tests

pytest tests/ -v
# Expected: 226 passed in ~3s

Forward pass in 5 lines

from tmt.model.config import TMTConfig
from tmt.model.model  import TMTModel
import torch

model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4))
out   = model(torch.randint(0, 50258, (1, 64)))

# Inspect all outputs:
print(out.logits.shape)                       # (1, 64, 50258)
print(out.exit_masks[-1].float().mean())     # fraction exited early
print(out.graph_edges[0].shape)             # dynamic graph edge_index
print(out.memory_state.shape)                 # (16, 256) anchor state

training

Train on WikiText-2

Full training pipeline included. Auto-downloads WikiText-2 from HuggingFace datasets.

Small config — runs on any laptop (~10 min CPU)

from tmt.model.config     import TMTConfig
from tmt.training.trainer import TMTTrainer, TrainConfig
from tmt.data.dataset     import load_text_dataset

cfg = TMTConfig(
    vocab_size=50258, d_model=256,
    n_heads=4, n_layers=4, graph_k=4,
    ffn_stream_dim=128, memory_anchors=8,
)
loaders = load_text_dataset('wikitext-2', batch_size=8)
trainer = TMTTrainer(cfg, TrainConfig(total_steps=500),
                     loaders['train'], loaders['validation'])
trainer.train()

Training output — what each field means

step=  50 | loss=8.76 | ce=8.79 | gate=-0.25 | exit=1.000 | lr=3e-4

loss  -- total = CE + 0.1 x gate_auxiliary
ce    -- cross-entropy next-token prediction
gate  -- auxiliary loss (negative = gates decisive)
exit  -- fraction of tokens that exited early
lr    -- current learning rate (warmup, then cosine decay)

exit rising from 0.000 to 1.000 means the
exit gates have learned to fire reliably.
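
The sample line can be checked against the stated loss formula with a few lines of plain Python. `parse_log_line` is a hypothetical helper written for this sketch; the trainer's own logging code may format things differently.

```python
LOG_LINE = "step=  50 | loss=8.76 | ce=8.79 | gate=-0.25 | exit=1.000 | lr=3e-4"


def parse_log_line(line):
    """Parse 'key=value | key=value' training output into a dict of floats."""
    fields = {}
    for part in line.split("|"):
        key, value = part.split("=")
        fields[key.strip()] = float(value.strip())
    return fields


fields = parse_log_line(LOG_LINE)
# Total loss as documented: cross-entropy plus 0.1 times the gate auxiliary term.
expected_loss = fields["ce"] + 0.1 * fields["gate"]
print(f"logged loss={fields['loss']:.2f}  recomputed={expected_loss:.3f}")
```

The recomputed value is 8.765, which agrees with the logged 8.76 to within the two-decimal precision of the log.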

TMTOutput fields reference

Field	Shape	Type	What it contains
`logits`	(B, S, V)	float	Vocabulary distribution — use for loss and sampling
`exit_masks`	list[(B,S)]	bool	Per-layer early exit decisions — True = token froze here
`confidences`	list[(B,S)]	float	Per-token gate confidence scores ∈ (0,1)
`graph_edges`	(2,E),(E,)	long/float	Final dynamic graph: edge_index and cosine weights
`memory_state`	(M, D)	float	16 persistent memory anchor vectors after EMA update
`decay_scalars`	(B, S, D)	float	Temporal decay applied to each token dimension
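
One use of `exit_masks` is to recover the average number of layers each token used. The sketch below works on nested Python lists standing in for the (B, S) boolean tensors (for real outputs, something like `[m.tolist() for m in out.exit_masks]` would produce them). It takes the first layer whose mask is True as the exit layer, which is an assumption about the mask semantics; `average_layers_used` is a name introduced here.

```python
def average_layers_used(exit_masks):
    """exit_masks: list over layers of (B, S) nested lists of bools.
    A token's depth is the first layer whose mask is True, else all layers."""
    n_layers = len(exit_masks)
    batch = len(exit_masks[0])
    seq_len = len(exit_masks[0][0])
    total = 0
    for b in range(batch):
        for s in range(seq_len):
            used = n_layers  # default: the token never exited early
            for layer, mask in enumerate(exit_masks):
                if mask[b][s]:
                    used = layer + 1
                    break
            total += used
    return total / (batch * seq_len)


# Three layers, one sequence of three tokens: exits after layer 1, layer 2, never.
masks = [
    [[True, False, False]],
    [[True, True, False]],
    [[True, True, False]],
]
print(average_layers_used(masks))
```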

Hardware requirements

Config	d_model	Layers	Parameters	VRAM / RAM	Time (10k steps)	Use case
Tiny (test)	64	2	~1M	~256MB RAM	~1 min CPU	CI / unit tests
Small	256	4	~16M	~2GB RAM	~10 min CPU	Quick experiments
Medium	512	6	~60M	~6GB VRAM	~45 min GPU	Ablation study
Full TMT (paper)	512	12	~120M	~12GB VRAM	~2–3 hrs GPU	Paper results

Apple Silicon (M1/M2/M3) detected automatically via MPS. CUDA detected automatically. CPU fallback always works.

experiments

Ablation Notebooks

Four Jupyter notebooks to reproduce the full ablation study — run in order.

Notebook 01

Vanilla Baseline

Standard GPT-style decoder at equal parameter budget. Control group. PPL baseline: 42.1.

Notebook 02

Mesh Attention Only

Isolates Innovation 1. Decay and exit disabled. PPL: 37.8. Compute: 0.62×.

Notebook 03

Full TMT

All 3 innovations active. The main result. PPL: 29.4. Compute: 0.48×.

Notebook 04

Comparison

All 8 ablation configs in one table + bar chart. Fill PPL values after running 01–03.

source .venv/bin/activate
pip install jupyter matplotlib pandas
jupyter notebook tmt/experiments/01_baseline.ipynb

related work

Literature Context

Where TMT sits relative to the six closest prior works.

Paper	Year	Core Idea	TMT Relation
Vaswani et al. — Attention is All You Need	2017	Transformer baseline	TMT base architecture
Su et al. — RoFormer (RoPE)	2021	Rotary positional encoding	TMT uses RoPE, extends with learned decay
Velickovic et al. — Graph Attention Networks	2018	Attention on fixed graphs	TMT uses dynamic topology, not fixed
Elbayad et al. — Depth-Adaptive Transformer	2020	Early exit in a translation decoder, per sequence or per token	TMT pairs per-token exit with dynamic graph attention
Graves — Adaptive Computation Time	2016	Halt RNN tokens early	TMT is the transformer equivalent
Weston et al. — Memory Networks	2015	External memory for QA	TMT uses EMA-updated persistent anchors
Vigneshwar — TemporalMesh Transformer	2026	All five combined in one model	This work

datasets

TMT-Benchmarks Dataset

135+ downloads. Evaluation and testing dataset for all TMT ablation experiments.

📊

ablation_reference

8 rows — canonical perplexity for all 8 ablation configurations. The gold standard results table.

📏

length_scaling

1,200 rows — O(S²) vs O(S·k) complexity comparison at S=32 through S=1024.
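
To make the scaling gap concrete, the sketch below counts query-key pairs per head for dense attention and for a k-edge mesh. It uses k=8 from the Mesh Attention description; the sequence lengths are illustrative doubling steps, since the dataset's actual grid is not stated here, and `scored_pairs` is a hypothetical helper. Only attention edges are counted, not the cost of building the graph.

```python
def scored_pairs(seq_len, k=None):
    """Query-key pairs scored per head: S*S when dense, S*k when each token keeps k edges."""
    if k is None:
        return seq_len * seq_len
    return seq_len * min(k, seq_len)


for s in (32, 64, 128, 256, 512, 1024):
    dense = scored_pairs(s)
    mesh = scored_pairs(s, k=8)
    print(f"S={s:5d}  dense={dense:8d}  mesh(k=8)={mesh:6d}  ratio={dense / mesh:6.1f}x")
```

The ratio is S / k, so the gap grows linearly with sequence length: 4× at S=32 and 128× at S=1024.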

🧠

complexity_test

1,000 token complexity annotations with expected exit layer assignments.

🚪

exit_gate_reference

Exit layer distributions by token type — punctuation vs rare terms vs common words.

⚠

edge_case_inputs

15 boundary and adversarial inputs for robustness testing of all modules.

from datasets import load_dataset
ds = load_dataset("vigneshwar234/TMT-Benchmarks", "ablation_reference")
print(ds["test"].to_pandas())   # PPL 29.4 vs 42.1 ablation table

citation

Cite This Work

If TMT helps your research, please cite the Zenodo preprint.

@article{vigneshwar2026temporalmesh,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Decay and Adaptive Depth Routing},
  author  = {LK, Vigneshwar},
  journal = {Zenodo Preprint},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}
