Novel Architecture · 2026

TemporalMesh Transformer

The first transformer to combine dynamic graph topology, token-level adaptive compute, and temporal semantic decay into a single unified model — built from scratch in PyTorch.

3 Core Innovations · ~50% Avg Compute Saved · 15 Tests Passing · 9 Model Components

What Makes TMT Different

Standard transformers make three assumptions: all tokens are equally important, the sequence is flat, and every token deserves the same compute budget. TMT breaks all three at once: the graph topology is rebuilt on every forward pass, tokens exit at different depths, and semantically distant tokens fade. No prior architecture in the comparison below combines all three.


vs existing work

How TMT Compares

Each column below is a capability that existing architectures lack or only partly provide, and that TMT addresses.

| Architecture | Dynamic Graph | Per-Token Depth | Temporal Decay | Persistent Memory | Dual-Stream FFN |
| --- | --- | --- | --- | --- | --- |
| Standard Transformer (GPT) | ✗ No | ✗ No | ✗ No | ✗ No | ✗ No |
| Graph Transformer | ✗ Fixed only | ✗ No | ✗ No | ✗ No | ✗ No |
| Early Exit Transformer | ✗ No | ✓ Partial | ✗ No | ✗ No | ✗ No |
| RoPE / ALiBi | ✗ No | ✗ No | ✗ Position only | ✗ No | ✗ No |
| Mixture of Experts | ✗ No | ✗ Layer-level only | ✗ No | ✗ No | ✗ No |
| TemporalMesh Transformer | ✓ Dynamic | ✓ Per-token | ✓ Semantic | ✓ EMA anchors | ✓ Yes |

Core Innovations

Three Ideas. One Model.

Each innovation solves a different fundamental limitation of the standard transformer.

Mesh Attention

Tokens are nodes in a graph. Edges are recomputed on each forward pass from cosine similarity, and each token connects only to its top-k neighbours, so attention costs O(S·k) instead of O(S²). The graph topology itself changes as token meanings evolve through layers.

edge_index = topk(cosine_sim(X), k=8)
attention over edges only, not the full S×S matrix
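The graph construction described above can be sketched as follows. This is a minimal illustration, not the repository's MeshBuilder: the function name `build_topk_mesh`, the single-sequence input, and the [source, target] edge convention are all assumptions. Note that this sketch forms the dense S×S similarity matrix to pick neighbours, so only the attention step that follows would be O(S·k).

```python
import torch
import torch.nn.functional as F


def build_topk_mesh(x: torch.Tensor, k: int = 8):
    """Build a top-k cosine-similarity graph for one sequence (hypothetical helper).

    x: (S, D) token representations, with k < S.
    Returns edge_index (2, S*k) as [source, target] rows and edge_weight (S*k,).
    """
    S = x.size(0)
    x_norm = F.normalize(x, dim=-1)           # unit-length rows
    sim = x_norm @ x_norm.T                   # (S, S) cosine similarities
    sim.fill_diagonal_(float("-inf"))         # exclude self-loops from the top-k
    weight, neighbour = sim.topk(k, dim=-1)   # both (S, k)
    # Each token (target) receives edges from its k most similar tokens (sources).
    target = torch.arange(S, device=x.device).repeat_interleave(k)
    source = neighbour.reshape(-1)
    edge_index = torch.stack([source, target])  # (2, S*k)
    return edge_index, weight.reshape(-1)


if __name__ == "__main__":
    tokens = torch.randn(32, 64)
    edge_index, edge_weight = build_topk_mesh(tokens, k=8)
    print(edge_index.shape, edge_weight.shape)
```

Because the similarities depend on the current token states, calling this on each layer's output would yield a different edge set per layer.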

Temporal Decay

Each token carries a learned decay scalar, computed before the attention layers, that attenuates semantically distant tokens. The decay is multiplied into the attention weights rather than added as a positional bias.

attn = softmax(QKᵀ/√d) × sigmoid(W_decay · t)
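One way to read the formula above is as a per-key gate applied after the softmax. The sketch below assumes a single attention head, dense (not graph-restricted) attention, and that `t` is a (B, S, D) tensor of temporal features; the class name `DecayedAttention` and the `w_decay` layer are hypothetical stand-ins, not the repository's API.

```python
import math

import torch
import torch.nn as nn


class DecayedAttention(nn.Module):
    """Single-head attention whose weights are scaled by a learned per-key decay."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.w_decay = nn.Linear(d_model, 1)  # hypothetical stand-in for W_decay

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (B, S, D) token states; t: (B, S, D) temporal features (assumed shape)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # (B, S, S)
        attn = scores.softmax(dim=-1)
        # One decay value in (0, 1) per key token, broadcast over all queries.
        decay = torch.sigmoid(self.w_decay(t)).transpose(-2, -1)  # (B, 1, S)
        attn = attn * decay  # multiplicative, so rows no longer sum to 1
        return attn @ v


if __name__ == "__main__":
    layer = DecayedAttention(d_model=64)
    out = layer(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
    print(out.shape)
```

Multiplying after the softmax means a decayed key loses weight without the other keys being renormalised upward, which is the main difference from an additive bias applied to the scores.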

Adaptive Depth Routing

After each layer, a single linear layer followed by a sigmoid gives each token a confidence score. If the confidence exceeds 0.85, that token freezes and skips all remaining layers. Easy tokens can exit after 2 layers, while hard tokens use all 12, saving about 50% of compute on average.

conf = sigmoid(W_gate · x)
if conf > 0.85: token exits
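The freezing behaviour can be illustrated with a small sketch. The names `ExitGate` (reused from the architecture description) and `run_with_early_exit` are written here from scratch and are assumptions about the interface. For clarity the sketch still runs every layer on every token and masks the result, so it shows the exit semantics rather than the actual compute saving.

```python
import torch
import torch.nn as nn


class ExitGate(nn.Module):
    """Linear -> sigmoid confidence score per token, thresholded at 0.85 by default."""

    def __init__(self, d_model: int, threshold: float = 0.85):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        conf = torch.sigmoid(self.gate(x)).squeeze(-1)  # (B, S)
        return conf, conf > self.threshold


def run_with_early_exit(x: torch.Tensor, layers, gates):
    """Apply layers in order; tokens that have exited keep their frozen state."""
    exited = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
    exit_masks = []
    for layer, gate in zip(layers, gates):
        updated = layer(x)
        # Frozen tokens keep their old state; active tokens take the update.
        x = torch.where(exited.unsqueeze(-1), x, updated)
        _, wants_exit = gate(x)
        exited = exited | wants_exit  # an exit is permanent
        exit_masks.append(exited.clone())
    return x, exit_masks


if __name__ == "__main__":
    d = 32
    layers = [nn.Linear(d, d) for _ in range(4)]  # stand-ins for real TMT layers
    gates = [ExitGate(d) for _ in range(4)]
    out, masks = run_with_early_exit(torch.randn(2, 6, d), layers, gates)
    print(out.shape, [int(m.sum()) for m in masks])
```

A production version would gather only the still-active tokens before calling each layer, which is where the saving would come from.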

Architecture

How It Works

Every forward pass returns a TMTOutput dataclass — logits plus full diagnostic tensors.

input_ids (B, S)
    │
TokenEmbedding → (B, S, D)
    │
TemporalPositionEncoder → RoPE + learned decay scalars (B, S, D)
    │
MeshBuilder → edge_index (2, E), edge_weight (E,)
    │
TMTLayer × 12:
    ├── MeshAttention        graph-restricted + temporal decay
    ├── DualStreamFFN        syntax stream + semantic stream → learned gate
    ├── ExitGate             per-token confidence → freeze if > 0.85
    └── MemoryAnchorCross    cross-attend 16 persistent EMA anchors
    │
OutputProjection → (B, S, vocab_size)

Returns TMTOutput:
    logits          (B, S, V)
    exit_masks      list[(B, S) bool]    one per layer
    confidences     list[(B, S) float]   gate score per token
    graph_edges     (edge_index, edge_weight)
    memory_state    (16, D)
    decay_scalars   (B, S, D)
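As a rough picture of the returned structure, the sketch below declares a dataclass with the fields listed above and adds a hypothetical helper, `fraction_compute_saved`, that estimates skipped work from the exit masks. The field types are inferred from the listed shapes, and the helper assumes each mask is cumulative (True once a token has exited by the end of that layer); neither is confirmed against the repository.

```python
from dataclasses import dataclass

import torch


@dataclass
class TMTOutput:
    logits: torch.Tensor                            # (B, S, V)
    exit_masks: list[torch.Tensor]                  # one (B, S) bool mask per layer
    confidences: list[torch.Tensor]                 # one (B, S) float gate score per layer
    graph_edges: tuple[torch.Tensor, torch.Tensor]  # (edge_index, edge_weight)
    memory_state: torch.Tensor                      # (16, D)
    decay_scalars: torch.Tensor                     # (B, S, D)


def fraction_compute_saved(exit_masks: list[torch.Tensor]) -> float:
    """Fraction of (layer, token) pairs skipped, assuming cumulative masks.

    A token that has exited by the end of layer l skips layer l + 1, so the
    last layer's mask does not contribute any skipped work.
    """
    total = len(exit_masks) * exit_masks[0].numel()
    skipped = sum(int(m.sum()) for m in exit_masks[:-1])
    return skipped / total


if __name__ == "__main__":
    # Two tokens, four layers: token 0 exits after layer 2, token 1 never exits.
    masks = [
        torch.tensor([[False, False]]),
        torch.tensor([[True, False]]),
        torch.tensor([[True, False]]),
        torch.tensor([[True, False]]),
    ]
    print(fraction_compute_saved(masks))  # 2 of 8 layer-token pairs skipped: 0.25
```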

Getting Started

Install & Run

Five steps from zero to a running model.

1. Clone the repository

git clone https://github.com/vignesh2027/TemporalMesh-Transformer.git
cd TemporalMesh-Transformer

2. Create a virtual environment

Required on macOS with Homebrew Python.

python3 -m venv .venv
source .venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Verify: all 15 tests should pass

pytest tests/ -v

5. Train a model

from tmt.model.config import TMTConfig
from tmt.training.trainer import TMTTrainer, TrainConfig
from tmt.data.dataset import load_text_dataset

# Small model: 4 layers, 4 heads, 256-dimensional embeddings
cfg = TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4)
# Loaders are keyed by split name ('train', 'validation')
loaders = load_text_dataset('wikitext-2', seq_len=128, batch_size=8)
# Run 500 training steps with the train and validation loaders
trainer = TMTTrainer(cfg, TrainConfig(total_steps=500),
                     loaders['train'], loaders['validation'])
trainer.train()

Experiments

Ablation Notebooks

Run these four notebooks in order to reproduce the ablation study and measure each innovation's contribution.

Notebook 01: Vanilla Baseline

Standard GPT-style transformer. Same parameter budget. Control group for comparison.

Notebook 02: Mesh Only

Mesh attention active, decay and exit gate disabled. Isolates Innovation 1.

Notebook 03: Full TMT

All three innovations active. The main experimental result.

Notebook 04: Compare

Perplexity table + bar chart across all three configurations.
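For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The helper below is a generic sketch of that definition, not the notebooks' code; the function name `perplexity` and the tensor shapes are assumptions.

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean cross-entropy over all tokens).

    logits: (B, S, V) unnormalised scores; targets: (B, S) token ids.
    """
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*S, V)
        targets.reshape(-1),                  # (B*S,)
    )
    return math.exp(nll.item())


if __name__ == "__main__":
    # Uniform logits over V classes give a perplexity of V.
    vocab = 10
    logits = torch.zeros(2, 3, vocab)
    targets = torch.zeros(2, 3, dtype=torch.long)
    print(perplexity(logits, targets))
```

Computing this the same way for the baseline, mesh-only, and full configurations keeps the comparison consistent across notebooks.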


Citation

Cite This Work

@misc{tmt2026,
  title  = {TemporalMesh Transformer: Dynamic Graph Attention with
            Temporal Decay and Adaptive Depth Routing},
  author = {Vignesh},
  year   = {2026},
  url    = {https://github.com/vignesh2027/TemporalMesh-Transformer}
}