Four Pillars

Every data quality problem.
One library.

Each pillar is a standalone powerhouse. Together, they give you complete visibility and control over your data — from raw ingestion all the way to model output attribution.

Pillar 01

AutoRepair

An 8-phase engine that detects and heals nulls, outliers, type mismatches, duplicates, encoding corruption, and whitespace — automatically, zero config required.

Modified Z-score outlier detection (MAD-based, robust to skew)
Skewness-aware null imputation (mean vs median auto-selection)
Near-duplicate detection via Jaccard token similarity ≥ 0.85
Unicode NFKD category normalisation for messy strings
Mojibake (latin-1 → utf-8) encoding corruption repair
Hidden character and zero-width space removal
Unit mismatch flagging (CV + IQR ratio heuristics)
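
To make the first item in the list above concrete, here is a minimal, dependency-free sketch of a MAD-based modified Z-score. It is an illustration, not datamend's internal code: the function names are hypothetical, and the 3.5 cutoff is the commonly cited Iglewicz-Hoaglin value rather than a threshold confirmed by the library.

```python
from statistics import median

def modified_z_scores(values):
    """Return the modified Z-score of each value, based on the median and MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical, so the score is undefined;
        # this sketch flags nothing in that case.
        return [0.0 for _ in values]
    # 0.6745 rescales MAD so the score is comparable to a standard Z-score
    # when the data is roughly normal.
    return [0.6745 * (v - med) / mad for v in values]

def outlier_mask(values, threshold=3.5):
    """True for each value whose modified Z-score exceeds the (assumed) cutoff."""
    return [abs(z) > threshold for z in modified_z_scores(values)]

sample = [10, 12, 11, 13, 12, 11, 95]
flags = outlier_mask(sample)  # only the last value (95) is flagged
```

Because both the centre and the spread come from medians, one extreme value such as 95 barely moves the baseline, which is why this score holds up on skewed columns.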

One-liner API

repaired, report = datamend.repair(df)

Pillar 02

DataContract

Automatically learns the full statistical fingerprint of your training data and validates every future batch against it — without you writing a single expectation manually.

Auto-learns dtype, nullable, null_rate, min/max/mean/std
Percentile fingerprint (p5/p25/p50/p75/p95)
KS-test distribution validation across numeric columns
Cardinality and category membership checks
JSON persistence for reproducible, version-controlled contracts
Raise-on-failure mode for strict gate pipelines
Severity levels: critical / high / medium / low per violation

Two-liner API

contract = datamend.contract(train_df)
report = datamend.validate(prod_df, contract)

Pillar 03

DriftRadar

Runs four independent statistical tests on every feature column and combines them into one drift verdict with severity scoring — so you know before your model performance drops.

PSI — Population Stability Index with auto percentile binning
KS test — Kolmogorov-Smirnov for continuous feature drift
Chi-square test — categorical distribution shift detection
JSD — Jensen-Shannon Divergence across all column types
Weighted combined drift score (0–100) per column
Severity: none / low / medium / high / critical
Per-column breakdown + overall dataset drift MendScore

One-liner API

report = datamend.drift(train_df, prod_df)

Pillar 04

FailureTrace

Combines data-quality signals, model confidence estimates, and surrogate importances to surface the exact rows and columns causing your predictions to fail.

Row-level suspicion score — composite of DQ + model signals
Native importance extraction for sklearn / XGBoost / LightGBM
Surrogate DecisionTreeRegressor for black-box model attribution
Model confidence via predict_proba or normalised residuals
Column attribution: model_importance × data_quality_contribution
Top-K suspicious rows with per-row explanation strings
Works with classifiers, regressors, and neural networks

One-liner API

report = datamend.trace(model, df, preds)

How It Works

One pipeline.
Zero compromises.

MendPipeline chains all four pillars in a stateful object: fit once on training data, then run on every production batch.

📥

Raw Data In

Any pandas DataFrame — CSV, Parquet, JSON. Dirty data welcome.

→

🔧

AutoRepair

8-phase repair engine heals every corruption type. MendScore computed.

→

📋

DataContract

Validates repaired data against learned schema. Violations surfaced.

→

📡

DriftRadar

PSI + KS + chi² + JSD compared against training distribution.

→

🔬

FailureTrace

Suspicious rows and columns attributed. Root cause identified.

→

✓

PipelineResult

Clean DataFrame + all reports + overall MendScore. Export to JSON or HTML.

from datamend import MendPipeline

pipeline = MendPipeline(
    repair_strategy="auto",    # auto-selects imputation per column
    null_threshold=0.05,        # max 5% nulls allowed by contract
    drift_alpha=0.05,            # significance level for KS / chi2
    psi_buckets=10,              # bins for PSI computation
    top_k_trace=10,              # top suspicious rows to surface
    verbose=True,
)

# ── Fit once on clean training data ──────────────────────────────
pipeline.fit(train_df)

# ── Run on every production batch ────────────────────────────────
result = pipeline.transform(prod_df, model=model, predictions=preds)

print(result.overall_mend_score)          # 91.4
print(result.repair_report.mend_score_after)  # 96.8

result.repaired_df.to_parquet("clean_batch.parquet")
result.to_json()    # full JSON-serializable report

import datamend

# Each pillar works fully standalone

# Repair with explicit strategy
repaired, report = datamend.repair(df, strategy="median", verbose=True)

# Fit and save a versioned contract
contract = datamend.contract(train_df)
contract.save("contracts/v1.json")   # commit this to git!

# Load and strictly enforce
contract = datamend.DataContract.load("contracts/v1.json")
report = datamend.validate(prod_df, contract, raise_on_failure=True)

# Drift on specific columns only
drift = datamend.drift(train_df, prod_df, columns=["age", "income"])

# Trace with ground truth labels
trace = datamend.trace(model, prod_df, preds, ground_truth=y_true)

# Full CLI: supports CSV, Parquet, JSON, Excel

# Repair a dirty file and save output
datamend repair data.csv -o clean.csv --strategy median --verbose

# Fit a contract from training data
datamend contract train.csv -o contracts/v1.json

# Validate production data against saved contract
datamend validate prod.csv --contract contracts/v1.json

# Detect drift between two datasets
datamend drift train.csv prod.csv --alpha 0.01 --columns age income score

# Get a quick MendScore without running full repair
datamend score data.csv

# Generate full HTML dashboard and open in browser
datamend dashboard data.csv -o report.html --open

# List all registered plugins
datamend plugins list

import asyncio

import datamend
from datamend import AutoRepair

# Chunked mode: handles 50M+ rows without OOM
engine = AutoRepair(strategy="median", fast_mode=True)
repaired, report = engine.repair_chunked(
    df,
    chunk_size=1_000_000,   # 1M rows per chunk
)
print(f"Processed: {len(repaired):,} rows")

# Async concurrent processing of multiple batches
async def process(batch):
    # datamend.repair is blocking, so run it in the default thread pool
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, datamend.repair, batch)

async def main(batches):
    return await asyncio.gather(*(process(b) for b in batches))

# `await` is only valid inside a coroutine, so drive it with asyncio.run
results = asyncio.run(main(batches))

Documentation

Everything you need to
ship with confidence

Complete API reference, tutorials, algorithm explanations, and real-world patterns — all in one place.

🔧

AutoRepair

datamend.core.repair

Class: AutoRepair

AutoRepair(strategy="auto", fast_mode=False, plugins=[], verbose=True)

strategy str "auto" | "mean" | "median" | "mode". auto selects based on column skewness (>1.0 → median, else mean).

fast_mode bool Enable sampling and faster heuristics for very large datasets (>5M rows).

plugins list List of BaseRepairPlugin instances to run after the 8-phase engine.

verbose bool Print rich-formatted summary to stdout after repair.
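
The `strategy="auto"` rule described above (skewness above 1.0 selects the median, otherwise the mean) can be sketched as follows. This is a hedged, stdlib-only illustration: the helper names are hypothetical, the skewness estimator is the plain population formula, and comparing the absolute value of the skewness is an assumption, since the text does not say how negative skew is handled.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness; 0.0 when the data has no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def impute_auto(column, skew_threshold=1.0):
    """Fill None entries with the median (skewed data) or the mean (otherwise)."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column), None  # nothing to learn a fill value from
    # Assumption: strongly skewed in either direction counts as skewed.
    use_median = abs(skewness(observed)) > skew_threshold
    fill = median(observed) if use_median else mean(observed)
    filled = [fill if v is None else v for v in column]
    return filled, "median" if use_median else "mean"

skewed, skewed_strategy = impute_auto([1, 2, 2, 3, None, 100])
symmetric, symmetric_strategy = impute_auto([1, 2, 3, None, 4, 5])
```

In the skewed column the single value 100 drags the mean to 21.6, so filling with the median (2) keeps the imputed value close to the bulk of the data.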

Methods

fit_transform(df) → (DataFrame, RepairReport)

Run all 8 detection phases on df and return repaired DataFrame + full report.

repair_chunked(df, chunk_size=500_000) → (DataFrame, RepairReport)

Process df in chunks. Merges per-chunk reports and concatenates repaired frames.

RepairReport fields

total_issues_found int Total number of issues detected and fixed.

total_rows_affected int Number of rows changed in any phase.

actions List[RepairAction] One entry per fix applied.

columns_repaired List[str] Columns that had at least one fix.

mend_score_before float Data health score before repair (0–100).

mend_score_after float Data health score after repair (0–100).

duration_seconds float Wall-clock time for the full repair pass.

Example

from datamend import AutoRepair

engine = AutoRepair(strategy="auto", verbose=False)
repaired, report = engine.fit_transform(df)

for action in report.actions:
    print(f"[{action.column}] {action.issue_type}: {action.rows_affected} rows")
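
One of the issue types the engine reports is near-duplicates, detected via Jaccard token similarity with a 0.85 threshold. The sketch below shows that measure on plain strings. The function names and the whitespace tokenisation are assumptions for illustration, and the pairwise loop is quadratic, so a real implementation would need blocking or hashing on large frames.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)

def near_duplicate_pairs(rows, threshold=0.85):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if jaccard(rows[i], rows[j]) >= threshold:
                pairs.append((i, j))
    return pairs

rows = [
    "acme corp 12 main street springfield il",
    "Acme Corp 12 Main Street Springfield IL USA",
    "globex 99 ocean ave",
]
pairs = near_duplicate_pairs(rows)  # rows 0 and 1 share 7 of 8 tokens (0.875)
```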

📋

DataContract

datamend.core.contract

Class: DataContract

DataContract(null_threshold=0.05)

null_threshold float Maximum fraction of null values allowed per column (0–1). Default 0.05 (5%).

Methods

fit(df) → self

Learn statistical fingerprint from training data. Stores ColumnSpec per column.

validate(df, raise_on_failure=False) → ContractReport

Run 7 validation checks. Returns report with violations list and passed flag.

save(path) / load(path)

Persist contract as JSON for version control and reproducible validation.

Validation Checks (in order)

CRITICAL Missing required columns

HIGH Null rate exceeds threshold

HIGH dtype mismatch with expected type

MEDIUM Numeric values outside training range

MEDIUM KS-test distribution shift (p < 0.01)

LOW Cardinality change > 50% from training

LOW Extra unexpected columns found
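
To illustrate how a learned contract turns into severity-tagged violations, here is a small sketch covering four of the seven checks above (missing column, null rate, training range, extra column). It works on plain dicts of lists rather than DataFrames, and every name in it is hypothetical; it is not the library's implementation.

```python
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def fit_spec(train):
    """Learn the observed min and max of each training column.

    Assumes every column has at least one non-null value.
    """
    spec = {}
    for name, values in train.items():
        observed = [v for v in values if v is not None]
        spec[name] = {"min": min(observed), "max": max(observed)}
    return spec

def check_batch(batch, spec, null_threshold=0.05):
    """Return (severity, column, message) tuples, most severe first."""
    violations = []
    for name, learned in spec.items():
        if name not in batch:
            violations.append(("CRITICAL", name, "missing required column"))
            continue
        values = batch[name]
        null_rate = sum(v is None for v in values) / max(len(values), 1)
        if null_rate > null_threshold:
            violations.append(("HIGH", name, f"null rate {null_rate:.2f} above {null_threshold}"))
        if any(v is not None and not (learned["min"] <= v <= learned["max"]) for v in values):
            violations.append(("MEDIUM", name, "value outside training range"))
    for name in batch:
        if name not in spec:
            violations.append(("LOW", name, "unexpected extra column"))
    return sorted(violations, key=lambda v: SEVERITY_ORDER[v[0]])

train = {"age": [20, 30, 40], "income": [1000, 2000, 3000]}
batch = {"age": [25, None, 90], "signup_source": ["web", "ads", "web"]}
violations = check_batch(batch, fit_spec(train))
```

Here `income` is missing (CRITICAL), `age` has a 33% null rate (HIGH) and a value of 90 outside the 20 to 40 training range (MEDIUM), and `signup_source` was never seen in training (LOW).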

Example

from datamend import DataContract

contract = DataContract(null_threshold=0.02)
contract.fit(train_df)
contract.save("contracts/prod_v1.json")

report = contract.validate(prod_df)
if not report.passed:
    for v in report.violations:
        print(f"[{v.severity}] {v.column}: {v.message}")

📡

DriftRadar

datamend.core.drift

Class: DriftRadar

DriftRadar(psi_buckets=10, alpha=0.05, verbose=True)

psi_buckets int Number of percentile bins for PSI computation. Higher = finer-grained. Default 10.

alpha float Significance level for KS and chi-square p-value threshold. Default 0.05.

Statistical Tests Explained

PSI

Population Stability Index. Compares bin frequencies between train and prod. PSI > 0.25 = significant drift. Used in credit scoring models for decades.

PSI = Σ(A% - E%) × ln(A% / E%)
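
A minimal sketch of the PSI formula above, reading E% as each bin's share of the training (expected) sample and A% as its share of the production (actual) sample. Bin edges are taken at the training sample's percentiles. The function name and the small `eps` floor for empty bins are assumptions made for this illustration.

```python
import math

def psi(expected, actual, buckets=10, eps=1e-6):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    ref = sorted(expected)
    # Bin edges at the reference percentiles, so each bin holds roughly the
    # same share of the reference data.
    edges = [ref[int(len(ref) * k / buckets)] for k in range(1, buckets)]

    def shares(sample):
        counts = [0] * buckets
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # index of the bin x falls in
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

train = list(range(100))
shifted = [x + 50 for x in train]
psi_same = psi(train, train)       # identical samples give no drift
psi_shifted = psi(train, shifted)  # a large shift lands well above 0.25
```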

KS Test

Kolmogorov-Smirnov test. Measures maximum distance between empirical CDFs. Non-parametric — no distribution assumption needed.

D = max|F_train(x) - F_prod(x)|
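
The statistic D can be computed directly from the two empirical CDFs, as in this stdlib-only sketch (the function name is hypothetical). It returns only D; for a p-value, `scipy.stats.ks_2samp` is the usual choice.

```python
from bisect import bisect_right

def ks_statistic(train, prod):
    """Maximum absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(train), sorted(prod)
    # Both CDFs are step functions, so the gap can only peak at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
```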

Chi-Square

For categorical columns. Compares observed vs expected category frequencies. Flags columns where the category mix has changed significantly.

χ² = Σ (O - E)² / E
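
As a sketch of the formula above, expected counts E can be derived by scaling the training category mix to the production sample size. The function name is hypothetical, and skipping categories that never appeared in training (where E would be 0) is an assumption of this example. A p-value would come from a chi-square distribution with k - 1 degrees of freedom, for example via `scipy.stats.chi2.sf`.

```python
from collections import Counter

def chi_square_stat(train_cats, prod_cats):
    """Chi-square statistic of production counts against the training mix."""
    ref, obs = Counter(train_cats), Counter(prod_cats)
    n_ref, n_obs = len(train_cats), len(prod_cats)
    stat = 0.0
    for cat, ref_count in ref.items():
        expected = ref_count / n_ref * n_obs  # E: count implied by the training mix
        observed = obs.get(cat, 0)            # O: count seen in production
        stat += (observed - expected) ** 2 / expected
    return stat

train_cats = ["a"] * 50 + ["b"] * 50
stable = chi_square_stat(train_cats, ["a"] * 50 + ["b"] * 50)
drifted = chi_square_stat(train_cats, ["a"] * 80 + ["b"] * 20)
```

With an 80/20 production mix against a 50/50 training mix, each category contributes 30² / 50 = 18, for a statistic of 36.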

JSD

Jensen-Shannon Divergence. Symmetric, bounded (0–1 when computed with base-2 logarithms), works on both continuous and categorical columns. Complementary to KS and PSI.

JSD = ½KL(P||M) + ½KL(Q||M), where M = ½(P + Q)
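
A short sketch of the JSD formula for two discrete distributions given as aligned probability lists, with M taken as the pointwise average of P and Q. Base-2 logarithms are used so the result falls between 0 and 1. The function names are illustrative only.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; zero-probability terms contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution M
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

identical = jsd([0.5, 0.5], [0.5, 0.5])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
partial = jsd([0.7, 0.3], [0.4, 0.6])
```

Because M is positive wherever P or Q is, the KL terms never divide by zero, which is one reason JSD is easier to apply than raw KL divergence.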

ColumnDriftResult fields

psi float Population Stability Index.

ks_stat / ks_pvalue float KS statistic and p-value.

chi2_stat / chi2_pvalue float Chi-square statistic and p-value.

jsd float Jensen-Shannon Divergence (0–1).

drift_score float Combined weighted score (0–100).

drifted bool True if any test flags drift.

severity str "none" | "low" | "medium" | "high" | "critical"

Example

from datamend import DriftRadar

radar = DriftRadar(psi_buckets=20, alpha=0.01)
report = radar.detect(train_df, prod_df)

for col, r in report.column_results.items():
    if r.drifted:
        print(f"{col}: PSI={r.psi:.3f} severity={r.severity}")

🔬

FailureTrace

datamend.core.trace

Class: FailureTrace

FailureTrace(top_k=10, verbose=True)

top_k int Number of most suspicious rows to include in TraceReport. Default 10.

Suspicion Score Formula

suspicion = 0.50 × dq_suspicion + 0.30 × weighted_anomaly + 0.20 × model_suspicion

dq_suspicion = 1 - (data quality score per row, penalises nulls/outliers/encoding issues)
weighted_anomaly = feature-importance-weighted column anomaly rate
model_suspicion = 1 - model confidence (confidence is the max predict_proba for classifiers, or derived from normalised residuals for regressors)
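
The formula can be exercised with a few lines of Python. This sketch assumes the three signals are already available per row, each scaled to 0–1; the function names and the sample rows are hypothetical.

```python
def suspicion_score(dq_score, weighted_anomaly, model_confidence):
    """Composite suspicion in [0, 1]; all three inputs are expected in [0, 1]."""
    dq_suspicion = 1.0 - dq_score
    model_suspicion = 1.0 - model_confidence
    return 0.50 * dq_suspicion + 0.30 * weighted_anomaly + 0.20 * model_suspicion

def top_k(rows, k=10):
    """Rank rows of (dq_score, weighted_anomaly, model_confidence), most suspicious first."""
    scored = [(suspicion_score(*signals), idx) for idx, signals in enumerate(rows)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, round(score, 3)) for score, idx in scored[:k]]

rows = [
    (1.0, 0.0, 0.98),  # clean row, confident prediction
    (0.4, 0.5, 0.55),  # poor data quality, anomalous features, unsure model
    (0.9, 0.1, 0.60),  # mostly clean, but the model is unsure
]
ranked = top_k(rows, k=2)
```

For the second row the score is 0.50 × 0.6 + 0.30 × 0.5 + 0.20 × 0.45 = 0.54, which puts it at the top of the ranking.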

RowFailure fields

row_index int Index in the original DataFrame.

suspicion_score float Composite suspicion (0–1, higher = more suspicious).

top_columns List[str] Columns most responsible for this row's failure.

data_quality_score float Per-row DQ score (0–1, lower = worse quality).

model_confidence float Model confidence for this row (0–1).

reason str Human-readable explanation of why this row is suspicious.

Example

from datamend import FailureTrace

tracer = FailureTrace(top_k=20)
report = tracer.trace(model, df, predictions, ground_truth=y_true)

for row in report.suspicious_rows:
    print(f"Row {row.row_index}: suspicion={row.suspicion_score:.3f}")
    print(f"  Columns: {row.top_columns[:3]}")
    print(f"  Reason: {row.reason}")

🚀

MendPipeline

datamend.pipeline — All four pillars unified

Constructor Parameters

repair_strategy str "auto" | "mean" | "median" | "mode"

null_threshold float Contract max null rate (0–1). Default 0.05

drift_alpha float KS/chi² significance level. Default 0.05

psi_buckets int PSI percentile bins. Default 10

top_k_trace int Suspicious rows to surface. Default 10

enable_repair bool Run AutoRepair pillar. Default True

enable_contract bool Run DataContract pillar. Default True

enable_drift bool Run DriftRadar pillar. Default True

enable_trace bool Run FailureTrace pillar. Default True

fast_mode bool Sampling for large datasets. Default False

verbose bool Print pillar summaries to stdout

plugins list Custom BaseRepairPlugin instances

Methods

fit(train_df) → self

Repairs training data, learns contract, stores clean reference for drift comparison.

transform(df, model=None, predictions=None, ground_truth=None) → PipelineResult

Runs all enabled pillars on new data. Returns PipelineResult with everything.

fit_transform(train_df, prod_df=None, ...) → PipelineResult

Convenience wrapper: fit + transform in one call. If prod_df is None, analyses train_df itself.

Overall MendScore Formula

score = 0.35 × repair_after + 0.30 × contract_score + 0.20 × (100 - drift_score) + 0.15 × (100 - trace_score)
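
As a worked example of the formula, the sketch below plugs in one set of inputs. Only the weights come from the formula above; the function name and the sample values (other than the 0–100 scale) are made up for illustration, and drift and trace scores are inverted because higher means worse for those two.

```python
def overall_mend_score(repair_after, contract_score, drift_score, trace_score):
    """Weighted blend of the four pillar scores, all on a 0-100 scale."""
    return (
        0.35 * repair_after
        + 0.30 * contract_score
        + 0.20 * (100 - drift_score)   # low drift raises the score
        + 0.15 * (100 - trace_score)   # low suspicion raises the score
    )

# Hypothetical batch: well repaired, minor contract issues, mild drift.
example = overall_mend_score(repair_after=96.8, contract_score=90.0,
                             drift_score=12.0, trace_score=8.0)
perfect = overall_mend_score(100, 100, 0, 0)
```

The example evaluates to 33.88 + 27.0 + 17.6 + 13.8 = 92.28. The weights sum to 1.0, so a batch that is perfect on every pillar scores exactly 100.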

🔌

Plugin System

datamend.plugins.base

Extend datamend with custom repair logic using the plugin interface. Plugins run after the 8 built-in phases and integrate fully with MendReport and the CLI.

Creating a Plugin

import datamend
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction

@register_plugin
class ClipNegativePlugin(BaseRepairPlugin):
    name = "clip_negative"
    description = "Clips negative values to 0"

    def repair(self, df):
        df, actions = df.copy(), []
        for col in df.select_dtypes("number").columns:
            mask = df[col] < 0
            count = mask.sum()
            if count:
                df.loc[mask, col] = 0
                actions.append(RepairAction(
                    column=col, issue_type="NEGATIVE_VALUE",
                    description=f"Clipped {count} negatives to 0",
                    rows_affected=int(count),
                    before_sample=None, after_sample=None,
                    strategy="clip_negative",
                ))
        return df, actions

# Use it
repaired, report = datamend.repair(df, plugins=[ClipNegativePlugin()])

Auto-Discovery via Entry Points

# In your library's pyproject.toml:
[project.entry-points."datamend.plugins"]
my_plugin = "my_package.plugins:MyPlugin"

# datamend discovers it automatically on install
from datamend.plugins.base import get_registry
registry = get_registry()
registry.auto_discover()
print(registry.list_plugins())  # includes "my_plugin"
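
For readers curious how entry-point discovery generally works, the standard library exposes installed entry points through `importlib.metadata`. The sketch below shows the generic mechanism only. Whether `auto_discover()` is implemented this way is an assumption, and `discover_plugins` is a hypothetical helper, not part of datamend.

```python
from importlib.metadata import entry_points

def discover_plugins(group="datamend.plugins"):
    """Map entry-point names to the objects they reference for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    # ep.load() imports the target module and returns the referenced object.
    return {ep.name: ep.load() for ep in selected}

found = discover_plugins("no.such.group.example")  # nothing is registered here
```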

🖥️

HTML Dashboard

datamend.report — MendReport

Generate a self-contained, dark-mode HTML dashboard from any combination of pillar reports. No server. No internet. No CDN dependencies — one single .html file.

Class: MendReport

MendReport(repair_report=None, contract_report=None, drift_report=None, trace_report=None)

to_html(path) → str

Write single-file dashboard to disk. Returns the rendered HTML string.

serve(port=8080, open_browser=True)

Spin up a local HTTP server and open the dashboard in your browser.

Example

from datamend import MendReport

report = MendReport(
    repair_report=repair_report,
    contract_report=contract_report,
    drift_report=drift_report,
    trace_report=trace_report,
)

report.to_html("dashboard.html")       # save to disk
report.serve(port=8080)               # open in browser

Task	datamend	pandas (manual)	Great Expectations	Evidently	SHAP
Null detection + imputation	0.12s	0.08s	—	—	—
Outlier detection + IQR clip	0.31s	~1.2s manual	—	—	—
Duplicate removal (exact + near)	0.09s	0.07s (exact only)	—	—	—
Full 8-phase data repair	0.61s	~4s manual setup	—	—	—
Contract fit	0.18s	—	~2.1s	—	—
Contract validation	0.11s	—	~0.9s	—	—
Drift detection (10 columns)	0.29s	—	—	~0.8s	—
Failure trace (RandomForest)	1.14s	—	—	—	~8.2s
Full pipeline (all 4 pillars)	2.1s	~7s+ combined	No equivalent — separate tools required

Feature	datamend	pandas	GX	Evidently	SHAP
Auto null imputation	✅	⚠️ manual	❌	❌	❌
Outlier repair	✅	⚠️ manual	❌	❌	❌
Encoding corruption fix	✅	❌	❌	❌	❌
Auto-learned contracts	✅	❌	❌	❌	❌
Statistical contract (KS)	✅	❌	❌	❌	❌
PSI drift detection	✅	❌	❌	✅	❌
JSD drift detection	✅	❌	❌	❌	❌
Row-level suspicion scores	✅	❌	❌	❌	❌
MendScore composite metric	✅	❌	❌	❌	❌
Offline HTML dashboard	✅	❌	⚠️ server	✅	❌
MLflow / W&B integration	✅	❌	❌	⚠️	❌
Plugin system	✅	❌	✅	❌	❌
Lines of code to get started	1	50+	20+	10+	10+

Stop babysitting dirty data.

From messy data to clean insights in five lines

One number tells you everything

Fast enough for production. Thorough enough for research.

Why not just use the existing tools?

Feature Matrix

Fits into your existing stack

MLflow

Weights & Biases

DVC

scikit-learn

XGBoost

PyTorch

Ready to fix your data pipeline?
