v0.1.0 is live on PyPI · 113 tests passing

Stop babysitting
dirty data.

datamend is the production-grade Python library that automatically repairs corrupt data, enforces statistical contracts, detects distribution drift, and traces model failures — in one unified API, zero configuration.

$ pip install datamend
113 Tests Passing
94% Coverage
4 Pillars
15+ Detectors
MIT License

Works seamlessly with

pandas · scikit-learn · XGBoost · LightGBM · PyTorch · MLflow · Weights & Biases · DVC
Quick Start

From messy data to clean insights
in five lines

datamend's top-level API is designed to be learned in minutes and relied on for years.

quickstart.py
import pandas as pd
import datamend

df = pd.read_csv("production_data.csv")   # 50k rows, real mess
# train_df, model, and predictions are assumed to exist already

# ── Pillar 1: Repair — heals nulls, outliers, duplicates, encoding ────
repaired, report = datamend.repair(df)
print(report.mend_score_after)      # 96.8/100  ✓
print(report.total_issues_found)    # 247

# ── Pillar 2: Contract — auto-learned schema validation ───────────────
contract = datamend.contract(train_df)
violations = datamend.validate(repaired, contract)
print(violations.passed)           # True  ✓

# ── Pillar 3: Drift — four statistical tests in one call ──────────────
drift = datamend.drift(train_df, repaired)
print(drift.columns_drifted)       # ['income']  ⚠

# ── Pillar 4: Trace — which rows and columns caused failures ──────────
trace = datamend.trace(model, repaired, predictions)
print(trace.suspicious_rows[:3])   # [1042, 887, 3310]  ⚠
Four Pillars

Every data quality problem.
One library.

Each pillar is a standalone powerhouse. Together, they give you complete visibility and control over your data — from raw ingestion all the way to model output attribution.

Pillar 01

AutoRepair

An 8-phase engine that detects and heals nulls, outliers, type mismatches, duplicates, encoding corruption, and whitespace — automatically, zero config required.

  • Modified Z-score outlier detection (MAD-based, robust to skew)
  • Skewness-aware null imputation (mean vs median auto-selection)
  • Near-duplicate detection via Jaccard token similarity ≥ 0.85
  • Unicode NFKD category normalisation for messy strings
  • Mojibake (latin-1 → utf-8) encoding corruption repair
  • Hidden character and zero-width space removal
  • Unit mismatch flagging (CV + IQR ratio heuristics)
One-liner API
repaired, report = datamend.repair(df)
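To make three of the bullets above concrete, here is a minimal pure-Python sketch of MAD-based modified Z-scores, Jaccard token similarity, and latin-1 → utf-8 mojibake repair. It is an illustration, not datamend's internal code: the function names are hypothetical, and the 3.5 outlier cut-off is a common convention rather than a documented datamend default.

```python
import statistics

def modified_z_scores(values):
    """MAD-based modified Z-score for each value (0.6745 scales MAD to sigma)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:  # no spread around the median: nothing to flag
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def jaccard(a, b):
    """Jaccard similarity between the lowercase whitespace tokens of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as latin-1; leave other text alone."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Outliers: one extreme value among otherwise tight readings
ages = [10, 11, 12, 11, 10, 12, 11, 95]
outliers = [v for v, z in zip(ages, modified_z_scores(ages)) if abs(z) > 3.5]

# Near-duplicates: 6 shared tokens out of 7 distinct ones
similarity = jaccard("Acme Data Systems Ltd New York",
                     "acme data systems ltd new york office")

# Mojibake: simulate the corruption, then reverse it
broken = "café".encode("utf-8").decode("latin-1")
repaired_text = fix_mojibake(broken)
```

The mojibake round-trip is a heuristic: a string that is legitimately made of latin-1 characters and also happens to form valid UTF-8 bytes would be rewritten too, so a real engine would likely add further checks.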
Pillar 02

DataContract

Automatically learns the full statistical fingerprint of your training data and validates every future batch against it — without you writing a single expectation manually.

  • Auto-learns dtype, nullable, null_rate, min/max/mean/std
  • Percentile fingerprint (p5/p25/p50/p75/p95)
  • KS-test distribution validation across numeric columns
  • Cardinality and category membership checks
  • JSON persistence for reproducible, version-controlled contracts
  • Raise-on-failure mode for strict gate pipelines
  • Severity levels: critical / high / medium / low per violation
Two-liner API
contract = datamend.contract(train_df)
report = datamend.validate(prod_df, contract)
Pillar 03

DriftRadar

Runs four independent statistical tests on every feature column and combines them into one drift verdict with severity scoring — so you know before your model performance drops.

  • PSI — Population Stability Index with auto percentile binning
  • KS test — Kolmogorov-Smirnov for continuous feature drift
  • Chi-square test — categorical distribution shift detection
  • JSD — Jensen-Shannon Divergence across all column types
  • Weighted combined drift score (0–100) per column
  • Severity: none / low / medium / high / critical
  • Per-column breakdown + overall dataset drift MendScore
One-liner API
report = datamend.drift(train_df, prod_df)
Pillar 04

FailureTrace

Combines data-quality signals, model confidence estimates, and surrogate importances to surface the exact rows and columns causing your predictions to fail.

  • Row-level suspicion score — composite of DQ + model signals
  • Native importance extraction for sklearn / XGBoost / LightGBM
  • Surrogate DecisionTreeRegressor for black-box model attribution
  • Model confidence via predict_proba or normalised residuals
  • Column attribution: model_importance × data_quality_contribution
  • Top-K suspicious rows with per-row explanation strings
  • Works with classifiers, regressors, and neural networks
One-liner API
report = datamend.trace(model, df, preds)
How It Works

One pipeline.
Zero compromises.

MendPipeline chains all four pillars in a stateful object: fit once on training data, then run on every production batch (about 2 seconds for a 100,000-row batch in the benchmarks below).

📥

Raw Data In

Any pandas DataFrame — CSV, Parquet, JSON. Dirty data welcome.

🔧

AutoRepair

8-phase repair engine heals every corruption type. MendScore computed.

📋

DataContract

Validates repaired data against learned schema. Violations surfaced.

📡

DriftRadar

PSI + KS + chi² + JSD compared against training distribution.

🔬

FailureTrace

Suspicious rows and columns attributed. Root cause identified.

PipelineResult

Clean DataFrame + all reports + overall MendScore. Export to JSON or HTML.

from datamend import MendPipeline

pipeline = MendPipeline(
    repair_strategy="auto",    # auto-selects imputation per column
    null_threshold=0.05,        # max 5% nulls allowed by contract
    drift_alpha=0.05,            # significance level for KS / chi2
    psi_buckets=10,              # bins for PSI computation
    top_k_trace=10,              # top suspicious rows to surface
    verbose=True,
)

# ── Fit once on clean training data ──────────────────────────────
pipeline.fit(train_df)

# ── Run on every production batch ────────────────────────────────
result = pipeline.transform(prod_df, model=model, predictions=preds)

print(result.overall_mend_score)          # 91.4
print(result.repair_report.mend_score_after)  # 96.8

result.repaired_df.to_parquet("clean_batch.parquet")
result.to_json()    # full JSON-serializable report
import datamend

# Each pillar works fully standalone

# Repair with explicit strategy
repaired, report = datamend.repair(df, strategy="median", verbose=True)

# Fit and save a versioned contract
contract = datamend.contract(train_df)
contract.save("contracts/v1.json")   # commit this to git!

# Load and strictly enforce
contract = datamend.DataContract.load("contracts/v1.json")
report = datamend.validate(prod_df, contract, raise_on_failure=True)

# Drift on specific columns only
drift = datamend.drift(train_df, prod_df, columns=["age", "income"])

# Trace with ground truth labels
trace = datamend.trace(model, prod_df, preds, ground_truth=y_true)
# Full CLI — supports CSV, Parquet, JSON, Excel

# Repair a dirty file and save output
datamend repair data.csv -o clean.csv --strategy median --verbose

# Fit a contract from training data
datamend contract train.csv -o contracts/v1.json

# Validate production data against saved contract
datamend validate prod.csv --contract contracts/v1.json

# Detect drift between two datasets
datamend drift train.csv prod.csv --alpha 0.01 --columns age income score

# Get a quick MendScore without running full repair
datamend score data.csv

# Generate full HTML dashboard and open in browser
datamend dashboard data.csv -o report.html --open

# List all registered plugins
datamend plugins list
from datamend import AutoRepair

# Chunked mode — handles 50M+ rows without OOM
engine = AutoRepair(strategy="median", fast_mode=True)
repaired, report = engine.repair_chunked(
    df,
    chunk_size=1_000_000,   # 1M rows per chunk
)
print(f"Processed: {len(repaired):,} rows")

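# Async concurrent processing of multiple batches
import asyncio
import datamend

async def process(batch):
    # Run the blocking repair call in the default thread pool
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, datamend.repair, batch)

async def main(batches):
    # await is only valid inside a coroutine, so gather here
    return await asyncio.gather(*(process(b) for b in batches))

results = asyncio.run(main(batches))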
Documentation

Everything you need to
ship with confidence

Complete API reference, tutorials, algorithm explanations, and real-world patterns — all in one place.

🔧

AutoRepair

datamend.core.repair

Class: AutoRepair

AutoRepair(strategy="auto", fast_mode=False, plugins=[], verbose=True)
strategy (str): "auto" | "mean" | "median" | "mode". "auto" selects per column based on skewness (>1.0 → median, else mean).
fast_mode (bool): Enable sampling and faster heuristics for very large datasets (>5M rows).
plugins (list): List of BaseRepairPlugin instances to run after the 8-phase engine.
verbose (bool): Print rich-formatted summary to stdout after repair.
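The "auto" rule for strategy can be sketched in a few lines. This is a hedged illustration only: the helper names are hypothetical, the skewness estimator (third standardised moment) is an assumption, and taking the absolute value so that left-skewed columns also get the median is a choice made for this sketch.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def choose_fill_value(values, threshold=1.0):
    """Pick median for strongly skewed columns, mean otherwise (None = null)."""
    observed = [v for v in values if v is not None]
    if abs(skewness(observed)) > threshold:
        return "median", statistics.median(observed)
    return "mean", statistics.fmean(observed)

incomes = [30, 32, 35, None, 31, 33, 400]   # one extreme value skews the column
balanced = [1, 2, 3, None, 4, 5]            # symmetric column

income_choice = choose_fill_value(incomes)
balanced_choice = choose_fill_value(balanced)
```

The skewed column is filled with its median, so the single extreme value does not drag the imputed value upward; the symmetric column keeps the mean.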

Methods

fit_transform(df) → (DataFrame, RepairReport)

Run all 8 detection phases on df and return repaired DataFrame + full report.

repair_chunked(df, chunk_size=500_000) → (DataFrame, RepairReport)

Process df in chunks. Merges per-chunk reports and concatenates repaired frames.

RepairReport fields

total_issues_found (int): total number of issues detected and fixed
total_rows_affected (int): number of rows changed in any phase
actions (List[RepairAction]): one entry per fix applied
columns_repaired (List[str]): columns that had at least one fix
mend_score_before (float): data health score before repair (0–100)
mend_score_after (float): data health score after repair (0–100)
duration_seconds (float): wall-clock time for the full repair pass
Example
from datamend import AutoRepair

engine = AutoRepair(strategy="auto", verbose=False)
repaired, report = engine.fit_transform(df)

for action in report.actions:
    print(f"[{action.column}] {action.issue_type}: {action.rows_affected} rows")
📋

DataContract

datamend.core.contract

Class: DataContract

DataContract(null_threshold=0.05)
null_threshold (float): Maximum fraction of null values allowed per column (0–1). Default 0.05 (5%).

Methods

fit(df) → self

Learn statistical fingerprint from training data. Stores ColumnSpec per column.

validate(df, raise_on_failure=False) → ContractReport

Run 7 validation checks. Returns report with violations list and passed flag.

save(path) / load(path)

Persist contract as JSON for version control and reproducible validation.

Validation Checks (in order)

CRITICAL: Missing required columns
HIGH: Null rate exceeds threshold
HIGH: dtype mismatch with expected type
MEDIUM: Numeric values outside training range
MEDIUM: KS-test distribution shift (p < 0.01)
LOW: Cardinality change > 50% from training
LOW: Extra unexpected columns found
Example
from datamend import DataContract

contract = DataContract(null_threshold=0.02)
contract.fit(train_df)
contract.save("contracts/prod_v1.json")

report = contract.validate(prod_df)
if not report.passed:
    for v in report.violations:
        print(f"[{v.severity}] {v.column}: {v.message}")
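For intuition, four of the validation checks listed above can be sketched with plain dictionaries. This is a simplified stand-in, not the DataContract implementation: learn_spec and check_batch are hypothetical names, columns are plain lists with None for nulls, and only numeric ranges are learned.

```python
def learn_spec(columns):
    """Learn min/max per column from training data (dict: name -> list of values)."""
    spec = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        spec[name] = {"min": min(present), "max": max(present)}
    return spec

def check_batch(columns, spec, null_threshold=0.05):
    """Return (severity, column, message) tuples for a new batch."""
    violations = []
    for name in spec:
        if name not in columns:
            violations.append(("critical", name, "missing required column"))
    for name, values in columns.items():
        if name not in spec:
            violations.append(("low", name, "unexpected extra column"))
            continue
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values)
        if null_rate > null_threshold:
            violations.append(("high", name, f"null rate {null_rate:.0%} exceeds threshold"))
        if present and (min(present) < spec[name]["min"] or max(present) > spec[name]["max"]):
            violations.append(("medium", name, "values outside training range"))
    return violations

train = {"age": [25, 40, 31, 58], "income": [30, 45, 52, 61]}
prod = {"age": [29, None, 72, 33], "score": [1, 2, 3, 4]}

violations = check_batch(prod, learn_spec(train))
passed = not violations
```

Here the batch fails: income is missing, age has too many nulls and a value above the training maximum, and score was never seen in training.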
📡

DriftRadar

datamend.core.drift

Class: DriftRadar

DriftRadar(psi_buckets=10, alpha=0.05, verbose=True)
psi_buckets (int): Number of percentile bins for PSI computation. Higher = finer-grained. Default 10.
alpha (float): Significance level for KS and chi-square p-value threshold. Default 0.05.

Statistical Tests Explained

PSI

Population Stability Index. Compares bin frequencies between train and prod. PSI > 0.25 = significant drift. Used in credit scoring models for decades.

PSI = Σ(A% - E%) × ln(A% / E%)
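As a worked version of the PSI formula, the sketch below bins both samples on the training percentiles and sums the per-bin terms, where E% is the training share and A% the production share of each bin. The function name and the small eps floor for empty bins are assumptions of this sketch, not documented datamend behaviour.

```python
import bisect
import math
import statistics

def psi(expected, actual, buckets=10, eps=1e-6):
    """Population Stability Index with percentile bins taken from `expected`."""
    cuts = statistics.quantiles(expected, n=buckets)  # buckets - 1 interior edges

    def shares(values):
        counts = [0] * buckets
        for v in values:
            counts[bisect.bisect_right(cuts, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

train = list(range(100))
shifted = [v + 50 for v in train]   # production values moved up by 50

psi_same = psi(train, train)
psi_shifted = psi(train, shifted)
```

Identical samples give a PSI of zero, while the shifted sample lands far above the 0.25 threshold quoted above.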
KS Test

Kolmogorov-Smirnov test. Measures maximum distance between empirical CDFs. Non-parametric — no distribution assumption needed.

D = max|F_train(x) - F_prod(x)|
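The D statistic can be computed directly from the two empirical CDFs. This minimal sketch returns only the statistic, not a p-value; ks_statistic is a hypothetical helper name, and for a p-value a library routine such as scipy.stats.ks_2samp would be the usual choice.

```python
import bisect

def ks_statistic(train, prod):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(train), sorted(prod)
    d = 0.0
    for x in a + b:  # the maximum gap always occurs at an observed value
        f_a = bisect.bisect_right(a, x) / len(a)
        f_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_disjoint = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])
```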
Chi-Square

For categorical columns. Compares observed vs expected category frequencies. Flags columns where the category mix has changed significantly.

χ² = Σ (O - E)² / E
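A small worked example of the χ² sum, assuming the expected count E for each category is the training share scaled to the production row count. The helper name is hypothetical, and categories that appear only in production (E = 0) are skipped here, which a real detector would need to handle separately.

```python
from collections import Counter

def chi_square(train, prod):
    """Chi-square statistic of production category counts vs training shares."""
    train_counts, prod_counts = Counter(train), Counter(prod)
    stat = 0.0
    for category, n_train in train_counts.items():
        expected = n_train / len(train) * len(prod)
        observed = prod_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

train = ["a"] * 50 + ["b"] * 50
prod = ["a"] * 80 + ["b"] * 20   # category mix moved from 50/50 to 80/20

stat = chi_square(train, prod)
```

With two categories there is one degree of freedom, so the result is well above the 3.841 critical value for a 0.05 significance level.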
JSD

Jensen-Shannon Divergence. Symmetric, bounded (0–1), works on both continuous and categorical columns. Complementary to KS and PSI.

JSD = ½KL(P||M) + ½KL(Q||M), where M = ½(P + Q)
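The JSD formula translates almost line for line into code once M is taken as the midpoint distribution ½(P + Q). This sketch assumes P and Q are already binned into probability lists of equal length, and uses base-2 logarithms so the result stays within 0–1; the function names are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

jsd_same = jsd([0.5, 0.5], [0.5, 0.5])
jsd_disjoint = jsd([1.0, 0.0], [0.0, 1.0])
jsd_ab = jsd([0.7, 0.2, 0.1], [0.1, 0.3, 0.6])
jsd_ba = jsd([0.1, 0.3, 0.6], [0.7, 0.2, 0.1])
```

Because M is never zero where P or Q is positive, the divergence stays finite even when the two distributions do not overlap at all.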

ColumnDriftResult fields

psi (float): Population Stability Index
ks_stat / ks_pvalue (float): KS statistic and p-value
chi2_stat / chi2_pvalue (float): chi-square stat and p-value
jsd (float): Jensen-Shannon Divergence (0–1)
drift_score (float): combined weighted score (0–100)
drifted (bool): True if any test flags drift
severity (str): "none" | "low" | "medium" | "high" | "critical"
Example
from datamend import DriftRadar

radar = DriftRadar(psi_buckets=20, alpha=0.01)
report = radar.detect(train_df, prod_df)

for col, r in report.column_results.items():
    if r.drifted:
        print(f"{col}: PSI={r.psi:.3f} severity={r.severity}")
🔬

FailureTrace

datamend.core.trace

Class: FailureTrace

FailureTrace(top_k=10, verbose=True)
top_k (int): Number of most suspicious rows to include in TraceReport. Default 10.

Suspicion Score Formula

suspicion = 0.50 × dq_suspicion + 0.30 × weighted_anomaly + 0.20 × model_suspicion
dq_suspicion = 1 - data_quality_score (per-row score; nulls, outliers, and encoding issues lower it)
weighted_anomaly = feature-importance-weighted column anomaly rate
model_suspicion = 1 - model_confidence (for classifiers: 1 - max predict_proba)
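Plugging numbers into the formula shows how the three signals combine. The row ids and input values below are made up for illustration, and the helper is a hypothetical restatement of the formula rather than FailureTrace's own code.

```python
def suspicion(data_quality_score, weighted_anomaly, model_confidence):
    """Composite suspicion in 0-1, using the 0.50 / 0.30 / 0.20 weights."""
    dq_suspicion = 1 - data_quality_score
    model_suspicion = 1 - model_confidence
    return 0.50 * dq_suspicion + 0.30 * weighted_anomaly + 0.20 * model_suspicion

# row id -> (data_quality_score, weighted_anomaly, model_confidence)
rows = {
    101: (0.40, 0.50, 0.55),   # poor data quality, unsure model
    102: (0.95, 0.00, 0.90),   # clean row, confident model
    103: (0.80, 0.20, 0.60),
}

scores = {row_id: suspicion(*signals) for row_id, signals in rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)   # most suspicious first
```

Row 101 scores 0.50 × 0.60 + 0.30 × 0.50 + 0.20 × 0.45 = 0.54 and tops the ranking, while the clean, confidently predicted row 102 scores 0.045.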

RowFailure fields

row_index (int): index in the original DataFrame
suspicion_score (float): composite suspicion (0–1, higher = more suspicious)
top_columns (List[str]): columns most responsible for this row's failure
data_quality_score (float): per-row DQ score (0–1, lower = worse quality)
model_confidence (float): model confidence for this row (0–1)
reason (str): human-readable explanation of why this row is suspicious
Example
from datamend import FailureTrace

tracer = FailureTrace(top_k=20)
report = tracer.trace(model, df, predictions, ground_truth=y_true)

for row in report.suspicious_rows:
    print(f"Row {row.row_index}: suspicion={row.suspicion_score:.3f}")
    print(f"  Columns: {row.top_columns[:3]}")
    print(f"  Reason: {row.reason}")
🚀

MendPipeline

datamend.pipeline — All four pillars unified

Constructor Parameters

repair_strategy (str): "auto" | "mean" | "median" | "mode"
null_threshold (float): Contract max null rate (0–1). Default 0.05
drift_alpha (float): KS/chi² significance level. Default 0.05
psi_buckets (int): PSI percentile bins. Default 10
top_k_trace (int): Suspicious rows to surface. Default 10
enable_repair (bool): Run AutoRepair pillar. Default True
enable_contract (bool): Run DataContract pillar. Default True
enable_drift (bool): Run DriftRadar pillar. Default True
enable_trace (bool): Run FailureTrace pillar. Default True
fast_mode (bool): Sampling for large datasets. Default False
verbose (bool): Print pillar summaries to stdout
plugins (list): Custom BaseRepairPlugin instances

Methods

fit(train_df) → self

Repairs training data, learns contract, stores clean reference for drift comparison.

transform(df, model=None, predictions=None, ground_truth=None) → PipelineResult

Runs all enabled pillars on new data. Returns PipelineResult with everything.

fit_transform(train_df, prod_df=None, ...) → PipelineResult

Convenience wrapper: fit + transform in one call. If prod_df is None, analyses train_df itself.

Overall MendScore Formula

score = 0.35 × repair_after + 0.30 × contract_score + 0.20 × (100 - drift_score) + 0.15 × (100 - trace_score)
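A quick worked example of the formula, assuming all four inputs are on a 0–100 scale. Only repair_after = 96.8 is taken from the samples on this page; the contract, drift, and trace values are hypothetical, and the function is a restatement of the formula rather than library code.

```python
def overall_mend_score(repair_after, contract_score, drift_score, trace_score):
    """Weighted blend of the four pillar scores; drift and trace are inverted."""
    return (0.35 * repair_after
            + 0.30 * contract_score
            + 0.20 * (100 - drift_score)
            + 0.15 * (100 - trace_score))

# 33.88 + 27.0 + 17.6 + 13.8
example = overall_mend_score(repair_after=96.8, contract_score=90,
                             drift_score=12, trace_score=8)

# Perfect repair and contract scores with no drift or suspicion
best = overall_mend_score(100, 100, 0, 0)
```

The weights sum to 1, so the best case is exactly 100 and the example works out to 92.28.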
🔌

Plugin System

datamend.plugins.base

Extend datamend with custom repair logic using the plugin interface. Plugins run after the 8 built-in phases and integrate fully with MendReport and the CLI.

Creating a Plugin

import datamend
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction

@register_plugin
class ClipNegativePlugin(BaseRepairPlugin):
    name = "clip_negative"
    description = "Clips negative values to 0"

    def repair(self, df):
        df, actions = df.copy(), []
        for col in df.select_dtypes("number").columns:
            mask = df[col] < 0
            count = mask.sum()
            if count:
                df.loc[mask, col] = 0
                actions.append(RepairAction(
                    column=col, issue_type="NEGATIVE_VALUE",
                    description=f"Clipped {count} negatives to 0",
                    rows_affected=int(count),
                    before_sample=None, after_sample=None,
                    strategy="clip_negative",
                ))
        return df, actions

# Use it
repaired, report = datamend.repair(df, plugins=[ClipNegativePlugin()])

Auto-Discovery via Entry Points

# In your library's pyproject.toml:
[project.entry-points."datamend.plugins"]
my_plugin = "my_package.plugins:MyPlugin"

# datamend discovers it automatically on install
from datamend.plugins.base import get_registry
registry = get_registry()
registry.auto_discover()
print(registry.list_plugins())  # includes "my_plugin"
🖥️

HTML Dashboard

datamend.report — MendReport

Generate a self-contained, dark-mode HTML dashboard from any combination of pillar reports. No server. No internet. No CDN dependencies — one single .html file.

Class: MendReport

MendReport(repair_report=None, contract_report=None, drift_report=None, trace_report=None)
to_html(path) → str

Write single-file dashboard to disk. Returns the rendered HTML string.

serve(port=8080, open_browser=True)

Spin up a local HTTP server and open the dashboard in your browser.

Example
from datamend import MendReport

report = MendReport(
    repair_report=repair_report,
    contract_report=contract_report,
    drift_report=drift_report,
    trace_report=trace_report,
)

report.to_html("dashboard.html")       # save to disk
report.serve(port=8080)               # open in browser
MendScore

One number tells
you everything

MendScore is a composite 0–100 health metric computed on every run. It weights null rate, outlier rate, duplicate rate, and whitespace contamination into a single actionable score you can track over time, log to MLflow/W&B, and set threshold alerts on.

40%
Null rate penalty
25%
Outlier rate penalty
20%
Duplicate rate penalty
15%
Whitespace rate penalty
95–100Excellent — production-ready
85–94Good — acceptable for most models
70–84Fair — repair recommended
50–69Poor — repair required
0–49Critical — stop pipeline
MendScore after repair on a sample dirty dataset

Before repair
54.2
After repair
96.8
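The weights and bands above can be turned into a small scoring sketch. The page lists the penalty weights but not the exact formula, so the linear form used here (100 minus the weighted sum of rates) is an assumption, as are the function names; the band floors follow the table above.

```python
WEIGHTS = {"null": 0.40, "outlier": 0.25, "duplicate": 0.20, "whitespace": 0.15}
BANDS = [(95, "Excellent"), (85, "Good"), (70, "Fair"), (50, "Poor"), (0, "Critical")]

def mend_score(rates):
    """Assumed form: 100 * (1 - weighted sum of issue rates, each in 0-1)."""
    penalty = sum(weight * rates.get(name, 0.0) for name, weight in WEIGHTS.items())
    return 100 * (1 - penalty)

def band(score):
    """Map a 0-100 score to the first band whose floor it reaches."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Critical"

clean = mend_score({})                    # no issues at all
some_nulls = mend_score({"null": 0.10})   # 10% nulls, nothing else

before_band = band(54.2)   # sample score before repair
after_band = band(96.8)    # sample score after repair
```

With the sample scores shown above, the dataset moves from the Poor band before repair to Excellent after it.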
Performance

Fast enough for production.
Thorough enough for research.

Benchmarked on a 100,000-row × 20-column dataset · MacBook Pro M2 · Python 3.11 · average of 5 runs

Task · datamend · Comparison tool
Null detection + imputation · 0.12s · pandas (manual): 0.08s
Outlier detection + IQR clip · 0.31s · pandas: ~1.2s manual
Duplicate removal (exact + near) · 0.09s · pandas: 0.07s (exact only)
Full 8-phase data repair · 0.61s · pandas: ~4s manual setup
Contract fit · 0.18s · Great Expectations: ~2.1s
Contract validation · 0.11s · Great Expectations: ~0.9s
Drift detection (10 columns) · 0.29s · Evidently: ~0.8s
Failure trace (RandomForest) · 1.14s · SHAP: ~8.2s
Full pipeline (all 4 pillars) · 2.1s · ~7s+ combined (no equivalent, separate tools required)
datamend's FailureTrace is 7× faster than SHAP on the same dataset because it uses surrogate importances rather than marginal Shapley value computation. Benchmarks are indicative and vary by hardware, dataset shape, and Python version.
Comparison

Why not just use the existing tools?

Each alternative solves one slice of the problem. datamend solves all four — in one install, one API, one report. No stitching five libraries together.

pandas vs datamend

pandas gives you the primitives. datamend gives you the automation. Writing manual null + outlier + duplicate detection code takes 200+ lines. datamend does it in one call, with a full audit trail of every fix and a composite health score.

Auto-selects imputation strategy · MendScore metric · Full audit trail
Great Expectations vs datamend

Great Expectations requires you to define every expectation manually. DataContract learns from your training data automatically — and validates against it in 2 lines of code, not 20. Includes KS-test distribution validation that GX doesn't offer out of the box.

Auto-learned contract · KS distribution check · Integrated repair
Evidently vs datamend

Evidently is a great reporting tool, but it doesn't repair data, enforce contracts, or trace failures. DriftRadar runs PSI + KS + chi² + JSD in one call and combines them into a single severity score.

JSD test built-in · Combined drift score · Offline single-file dashboard
SHAP vs datamend

SHAP attributes individual predictions to features. FailureTrace instead ranks the specific rows most likely to fail and fuses data quality signals with model confidence. In the benchmark above it ran about 7× faster by using surrogate importances instead of marginal Shapley value computation.

Row-level suspicion score · DQ × model signal fusion · 7× faster attribution

Feature Matrix

Feature datamend pandas GX Evidently SHAP
Auto null imputation ⚠️ manual
Outlier repair ⚠️ manual
Encoding corruption fix
Auto-learned contracts
Statistical contract (KS)
PSI drift detection
JSD drift detection
Row-level suspicion scores
MendScore composite metric
Offline HTML dashboard ⚠️ server
MLflow / W&B integration ⚠️
Plugin system
Lines of code to get started 150+ 20+ 10+ 10+
Integrations

Fits into your existing stack

datamend integrates with the tools data scientists already use — experiment trackers, version control systems, and CI/CD pipelines.

MLflow

Log MendScore, repair counts, drift metrics, and full pipeline reports as JSON artifacts in any MLflow run.

import mlflow
from datamend.integrations.mlflow import log_pipeline_result
with mlflow.start_run():
    log_pipeline_result(result)

Weights & Biases

Stream data quality metrics alongside model performance to your W&B dashboard in real time.

import wandb
from datamend.integrations.wandb import log_repair
wandb.init(project="my-ml-project")
log_repair(repair_report)

DVC

Save repair and drift metrics as DVC-tracked JSON files alongside your data pipeline for reproducibility.

from datamend.integrations.dvc import save_repair_metrics
save_repair_metrics(report, path="metrics/repair.json")

scikit-learn

Native feature importance extraction from any sklearn estimator with .feature_importances_ or .coef_ attributes.

from sklearn.ensemble import RandomForestClassifier
report = datamend.trace(rf_model, df, preds)

XGBoost

Plug in XGBClassifier or XGBRegressor. datamend extracts feature importances natively — no wrappers needed.

from xgboost import XGBRegressor
report = datamend.trace(xgb_model, df, preds)

PyTorch

Works with any nn.Module. Surrogate DecisionTreeRegressor used for attribution when native importances are unavailable.

import torch.nn as nn
report = datamend.trace(my_net, df, preds)

Ready to fix your data pipeline?

Install datamend in 10 seconds. Get your first MendScore in 30 seconds. Save 10+ hours every week.

$ pip install datamend