datamend is the production-grade Python library that automatically repairs corrupt data, enforces statistical contracts, detects distribution drift, and traces model failures — in one unified API, zero configuration.
datamend's top-level API is designed to be learned in minutes and relied on for years.
import pandas as pd
import datamend
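df = pd.read_csv("production_data.csv") # 50k rows, real mess
# train_df, model, and predictions are assumed to exist already:
# the training DataFrame, a fitted model, and that model's outputs for these rows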
# ── Pillar 1: Repair — heals nulls, outliers, duplicates, encoding ────
repaired, report = datamend.repair(df)
print(report.mend_score_after) # 96.8/100 ✓
print(report.total_issues_found) # 247
# ── Pillar 2: Contract — auto-learned schema validation ───────────────
contract = datamend.contract(train_df)
violations = datamend.validate(repaired, contract)
print(violations.passed) # True ✓
# ── Pillar 3: Drift — four statistical tests in one call ──────────────
drift = datamend.drift(train_df, repaired)
print(drift.columns_drifted) # ['income'] ⚠
# ── Pillar 4: Trace — which rows and columns caused failures ──────────
trace = datamend.trace(model, repaired, predictions)
print(trace.suspicious_rows[:3]) # [1042, 887, 3310] ⚠
Each pillar is a standalone powerhouse. Together, they give you complete visibility and control over your data — from raw ingestion all the way to model output attribution.
An 8-phase engine that detects and heals nulls, outliers, type mismatches, duplicates, encoding corruption, and whitespace — automatically, zero config required.
repaired, report = datamend.repair(df)
Automatically learns the full statistical fingerprint of your training data and validates every future batch against it — without you writing a single expectation manually.
contract = datamend.contract(train_df)
report = datamend.validate(prod_df, contract)
Runs four independent statistical tests on every feature column and combines them into one drift verdict with severity scoring — so you know before your model performance drops.
report = datamend.drift(train_df, prod_df)
Combines data-quality signals, model confidence estimates, and surrogate importances to surface the exact rows and columns causing your predictions to fail.
report = datamend.trace(model, df, preds)
MendPipeline chains all four pillars in a stateful object: fit once on training data, then run on every production batch (about 2.1s for a 100,000-row × 20-column batch in the benchmark below).
Input: Any pandas DataFrame (CSV, Parquet, JSON). Dirty data welcome.
Repair: 8-phase repair engine heals every corruption type. MendScore computed.
Contract: Validates repaired data against learned schema. Violations surfaced.
Drift: PSI + KS + chi² + JSD compared against training distribution.
Trace: Suspicious rows and columns attributed. Root cause identified.
Output: Clean DataFrame + all reports + overall MendScore. Export to JSON or HTML.
from datamend import MendPipeline
pipeline = MendPipeline(
    repair_strategy="auto",  # auto-selects imputation per column
    null_threshold=0.05,     # max 5% nulls allowed by contract
    drift_alpha=0.05,        # significance level for KS / chi2
    psi_buckets=10,          # bins for PSI computation
    top_k_trace=10,          # top suspicious rows to surface
    verbose=True,
)
# ── Fit once on clean training data ──────────────────────────────
pipeline.fit(train_df)
# ── Run on every production batch ────────────────────────────────
result = pipeline.transform(prod_df, model=model, predictions=preds)
print(result.overall_mend_score) # 91.4
print(result.repair_report.mend_score_after) # 96.8
result.repaired_df.to_parquet("clean_batch.parquet")
result.to_json() # full JSON-serializable report
import datamend
# Each pillar works fully standalone
# Repair with explicit strategy
repaired, report = datamend.repair(df, strategy="median", verbose=True)
# Fit and save a versioned contract
contract = datamend.contract(train_df)
contract.save("contracts/v1.json") # commit this to git!
# Load and strictly enforce
contract = datamend.DataContract.load("contracts/v1.json")
report = datamend.validate(prod_df, contract, raise_on_failure=True)
# Drift on specific columns only
drift = datamend.drift(train_df, prod_df, columns=["age", "income"])
# Trace with ground truth labels
trace = datamend.trace(model, prod_df, preds, ground_truth=y_true)
# Full CLI — supports CSV, Parquet, JSON, Excel
# Repair a dirty file and save output
datamend repair data.csv -o clean.csv --strategy median --verbose
# Fit a contract from training data
datamend contract train.csv -o contracts/v1.json
# Validate production data against saved contract
datamend validate prod.csv --contract contracts/v1.json
# Detect drift between two datasets
datamend drift train.csv prod.csv --alpha 0.01 --columns age income score
# Get a quick MendScore without running full repair
datamend score data.csv
# Generate full HTML dashboard and open in browser
datamend dashboard data.csv -o report.html --open
# List all registered plugins
datamend plugins list
from datamend import AutoRepair
# Chunked mode — handles 50M+ rows without OOM
engine = AutoRepair(strategy="median", fast_mode=True)
repaired, report = engine.repair_chunked(
    df,
    chunk_size=1_000_000,  # 1M rows per chunk
)
print(f"Processed: {len(repaired):,} rows")
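# Async concurrent processing of multiple batches
import asyncio
import datamend

async def process(batch):
    # datamend.repair is a blocking call, so hand it to the default thread pool
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, datamend.repair, batch)

async def main(batches):
    # await is only valid inside a coroutine, so gather lives in main()
    return await asyncio.gather(*(process(b) for b in batches))

results = asyncio.run(main(batches))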
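The chunked mode shown above can be pictured with plain pandas. The following is only a sketch of the general idea (split, repair each slice, concatenate), not datamend's actual implementation; `repair_in_chunks` and `fill_with_median` are hypothetical names used for illustration.

```python
import pandas as pd

def repair_in_chunks(df, repair_fn, chunk_size):
    """Apply repair_fn to consecutive row slices and concatenate the results."""
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        parts.append(repair_fn(chunk))
    return pd.concat(parts)

def fill_with_median(chunk):
    # Hypothetical stand-in for a repair pass: impute nulls with the chunk median
    return chunk.fillna(chunk.median(numeric_only=True))

demo = pd.DataFrame({"x": [1.0, None, 3.0, 10.0, None, 30.0]})
repaired = repair_in_chunks(demo, fill_with_median, chunk_size=3)
print(repaired["x"].tolist())
```

One design point this makes visible: statistics computed per chunk (here, the median) can differ from the statistics of the full dataset, so a chunked pass may impute different values than a single pass would.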
Complete API reference, tutorials, algorithm explanations, and real-world patterns — all in one place.
datamend.core.repair
AutoRepair(strategy="auto", fast_mode=False, plugins=[], verbose=True)
strategy: "auto" | "mean" | "median" | "mode". "auto" selects per column based on skewness (>1.0 → median, else mean).
plugins: BaseRepairPlugin instances to run after the 8-phase engine.
fit_transform(df) → (DataFrame, RepairReport)
Run all 8 detection phases on df and return repaired DataFrame + full report.
repair_chunked(df, chunk_size=500_000) → (DataFrame, RepairReport)
Process df in chunks. Merges per-chunk reports and concatenates repaired frames.
total_issues_found (int): total number of issues detected and fixed
total_rows_affected (int): number of rows changed in any phase
actions (List[RepairAction]): one entry per fix applied
columns_repaired (List[str]): columns that had at least one fix
mend_score_before (float): data health score before repair (0–100)
mend_score_after (float): data health score after repair (0–100)
duration_seconds (float): wall-clock time for the full repair pass
from datamend import AutoRepair
engine = AutoRepair(strategy="auto", verbose=False)
repaired, report = engine.fit_transform(df)
for action in report.actions:
    print(f"[{action.column}] {action.issue_type}: {action.rows_affected} rows")
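To make the "auto" strategy rule concrete, here is a minimal pandas sketch of skewness-based imputation using the threshold stated above (skewness > 1.0 → median, else mean). The helper names `choose_strategy` and `impute_numeric` are hypothetical, and how datamend treats negative skew or non-numeric columns is not specified here, so this sketch simply follows the stated rule literally.

```python
import pandas as pd

def choose_strategy(series: pd.Series, skew_threshold: float = 1.0) -> str:
    """Return "median" for strongly right-skewed columns, otherwise "mean"."""
    skew = series.dropna().skew()
    return "median" if skew > skew_threshold else "mean"

def impute_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls in numeric columns with the statistic chosen per column."""
    out = df.copy()
    for col in out.select_dtypes("number").columns:
        if choose_strategy(out[col]) == "median":
            fill = out[col].median()
        else:
            fill = out[col].mean()
        out[col] = out[col].fillna(fill)
    return out

demo = pd.DataFrame({
    "income": [30, 32, 35, 31, 33, 900, None],  # one extreme value, heavy right tail
    "age": [25, 30, 35, 40, 45, 50, None],      # symmetric
})
filled = impute_numeric(demo)
print(choose_strategy(demo["income"]), choose_strategy(demo["age"]))
```

The median is the safer fill for the skewed column because a single extreme value (900 here) would drag the mean far away from typical rows.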
datamend.core.contract
DataContract(null_threshold=0.05)
fit(df) → self
Learn statistical fingerprint from training data. Stores ColumnSpec per column.
validate(df, raise_on_failure=False) → ContractReport
Run 7 validation checks. Returns report with violations list and passed flag.
save(path) / load(path)
Persist contract as JSON for version control and reproducible validation.
from datamend import DataContract
contract = DataContract(null_threshold=0.02)
contract.fit(train_df)
contract.save("contracts/prod_v1.json")
report = contract.validate(prod_df)
if not report.passed:
    for v in report.violations:
        print(f"[{v.severity}] {v.column}: {v.message}")
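The fit-then-validate flow above can be illustrated with a much smaller, hand-rolled contract. This sketch covers only three checks (dtype, null rate, numeric range), whereas the text above mentions seven; `learn_contract` and `check_contract` are hypothetical helpers and the dictionary layout is an assumption, not datamend's ColumnSpec format.

```python
import pandas as pd

def learn_contract(train: pd.DataFrame, null_threshold: float = 0.05) -> dict:
    """Record dtype, allowed null rate, and numeric range for each column."""
    contract = {}
    for col in train.columns:
        s = train[col]
        spec = {"dtype": str(s.dtype), "max_null_rate": null_threshold}
        if pd.api.types.is_numeric_dtype(s):
            spec["min"], spec["max"] = float(s.min()), float(s.max())
        contract[col] = spec
    return contract

def check_contract(df: pd.DataFrame, contract: dict) -> list:
    """Return human-readable violations; an empty list means the batch passed."""
    violations = []
    for col, spec in contract.items():
        if col not in df.columns:
            violations.append(f"{col}: missing column")
            continue
        s = df[col]
        if str(s.dtype) != spec["dtype"]:
            violations.append(f"{col}: dtype {s.dtype}, expected {spec['dtype']}")
        null_rate = s.isna().mean()
        if null_rate > spec["max_null_rate"]:
            violations.append(f"{col}: null rate {null_rate:.2f} exceeds threshold")
        if "min" in spec and pd.api.types.is_numeric_dtype(s):
            n_out = int(((s < spec["min"]) | (s > spec["max"])).sum())
            if n_out:
                violations.append(f"{col}: {n_out} values outside training range")
    return violations

train = pd.DataFrame({"age": [25, 30, 35, 40], "city": ["a", "b", "a", "c"]})
prod = pd.DataFrame({"age": [28, 120, 33, 31], "city": ["a", "b", None, "c"]})
contract = learn_contract(train)
violations = check_contract(prod, contract)
print(violations)
```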
datamend.core.drift
DriftRadar(psi_buckets=10, alpha=0.05, verbose=True)
Population Stability Index. Compares bin frequencies between train and prod. PSI > 0.25 = significant drift. Used in credit scoring models for decades.
PSI = Σ (A% - E%) × ln(A% / E%)
Kolmogorov-Smirnov test. Measures maximum distance between empirical CDFs. Non-parametric: no distribution assumption needed.
D = max|F_train(x) - F_prod(x)|
Chi-square test. For categorical columns. Compares observed vs expected category frequencies. Flags columns where the category mix has changed significantly.
χ² = Σ (O - E)² / E
Jensen-Shannon Divergence. Symmetric, bounded (0–1), works on both continuous and categorical columns. Complementary to KS and PSI.
JSD = ½KL(P||M) + ½KL(Q||M)
psi (float): Population Stability Index
ks_stat / ks_pvalue (float): KS statistic and p-value
chi2_stat / chi2_pvalue (float): chi-square stat and p-value
jsd (float): Jensen-Shannon Divergence (0–1)
drift_score (float): combined weighted score (0–100)
drifted (bool): True if any test flags drift
severity (str): "none" | "low" | "medium" | "high" | "critical"
from datamend import DriftRadar
radar = DriftRadar(psi_buckets=20, alpha=0.01)
report = radar.detect(train_df, prod_df)
for col, r in report.column_results.items():
    if r.drifted:
        print(f"{col}: PSI={r.psi:.3f} severity={r.severity}")
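The four formulas above can be written out directly in NumPy. This is a sketch of the textbook statistics, not DriftRadar's internals: the function names are hypothetical, the quantile binning and the `eps` floor in `psi` are assumptions, p-values are omitted, and JSD uses the standard mixture M = ½(P + Q) with base-2 logarithms so the result stays in the 0–1 range.

```python
import numpy as np

def psi(expected, actual, buckets=10, eps=1e-6):
    """PSI = sum((A% - E%) * ln(A% / E%)) over quantile bins of the reference."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points taken from quantiles of the reference sample
    cuts = np.quantile(expected, np.linspace(0, 1, buckets + 1)[1:-1])
    e = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=buckets) / len(expected)
    a = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=buckets) / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

def ks_statistic(x, y):
    """D = max |F_x(t) - F_y(t)| evaluated at every observed point."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def chi2_statistic(observed_counts, expected_props):
    """chi2 = sum((O - E)^2 / E) with E scaled to the observed total."""
    observed = np.asarray(observed_counts, dtype=float)
    expected = np.asarray(expected_props, dtype=float) * observed.sum()
    return float(np.sum((observed - expected) ** 2 / expected))

def _kl(a, b):
    mask = a > 0  # terms with a == 0 contribute nothing
    return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

def jsd(p, q):
    """JSD = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = 0.5*(P + Q)."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return float(0.5 * _kl(p, m) + 0.5 * _kl(q, m))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # same distribution as train
shifted = rng.normal(1.0, 1.0, 5000)   # mean shifted by one standard deviation
print(psi(train, same), psi(train, shifted))
print(ks_statistic(train, same), ks_statistic(train, shifted))
```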
datamend.core.trace
FailureTrace(top_k=10, verbose=True)
row_index (int): index in the original DataFrame
suspicion_score (float): composite suspicion (0–1, higher = more suspicious)
top_columns (List[str]): columns most responsible for this row's failure
data_quality_score (float): per-row DQ score (0–1, lower = worse quality)
model_confidence (float): model confidence for this row (0–1)
reason (str): human-readable explanation of why this row is suspicious
from datamend import FailureTrace
tracer = FailureTrace(top_k=20)
report = tracer.trace(model, df, predictions, ground_truth=y_true)
for row in report.suspicious_rows:
    print(f"Row {row.row_index}: suspicion={row.suspicion_score:.3f}")
    print(f"  Columns: {row.top_columns[:3]}")
    print(f"  Reason: {row.reason}")
datamend.pipeline — All four pillars unified
fit(train_df) → self
Repairs training data, learns contract, stores clean reference for drift comparison.
transform(df, model=None, predictions=None, ground_truth=None) → PipelineResult
Runs all enabled pillars on new data. Returns PipelineResult with everything.
fit_transform(train_df, prod_df=None, ...) → PipelineResult
Convenience wrapper: fit + transform in one call. If prod_df is None, analyses train_df itself.
score = 0.35 × repair_after
+ 0.30 × contract_score
+ 0.20 × (100 - drift_score)
+ 0.15 × (100 - trace_score)
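As a worked instance of the weighting above, the sketch below plugs in illustrative numbers. Only the four weights come from the formula; the function name is hypothetical, the input values are made up for the example, and it assumes every component is expressed on a 0–100 scale.

```python
def overall_mend_score(repair_after, contract_score, drift_score, trace_score):
    """Weighted combination from the formula above; all inputs assumed to be 0-100."""
    return (
        0.35 * repair_after
        + 0.30 * contract_score
        + 0.20 * (100 - drift_score)   # less drift gives a higher score
        + 0.15 * (100 - trace_score)   # less suspicion gives a higher score
    )

# Illustrative inputs, not measured results
score = overall_mend_score(
    repair_after=96.8, contract_score=100.0, drift_score=20.0, trace_score=10.0
)
print(round(score, 2))  # 33.88 + 30.0 + 16.0 + 13.5 = 93.38
```

The weights sum to 1.0, so a batch with perfect repair and contract scores and no drift or trace suspicion scores exactly 100.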
datamend.plugins.base
Extend datamend with custom repair logic using the plugin interface. Plugins run after the 8 built-in phases and integrate fully with MendReport and the CLI.
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction
@register_plugin
class ClipNegativePlugin(BaseRepairPlugin):
    name = "clip_negative"
    description = "Clips negative values to 0"

    def repair(self, df):
        df, actions = df.copy(), []
        for col in df.select_dtypes("number").columns:
            mask = df[col] < 0
            count = mask.sum()
            if count:
                df.loc[mask, col] = 0
                actions.append(RepairAction(
                    column=col, issue_type="NEGATIVE_VALUE",
                    description=f"Clipped {count} negatives to 0",
                    rows_affected=int(count),
                    before_sample=None, after_sample=None,
                    strategy="clip_negative",
                ))
        return df, actions
# Use it
repaired, report = datamend.repair(df, plugins=[ClipNegativePlugin()])
# In your library's pyproject.toml:
[project.entry-points."datamend.plugins"]
my_plugin = "my_package.plugins:MyPlugin"
# datamend discovers it automatically on install
from datamend.plugins.base import get_registry
registry = get_registry()
registry.auto_discover()
print(registry.list_plugins()) # includes "my_plugin"
datamend.report — MendReport
Generate a self-contained, dark-mode HTML dashboard from any combination of pillar reports. No server. No internet. No CDN dependencies — one single .html file.
MendReport(repair_report=None, contract_report=None, drift_report=None, trace_report=None)
to_html(path) → str
Write single-file dashboard to disk. Returns the rendered HTML string.
serve(port=8080, open_browser=True)
Spin up a local HTTP server and open the dashboard in your browser.
from datamend import MendReport
report = MendReport(
    repair_report=repair_report,
    contract_report=contract_report,
    drift_report=drift_report,
    trace_report=trace_report,
)
report.to_html("dashboard.html") # save to disk
report.serve(port=8080) # open in browser
MendScore is a composite 0–100 health metric computed on every run. It weights null rate, outlier rate, duplicate rate, and whitespace contamination into a single actionable score you can track over time, log to MLflow/W&B, and set threshold alerts on.
MendScore after repair on a sample dirty dataset
Benchmarked on a 100,000-row × 20-column dataset · MacBook Pro M2 · Python 3.11 · average of 5 runs
| Task | datamend | pandas (manual) | Great Expectations | Evidently | SHAP |
|---|---|---|---|---|---|
| Null detection + imputation | 0.12s | 0.08s | — | — | — |
| Outlier detection + IQR clip | 0.31s | ~1.2s manual | — | — | — |
| Duplicate removal (exact + near) | 0.09s | 0.07s (exact only) | — | — | — |
| Full 8-phase data repair | 0.61s | ~4s manual setup | — | — | — |
| Contract fit | 0.18s | — | ~2.1s | — | — |
| Contract validation | 0.11s | — | ~0.9s | — | — |
| Drift detection (10 columns) | 0.29s | — | — | ~0.8s | — |
| Failure trace (RandomForest) | 1.14s | — | — | — | ~8.2s |
| Full pipeline (all 4 pillars) | 2.1s | ~7s+ combined | No equivalent; separate tools required | No equivalent | No equivalent |
Each alternative solves one slice of the problem. datamend solves all four — in one install, one API, one report. No stitching five libraries together.
pandas gives you the primitives. datamend gives you the automation. Writing manual null + outlier + duplicate detection code takes 200+ lines. datamend does it in one call, with a full audit trail of every fix and a composite health score.
Great Expectations requires you to define every expectation manually. DataContract learns from your training data automatically — and validates against it in 2 lines of code, not 20. Includes KS-test distribution validation that GX doesn't offer out of the box.
Evidently is a great reporting tool but it doesn't repair data, enforce contracts, or trace failures. DriftRadar runs PSI + KS + chi² + JSD simultaneously — including JSD which Evidently doesn't offer — and combines them into a single severity score.
SHAP attributes each prediction to its input features using Shapley values. FailureTrace instead ranks the specific rows most likely to fail by fusing data quality signals with model confidence, and ran about 7× faster in the benchmark above (1.14s vs ~8.2s) because it uses surrogate importances instead of computing Shapley values.
| Feature | datamend | pandas | GX | Evidently | SHAP |
|---|---|---|---|---|---|
| Auto null imputation | ✅ | ⚠️ manual | ❌ | ❌ | ❌ |
| Outlier repair | ✅ | ⚠️ manual | ❌ | ❌ | ❌ |
| Encoding corruption fix | ✅ | ❌ | ❌ | ❌ | ❌ |
| Auto-learned contracts | ✅ | ❌ | ❌ | ❌ | ❌ |
| Statistical contract (KS) | ✅ | ❌ | ❌ | ❌ | ❌ |
| PSI drift detection | ✅ | ❌ | ❌ | ✅ | ❌ |
| JSD drift detection | ✅ | ❌ | ❌ | ❌ | ❌ |
| Row-level suspicion scores | ✅ | ❌ | ❌ | ❌ | ❌ |
| MendScore composite metric | ✅ | ❌ | ❌ | ❌ | ❌ |
| Offline HTML dashboard | ✅ | ❌ | ⚠️ server | ✅ | ❌ |
| MLflow / W&B integration | ✅ | ❌ | ❌ | ⚠️ | ❌ |
| Plugin system | ✅ | ❌ | ✅ | ❌ | ❌ |
| Lines of code to get started | 1 | 50+ | 20+ | 10+ | 10+ |
datamend integrates with the tools data scientists already use — experiment trackers, version control systems, and CI/CD pipelines.
Log MendScore, repair counts, drift metrics, and full pipeline reports as JSON artifacts in any MLflow run.
import mlflow
from datamend.integrations.mlflow import log_pipeline_result
with mlflow.start_run():
    log_pipeline_result(result)
Stream data quality metrics alongside model performance to your W&B dashboard in real time.
import wandb
from datamend.integrations.wandb import log_repair
wandb.init(project="my-ml-project")
log_repair(repair_report)
Save repair and drift metrics as DVC-tracked JSON files alongside your data pipeline for reproducibility.
from datamend.integrations.dvc import save_repair_metrics
save_repair_metrics(report, path="metrics/repair.json")
Native feature importance extraction from any sklearn estimator with .feature_importances_ or .coef_ attributes.
from sklearn.ensemble import RandomForestClassifier
report = datamend.trace(rf_model, df, preds)
Plug in XGBClassifier or XGBRegressor. datamend extracts feature importances natively — no wrappers needed.
from xgboost import XGBRegressor
report = datamend.trace(xgb_model, df, preds)
Works with any nn.Module. Surrogate DecisionTreeRegressor used for attribution when native importances are unavailable.
import torch.nn as nn
report = datamend.trace(my_net, df, preds)
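The surrogate idea mentioned above can be sketched with scikit-learn alone: fit a shallow DecisionTreeRegressor to the black-box model's predictions and read the tree's `feature_importances_`. This illustrates the general technique under simple assumptions (numeric features, a hypothetical `surrogate_importances` helper, a synthetic stand-in for the model's outputs); it is not datamend's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def surrogate_importances(X, predictions, max_depth=4, random_state=0):
    """Fit a small tree that mimics the model's outputs and return its importances."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    tree.fit(X, predictions)
    return tree.feature_importances_

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Stand-in for an opaque model's predictions: driven almost entirely by feature 0
black_box_preds = 3.0 * X[:, 0] + 0.1 * X[:, 2]
importances = surrogate_importances(X, black_box_preds)
print(importances)
```

The surrogate only needs the inputs and the predictions, which is why the same approach applies to models that expose no importances of their own.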
Install datamend in 10 seconds. Get your first MendScore in 30 seconds. Save 10+ hours every week.