Automatic Data Cleaning in One Line

puredata cleans any dirty dataset automatically — missing values, outliers, encoding bugs, mixed units, fuzzy categories. Then silently watches for data drift in production. Nine cleaning stages. Seven compatibility checks. Zero configuration.

$ pip install puredatalib
9 CLEANING STAGES
7 WATCH CHECKS
93% TEST COVERAGE
1 LINE OF CODE
v0.2 CURRENT VERSION
Pillar I — AutoClean

Nine-Stage Cleaning Pipeline

Every dataset passes through nine intelligent stages in order. Encoding is fixed before whitespace, types before dates, duplicates before category normalisation. Deterministic, reproducible, auditable.

01 Encoding Repair
02 Whitespace Norm
03 Type Coercion
04 Date Standard
05 Duplicate Removal
06 Category Norm
07 Unit Norm
08 Null Imputation
09 Outlier Detection
STAGE 01
Encoding Repair
Strips BOM markers (U+FEFF) and zero-width spaces (U+200B), removes control characters, and normalises Unicode to NFC. Handles latin-1 mojibake automatically.
chardet · unicodedata
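A rough sketch of what this stage's description implies, using only the standard library. It is not puredata's implementation: the names `repair_text` and `fix_mojibake` are hypothetical, and the mojibake fix covers only the common case of UTF-8 text that was mis-decoded as latin-1.

```python
import unicodedata

# Invisible characters named in the stage description:
# byte-order mark (U+FEFF) and zero-width space (U+200B).
_INVISIBLE = {"\ufeff", "\u200b"}


def repair_text(value: str) -> str:
    """Drop BOM/zero-width spaces and control characters, then normalise to NFC."""
    kept = [ch for ch in value if ch not in _INVISIBLE]
    # Unicode category "Cc" is control characters; tab and newline are kept.
    kept = [ch for ch in kept if ch in "\t\n" or unicodedata.category(ch) != "Cc"]
    return unicodedata.normalize("NFC", "".join(kept))


def fix_mojibake(value: str) -> str:
    """Undo UTF-8 text that was decoded as latin-1; return the input if that fails."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # already fine, or not this kind of mojibake


cleaned = repair_text("\ufeffcafe\u0301\u200b")  # decomposed "e" + accent becomes one code point
```

Returning the input unchanged on a failed round trip keeps correctly decoded text safe, which matters when only some rows in a column are damaged.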
STAGE 02
Whitespace Normalisation
Strips leading/trailing whitespace, collapses repeated spaces, normalises line breaks. Applied to all string and object columns.
pandas str.strip · re.sub
STAGE 03
Type Coercion
Detects columns that look numeric but are stored as strings (e.g. "42.0"), safely converts them. Preserves genuine string columns.
pd.to_numeric · dtype inference
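One way to picture "safe" conversion with `pd.to_numeric`: convert only when every non-null value parses, otherwise leave the column alone. The helper name `coerce_numeric` and the `min_success` parameter are assumptions for illustration, not part of puredata's API.

```python
import pandas as pd


def coerce_numeric(series: pd.Series, min_success: float = 1.0) -> pd.Series:
    """Return a numeric version of the column only if enough values parse."""
    if pd.api.types.is_numeric_dtype(series):
        return series  # nothing to do
    non_null = series.notna().sum()
    if non_null == 0:
        return series  # all-null column: no evidence either way
    # Unparseable entries become NaN instead of raising.
    parsed = pd.to_numeric(series, errors="coerce")
    success = parsed.notna().sum() / non_null
    return parsed if success >= min_success else series


numeric = coerce_numeric(pd.Series(["42.0", "7", None]))  # converted
labels = coerce_numeric(pd.Series(["a", "1"]))            # left as strings
```

Comparing parsed values against the original non-null count is what protects genuine string columns that happen to contain a few digits.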
STAGE 04
Date Standardisation
Detects 12+ date formats via dateutil and converts them to datetime64 values in ISO 8601 form. Columns where more than 50% of values are purely numeric are skipped.
python-dateutil · pandas Timestamp
STAGE 05
Duplicate Removal
Removes exact duplicate rows, keeping first occurrence. Target columns are excluded from deduplication to protect label integrity.
DataFrame.drop_duplicates
STAGE 06
Category Normalisation
Clusters inconsistent labels using fuzzy matching (RapidFuzz similarity ratio ≥ 85) plus prefix/abbreviation detection. "M" → "Male", "NY" → "New York".
rapidfuzz · prefix clustering
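The clustering idea can be sketched without RapidFuzz. The version below swaps in the standard library's `difflib.SequenceMatcher` as the similarity measure, so scores will differ from RapidFuzz's, and it handles prefixes only ("M" → "Male"), not abbreviations such as "NY". All names here are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib, not RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalise_categories(values, threshold=85.0):
    """Map each label to the most frequent label in its cluster."""
    canonical = []  # cluster representatives, most frequent first
    mapping = {}
    for label, _count in Counter(values).most_common():
        match = None
        for candidate in canonical:
            is_prefix = (
                len(label) < len(candidate)
                and candidate.lower().startswith(label.lower())
            )
            if is_prefix or similarity(label, candidate) >= threshold:
                match = candidate
                break
        if match is None:
            canonical.append(label)  # starts a new cluster
            match = label
        mapping[label] = match
    return [mapping[v] for v in values]


raw = ["Male", "Male", "male", "M", "Female", "Female", "Femal", "F"]
normalised = normalise_categories(raw)
```

Processing labels from most to least frequent means the common spelling becomes the canonical one, and rare variants are folded into it.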
STAGE 07
Unit Normalisation
Detects mixed measurement units (kg/lbs, km/mi, °C/°F) within a column, converts all values to the dominant unit using exact conversion factors.
regex unit parsing · SI conversion
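To make "convert to the dominant unit" concrete, here is a small standard-library sketch for weights only. The regex, the `ALIASES` and `TO_KG` tables, and the function names are illustrative assumptions; the one fixed fact is that 1 lb is defined as exactly 0.45359237 kg.

```python
import re
from collections import Counter

ALIASES = {"kg": "kg", "lb": "lb", "lbs": "lb"}  # spelling -> canonical unit
TO_KG = {"kg": 1.0, "lb": 0.45359237}            # exact by definition
_VALUE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


def parse(value: str):
    """Return (number, canonical_unit), or None if the text is not a known weight."""
    m = _VALUE.match(value)
    if not m or m.group(2).lower() not in ALIASES:
        return None
    return float(m.group(1)), ALIASES[m.group(2).lower()]


def to_dominant_unit(values):
    """Convert every parseable value to the most common unit in the column."""
    parsed = [parse(v) for v in values]
    units = Counter(p[1] for p in parsed if p is not None)
    if not units:
        return list(values), None
    dominant = units.most_common(1)[0][0]
    converted = []
    for raw, p in zip(values, parsed):
        if p is None:
            converted.append(raw)  # leave unparseable entries untouched
        else:
            number, unit = p
            converted.append(round(number * TO_KG[unit] / TO_KG[dominant], 4))
    return converted, dominant


converted, unit = to_dominant_unit(["70 kg", "154 lbs", "80kg", "n/a"])
```

Converting through a common base unit (kilograms here) keeps the table small: adding a unit needs one factor, not one per pair.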
STAGE 08
Null Imputation
Adaptive strategy: KNNImputer (0–40% missing) → IterativeImputer/MICE (40–99%) → zero fill (100%). Categorical nulls filled with mode.
sklearn KNNImputer · IterativeImputer
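The tiering above amounts to a lookup on the missing fraction. A minimal sketch follows, with two assumptions that the description leaves open: a column at exactly 40% missing stays in the KNN tier, and anything above 40% but below 100% uses the iterative imputer. The function names and return labels are hypothetical.

```python
def missing_fraction(values) -> float:
    """Share of entries that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def choose_imputer(fraction: float) -> str:
    """Pick an imputation strategy from the fraction of missing values."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction >= 1.0:
        return "zero_fill"   # nothing to learn from: the column is empty
    if fraction > 0.40:
        return "iterative"   # assumption: boundary value 0.40 stays with KNN
    return "knn"


strategy = choose_imputer(missing_fraction([1.0, None, 3.0, 4.0]))
```

In a real pipeline the returned label would select `sklearn.impute.KNNImputer` or `sklearn.impute.IterativeImputer`; note that the latter requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported.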
STAGE 09
Ensemble Outlier Detection
Four detectors vote: IQR (1.5×), Z-score (|z|>3), Isolation Forest, Local Outlier Factor. A row is flagged only when the vote fraction exceeds your threshold. No single algorithm dominates.
IQR · Z-score · IsolationForest · LOF
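The voting rule is easy to illustrate with two of the four detectors. This sketch implements only IQR and Z-score on a single numeric column with the standard library; Isolation Forest and LOF are left out, and the function names are hypothetical. With four detectors and a threshold of 0.6, three votes would be needed; with the two shown here, both must agree.

```python
import statistics


def iqr_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def zscore_flags(values, limit=3.0):
    """Flag points whose absolute z-score exceeds the limit."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)  # constant column: no outliers
    return [abs((v - mean) / sd) > limit for v in values]


def vote(flag_lists, threshold=0.6):
    """Flag a point only when the fraction of detectors voting for it exceeds the threshold."""
    n = len(flag_lists)
    return [sum(votes) / n > threshold for votes in zip(*flag_lists)]


values = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11] * 2 + [500]
flags = vote([iqr_flags(values), zscore_flags(values)])
```

The strict `>` comparison mirrors the wording "exceeds your threshold": a point with exactly 60% of the votes would not be flagged at a threshold of 0.6.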
PYTHON
import pandas as pd
import puredata

df = pd.read_csv("data.csv") # placeholder path; any pandas DataFrame works

# One-line clean with full report
clean_df, report = puredata.clean(df)

# Fine-grained control
from puredata.core.clean import AutoClean, AutoCleanConfig
config = AutoCleanConfig(
    fix_nulls=True, fix_outliers=True,
    fix_categories=True, fix_units=True,
    outlier_threshold=0.6, # 60% vote needed to flag
    target_col="price" # protect this column
)
engine = AutoClean(config=config)
clean_df, report = engine.clean(df)

print(report.mend_score) # 0–100 health score
print(report.summary()) # human-readable text
report.to_html("report.html") # full HTML report
Pillar II — DataWatch

Seven Silent Compatibility Checks

Fit a contract on your training data. Validate any production batch silently. DataWatch catches the silent killers — schema mutations, null spikes, distribution drift — before they corrupt predictions. Every check returns pass / warn / fail with an exact message.

Schema Validation
HARD FAIL
Detects columns that appeared, disappeared, or changed dtype between training and production. Column renames and type promotions are both caught.
Dtype Compatibility
WARN
Checks that the physical dtype matches the contract profile. A float64 column that becomes object — from a corrupted CSV — is immediately flagged.
Null Rate Spike
WARN
Compares null fraction against the baseline profile. Flags if the new null rate exceeds the reference by more than the configured tolerance (default 0.05).
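The comparison itself is a one-line inequality. A minimal sketch under stated assumptions: nulls are `None` or float NaN, the baseline rate comes from the training profile, and the names `null_fraction` and `null_spike_status` are hypothetical rather than DataWatch's API.

```python
import math


def null_fraction(values) -> float:
    """Fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    nulls = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return nulls / len(values)


def null_spike_status(values, baseline_rate, tolerance=0.05):
    """Return ('warn' | 'pass', message) by comparing against the baseline."""
    rate = null_fraction(values)
    if rate > baseline_rate + tolerance:
        return "warn", (
            f"null rate {rate:.2%} exceeds baseline {baseline_rate:.2%} "
            f"by more than {tolerance:.2%}"
        )
    return "pass", f"null rate {rate:.2%} within tolerance"


batch = [1.0, None, 3.0, float("nan"), 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
status, message = null_spike_status(batch, baseline_rate=0.10)
```

The tolerance is additive, so a baseline of 10% with the default 0.05 warns only once the batch passes 15% nulls.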
Range Violation
FAIL
Enforces [min, max] from the contract for numeric columns. Returns the exact out-of-range count and the violated bound.
Distribution Drift — Dual Gate
FAIL
Two independent tests must both agree: PSI (Population Stability Index) measures bin-level shifts; the KS (Kolmogorov-Smirnov) test measures the maximum distance between the two CDFs. Requiring both reduces false positives.
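A compact sketch of the dual gate using only the standard library. Several choices here are assumptions, not DataWatch's settings: ten equal-width bins taken from the reference data, a PSI limit of 0.2 (a common rule of thumb), and a cut-off on the raw KS statistic instead of a p-value.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins of the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # avoid zero width for constant data

    def shares(data):
        counts = [0] * bins
        for x in data:
            idx = int((x - low) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp into edge bins
        return [max(c / len(data), eps) for c in counts]  # eps avoids log(0)

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def drift_detected(reference, current, psi_limit=0.2, ks_limit=0.1):
    """Report drift only when both gates fire."""
    return (
        psi(reference, current) > psi_limit
        and ks_statistic(reference, current) > ks_limit
    )


reference = [float(x) for x in range(100)]
shifted = [x + 50 for x in reference]
```

The two measures fail differently: PSI is sensitive to how the bins are drawn, while KS reacts to any single large CDF gap, so requiring agreement filters out alarms that only one of them would raise.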
Cardinality Violation
WARN
Detects unseen categories and cardinality explosions. A new label that never appeared in training is reported by name, not just count.
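Reporting unseen labels by name comes down to a set difference. A minimal sketch, where `cardinality_check` and the `max_growth` factor that defines an "explosion" are hypothetical choices:

```python
def cardinality_check(reference, current, max_growth=2.0):
    """Return (status, unseen_labels) for a categorical column."""
    ref, cur = set(reference), set(current)
    unseen = sorted(cur - ref)                    # labels never seen in training
    exploded = len(cur) > max_growth * len(ref)   # far more distinct values than before
    status = "warn" if unseen or exploded else "pass"
    return status, unseen


status, unseen = cardinality_check(["a", "b", "a"], ["a", "c", "d"])
```

Returning the sorted label list instead of a count gives the caller something to act on, for example mapping the new labels or retraining an encoder.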
Custom Rules
EXTENSIBLE
Attach any callable as a rule: lambda df, col: (df[col] > 0).all(). Rules are stored in the JSON contract and re-evaluated on every check call.
PYTHON
import puredata

# Fit contract on training data
contract = puredata.watch(train_df)
contract.save("contract.json") # persist for production

# Validate production batch
result = puredata.check(prod_df, contract)
print(result.compatibility_score) # 0–100
print(result.passed) # True / False

# Raise automatically in CI/CD
result.raise_if_failed() # DataCompatibilityError on failure

# Load and re-validate
contract = puredata.DataContract.load("contract.json")
result = puredata.check(new_df, contract, mode="strict")
Comparison

Why puredata?

Compared to the most popular data quality tools in the Python ecosystem.

Tools compared: puredata, pandas, pyjanitor, great_expectations, evidently

Capabilities compared:
Auto null imputation (adaptive)
Ensemble outlier detection
Fuzzy category normalisation
Mixed unit normalisation
Encoding repair (BOM, ZWS)
Dual-gate drift detection
JSON-persistent data contract
One-line API
MendScore + repair report
HTML / JSON / CSV reports
sklearn-compatible pipeline
MLflow / W&B / DVC integration
CLI (clean / watch / check)
Plugin entry-point system
Performance

Benchmark Numbers

Measured on a 2024 MacBook Pro M3, single-threaded, pandas 2.x.

CLEAN — 10K ROWS × 20 COLS
0.8s
Full 9-stage pipeline including outlier detection
CLEAN — 100K ROWS × 20 COLS
4.2s
KNN imputation dominates at this scale
WATCH FIT — 50K ROWS
0.3s
Profile once, validate forever
WATCH CHECK — 10K ROWS
0.1s
All 7 checks including KS test and PSI
Integrations

Works With Your Entire Stack

Native integrations for experiment tracking and pipeline tools. puredata slots into your workflow without ceremony.

MLflow
Log MendScore, fix counts, and the full report JSON as artifacts automatically on every experiment run.
Weights & Biases
Stream cleaning metrics to W&B runs. Visualise data health alongside model accuracy and loss.
DVC
Write cleaning metrics to dvc.json for pipeline tracking and experiment comparison across commits.
Polars
Pass a Polars DataFrame directly; puredata converts, cleans, and returns pandas or polars as needed.
scikit-learn
MendPipeline wraps AutoClean + DataWatch in a sklearn-compatible fit_transform / transform interface.
File I/O
Native support for CSV, Excel, Parquet, and JSON via both the Python API and the CLI, with no extra setup.
PYTHON — INTEGRATIONS
import mlflow
import puredata
from puredata.integrations.mlflow import log_clean_report

with mlflow.start_run():
    clean_df, report = puredata.clean(df)
    log_clean_report(report) # metrics + artifact logged

# sklearn-compatible pipeline
from puredata import MendPipeline
pipeline = MendPipeline(watch_mode="strict")
X_clean = pipeline.fit_transform(X_train)
X_prod = pipeline.transform(X_prod) # raises if drift detected
Command Line

Full CLI — No Python Required

Every feature is accessible from the terminal. Clean CSVs, fit contracts, validate production data — all without writing a single line of Python.

TERMINAL
# Clean a CSV and save the report
$ puredata clean data.csv -o clean.csv --report-html report.html

# Fit a contract on training data
$ puredata watch train.csv --contract contract.json

# Validate production batch (exit code 1 if failures)
$ puredata check prod.csv contract.json --strict

# Get the health score of any file
$ puredata score mydata.csv
MendScore: 87/100

# Open the interactive dashboard
$ puredata dashboard mydata.csv
clean
Full 9-stage pipeline on any CSV, Excel, Parquet, or JSON file. Saves output and optional HTML report.
watch + check
Fit a contract once, validate forever. Exit code 1 on failure for easy CI/CD integration.
score + dashboard
Instant health score and self-contained HTML dashboard — no server needed.
Roadmap

What's Next

Where puredata is going — shipped, in progress, and planned.

AutoClean — 9-stage pipeline
Encoding, whitespace, types, dates, duplicates, categories, units, nulls, outliers — all shipped.
Live in v0.2
DataWatch — 7 silent checks
Schema, dtype, null rate, range, drift (dual-gate), cardinality, custom rules — all shipped.
Live in v0.2
CLI, Dashboard, Plugin System
Full terminal interface, self-contained HTML dashboard, entry-point plugin registry.
Live in v0.2
MLflow, W&B, DVC, sklearn integrations
Native connectors for the most popular ML infrastructure tools.
Live in v0.2
Streaming / chunked cleaning
Process datasets that don't fit in memory using chunked pandas reads or Polars lazy frames.
Next
LLM-powered category clustering
Use local LLMs (via Ollama) to cluster semantically similar categories beyond fuzzy string matching.
Planned
Spark / Dask backend
Scale the full pipeline to distributed dataframes without changing the one-line API.
Future
Visual contract editor
A web UI to inspect, edit, and version data contracts alongside the existing JSON format.
Future

Start Cleaning Data in 30 Seconds

Install puredata, clean your first dataset, and see the MendScore — all before your coffee gets cold.
