puredata — Automatic Data Cleaning & Drift Detection

Pillar I — AutoClean

Nine-Stage Cleaning Pipeline

Every dataset passes through nine stages in a fixed order: encoding is repaired before whitespace, types are coerced before dates are parsed, and duplicates are removed before categories are normalised. The pipeline is deterministic, reproducible, and auditable.

Encoding Repair → Whitespace Norm → Type Coercion → Date Standard → Duplicate Removal → Category Norm → Unit Norm → Null Imputation → Outlier Detection

STAGE 01

Encoding Repair

Strips BOM markers (U+FEFF) and zero-width spaces (U+200B), removes control characters, and normalises Unicode to NFC. Handles latin-1 mojibake automatically.

chardet · unicodedata
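As a rough illustration of the repairs listed for this stage, the sketch below uses only the standard library. The helper names `repair_text` and `fix_latin1_mojibake` are hypothetical and are not part of puredata's API; the real stage may detect encodings differently (for example via chardet).

```python
import unicodedata

BOM = "\ufeff"               # byte order mark
ZERO_WIDTH_SPACE = "\u200b"


def repair_text(value: str) -> str:
    """Strip BOM and zero-width spaces, drop control characters, normalise to NFC."""
    value = value.replace(BOM, "").replace(ZERO_WIDTH_SPACE, "")
    # Unicode category "Cc" is control characters; tab, newline and CR are kept
    value = "".join(
        ch for ch in value
        if unicodedata.category(ch) != "Cc" or ch in "\t\n\r"
    )
    return unicodedata.normalize("NFC", value)


def fix_latin1_mojibake(value: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as latin-1, when the round trip works."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake (or not recoverable this way): leave the value unchanged
        return value
```

The mojibake helper is deliberately conservative: if the round trip fails, the input comes back untouched, so correctly decoded text is not damaged.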

STAGE 02

Whitespace Normalisation

Strips leading/trailing whitespace, collapses repeated spaces, normalises line breaks. Applied to all string and object columns.

pandas str.strip · re.sub
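A minimal per-value sketch of this stage, assuming the three operations are applied in the order shown. The function name `normalise_whitespace` is hypothetical; in a DataFrame setting the same function could be mapped over each string column.

```python
import re


def normalise_whitespace(value: str) -> str:
    """Unify line breaks, collapse repeated spaces/tabs, strip each line."""
    # Normalise Windows and old Mac line endings to "\n"
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space
    value = re.sub(r"[ \t]+", " ", value)
    # Strip every line, then strip leading/trailing blank lines
    return "\n".join(line.strip() for line in value.split("\n")).strip()
```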

STAGE 03

Type Coercion

Detects columns that look numeric but are stored as strings (e.g. "42.0"), safely converts them. Preserves genuine string columns.

pd.to_numeric · dtype inference
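The "looks numeric" rule can be sketched as an all-or-nothing conversion: a column is converted only if every non-missing value parses as a number. This is an assumption about the rule, not puredata's actual heuristic, and `coerce_numeric` is a hypothetical name. The sketch works on plain lists so that it runs without pandas.

```python
from typing import List, Optional, Union


def coerce_numeric(values: List[Optional[str]]) -> List[Union[float, str, None]]:
    """Convert to floats only if every non-missing value parses as a number."""
    parsed: List[Union[float, str, None]] = []
    for v in values:
        if v is None or v.strip() == "":
            parsed.append(None)  # keep missing values missing
            continue
        try:
            parsed.append(float(v))
        except ValueError:
            # At least one genuine string: leave the whole column untouched
            return list(values)
    return parsed
```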

STAGE 04

Date Standardisation

Detects 12+ date formats via dateutil and converts them to ISO 8601-style datetime64 values. Columns where more than 50% of values are purely numeric are skipped.

python-dateutil · pandas Timestamp
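To make the numeric-column guard concrete, here is a simplified sketch. It swaps dateutil for `datetime.strptime` with a short, hypothetical format list so that it runs on the standard library alone; the function names are illustrative and the format order is an assumption (day-first is tried before year-first with slashes).

```python
from datetime import datetime
from typing import List, Optional

# Hypothetical format list; a real detector would cover many more
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d", "%d.%m.%Y"]


def parse_date(text: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def looks_numeric(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def standardise_dates(values: List[str]) -> List[str]:
    """Skip the column when more than 50% of values are purely numeric."""
    if not values:
        return []
    numeric_share = sum(looks_numeric(v) for v in values) / len(values)
    if numeric_share > 0.5:
        return list(values)
    # Unparseable values are left as they were
    return [parse_date(v) or v for v in values]
```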

STAGE 05

Duplicate Removal

Removes exact duplicate rows, keeping first occurrence. Target columns are excluded from deduplication to protect label integrity.

DataFrame.drop_duplicates

STAGE 06

Category Normalisation

Clusters inconsistent labels using fuzzy matching (RapidFuzz similarity ratio ≥ 85) plus prefix and abbreviation detection. For example, "M" → "Male" and "NY" → "New York".

rapidfuzz · prefix clustering
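The clustering idea can be sketched as follows. This is an illustration under several assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for RapidFuzz (so the threshold is 0.85 on a 0 to 1 scale rather than 85), it treats the most frequent label as canonical, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def is_match(label: str, canonical: str, threshold: float) -> bool:
    a, b = label.lower(), canonical.lower()
    if SequenceMatcher(None, a, b).ratio() >= threshold:
        return True                      # fuzzy match: "Femal" -> "Female"
    if a and b.startswith(a):
        return True                      # prefix: "M" -> "Male"
    initials = "".join(word[0] for word in b.split())
    return len(a) > 1 and a == initials  # abbreviation: "NY" -> "New York"


def normalise_categories(values: List[str], threshold: float = 0.85) -> List[str]:
    # Frequent labels are considered first, so they become the canonical forms;
    # ties are broken in favour of the longer label
    ordered = sorted(Counter(values).items(), key=lambda kv: (-kv[1], -len(kv[0])))
    canonicals: List[str] = []
    mapping: Dict[str, str] = {}
    for label, _count in ordered:
        target = next((c for c in canonicals if is_match(label, c, threshold)), None)
        if target is None:
            canonicals.append(label)
            target = label
        mapping[label] = target
    return [mapping[v] for v in values]
```

Note that the prefix rule is greedy: a single letter such as "M" would attach to whichever canonical label starting with "m" was registered first, which is one reason a production implementation would need extra safeguards.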

STAGE 07

Unit Normalisation

Detects mixed measurement units (kg/lbs, km/mi, °C/°F) within a column, converts all values to the dominant unit using exact conversion factors.

regex unit parsing · SI conversion
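A minimal sketch of dominant-unit conversion for one unit pair (kg and lb). The regular expression, the helper names, and the decision to return `None` for unparseable values are all assumptions made for illustration; only the kg/lb factor itself (1 lb = 0.45359237 kg, exact by definition) is a fixed fact.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

KG_PER_LB = 0.45359237  # exact by international definition
PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*(kg|lbs?)\s*$", re.IGNORECASE)


def parse_weight(value: str) -> Optional[Tuple[float, str]]:
    """Split '154 lbs' into (154.0, 'lb'); return None if it does not parse."""
    match = PATTERN.match(value)
    if match is None:
        return None
    unit = "kg" if match.group(2).lower() == "kg" else "lb"
    return float(match.group(1)), unit


def normalise_weights(values: List[str]) -> Tuple[Optional[str], List[Optional[float]]]:
    """Convert every parseable value to the most common unit in the column."""
    parsed = [parse_weight(v) for v in values]
    units = Counter(p[1] for p in parsed if p is not None)
    if not units:
        return None, [None] * len(values)
    dominant = units.most_common(1)[0][0]
    converted: List[Optional[float]] = []
    for p in parsed:
        if p is None:
            converted.append(None)
        elif p[1] == dominant:
            converted.append(p[0])
        elif dominant == "kg":
            converted.append(p[0] * KG_PER_LB)   # lb -> kg
        else:
            converted.append(p[0] / KG_PER_LB)   # kg -> lb
    return dominant, converted
```

Temperature pairs such as °C/°F would need an offset as well as a factor, so they do not fit the multiply-or-divide branch shown here.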

STAGE 08

Null Imputation

Adaptive strategy: KNNImputer (0–40% missing) → IterativeImputer/MICE (40–99%) → zero fill (100%). Categorical nulls filled with mode.

sklearn KNNImputer · IterativeImputer
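The adaptive choice can be read as a simple dispatch on the missing fraction. The sketch below is one interpretation: the text does not say which strategy applies at exactly 40% or between 99% and 100%, so the boundary handling here (40% goes to KNN, anything below 100% goes to the iterative imputer) is an assumption, as are the function names and the returned labels.

```python
from collections import Counter
from typing import List, Optional


def choose_imputer(missing_fraction: float) -> str:
    """Pick an imputation strategy for a numeric column from its missing share."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if missing_fraction == 0.0:
        return "none"        # nothing to impute (assumed behaviour)
    if missing_fraction >= 1.0:
        return "zero_fill"   # no observed values to learn from
    if missing_fraction <= 0.40:
        return "knn"         # e.g. sklearn.impute.KNNImputer
    return "iterative"       # e.g. sklearn.impute.IterativeImputer


def fill_with_mode(values: List[Optional[str]]) -> List[Optional[str]]:
    """Fill missing categorical values with the most frequent observed value."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```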

STAGE 09

Ensemble Outlier Detection

Four detectors vote: IQR (1.5×), Z-score (|z|>3), Isolation Forest, Local Outlier Factor. A row is flagged only when the vote fraction exceeds your threshold. No single algorithm dominates.

IQR · Z-score · IsolationForest · LOF
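The voting rule can be sketched with two of the four detectors. This simplified example implements only IQR and Z-score using the standard library; Isolation Forest and Local Outlier Factor would normally come from scikit-learn and are omitted. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics
from typing import Callable, List, Sequence

Detector = Callable[[Sequence[float]], List[bool]]


def iqr_flags(values: Sequence[float], k: float = 1.5) -> List[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def zscore_flags(values: Sequence[float], limit: float = 3.0) -> List[bool]:
    """Flag values more than `limit` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mean) / sd > limit for v in values]


def ensemble_flags(
    values: Sequence[float], detectors: Sequence[Detector], threshold: float = 0.5
) -> List[bool]:
    """Flag a value only when the share of detectors voting 'outlier' exceeds threshold."""
    votes = [detector(values) for detector in detectors]
    return [sum(column) / len(detectors) > threshold for column in zip(*votes)]
```

With two detectors and a threshold of 0.6, both must agree before a value is flagged; with four detectors the same threshold would require three votes.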

      PYTHON
      
import puredata
from puredata.core.clean import AutoClean, AutoCleanConfig

# One-line clean with full report
clean_df, report = puredata.clean(df)

# Fine-grained control
config = AutoCleanConfig(
    fix_nulls=True, fix_outliers=True,
    fix_categories=True, fix_units=True,
    outlier_threshold=0.6,  # 60% vote needed to flag
    target_col="price",     # protect this column
)
engine = AutoClean(config=config)
clean_df, report = engine.clean(df)

print(report.mend_score)       # 0–100 health score
print(report.summary())        # human-readable text
report.to_html("report.html")  # full HTML report

Pillar II — DataWatch

Seven Silent Compatibility Checks

Fit a contract on your training data. Validate any production batch silently. DataWatch catches the silent killers — schema mutations, null spikes, distribution drift — before they corrupt predictions. Every check returns pass / warn / fail with an exact message.

Schema Validation

HARD FAIL

Detects columns that appeared, disappeared, or changed dtype between training and production. Column renames and type promotions are both caught.

Dtype Compatibility

WARN

Checks that the physical dtype matches the contract profile. A float64 column that becomes object — from a corrupted CSV — is immediately flagged.

Null Rate Spike

WARN

Compares null fraction against the baseline profile. Flags if the new null rate exceeds the reference by more than the configured tolerance (default 0.05).
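The comparison described here amounts to a one-line rule. A minimal sketch, assuming the tolerance is an absolute difference in null fraction; `check_null_spike` and its return shape are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Fraction of missing (None) values; 0.0 for an empty batch."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def check_null_spike(
    baseline_rate: float, batch: Sequence[Optional[object]], tolerance: float = 0.05
) -> Tuple[str, str]:
    """Warn when the batch null rate exceeds the baseline by more than the tolerance."""
    rate = null_rate(batch)
    if rate - baseline_rate > tolerance:
        return "warn", (
            f"null rate {rate:.1%} exceeds baseline {baseline_rate:.1%} "
            f"by more than {tolerance:.1%}"
        )
    return "pass", f"null rate {rate:.1%} is within tolerance"
```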

Range Violation

FAIL

Enforces [min, max] from the contract for numeric columns. Returns the exact out-of-range count and the violated bound.
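A range check of this kind could look like the sketch below, which reports both the count and the violated bound. The function name and message format are illustrative assumptions.

```python
from typing import Sequence, Tuple


def check_range(values: Sequence[float], lower: float, upper: float) -> Tuple[str, str]:
    """Fail when any value falls outside the contract's [lower, upper] interval."""
    below = sum(v < lower for v in values)
    above = sum(v > upper for v in values)
    if not below and not above:
        return "pass", f"all values within [{lower}, {upper}]"
    parts = []
    if below:
        parts.append(f"{below} below min {lower}")
    if above:
        parts.append(f"{above} above max {upper}")
    return "fail", ", ".join(parts)
```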

Distribution Drift — Dual Gate

FAIL

Two independent tests must both agree: PSI (Population Stability Index) measures bin-level shifts; the KS (Kolmogorov–Smirnov) test measures CDF distance. Requiring both reduces false positives.

Cardinality Violation

WARN

Detects unseen categories and cardinality explosions. A new label that never appeared in training is reported by name, not just count.
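A sketch of the two conditions, assuming a "cardinality explosion" means the number of distinct values growing past a multiple of the training count. The `max_growth` factor, the function name, and the message wording are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple


def check_cardinality(
    known: Set[str], batch: Iterable[Optional[str]], max_growth: float = 2.0
) -> Tuple[str, str]:
    """Warn on a cardinality explosion or on labels never seen in training."""
    seen = {v for v in batch if v is not None}
    if len(seen) > max_growth * len(known):
        return "warn", f"cardinality grew from {len(known)} to {len(seen)} distinct values"
    unseen = sorted(seen - known)
    if unseen:
        # Report the new labels by name, not only their count
        return "warn", "unseen categories: " + ", ".join(unseen)
    return "pass", "no unseen categories"
```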

Custom Rules

EXTENSIBLE

Attach any callable as a rule: lambda df, col: (df[col] > 0).all(). Rules are stored in the JSON contract and re-evaluated on every check call.

      PYTHON
      
import puredata

# Fit contract on training data
contract = puredata.watch(train_df)
contract.save("contract.json")  # persist for production

# Validate production batch
result = puredata.check(prod_df, contract)
print(result.compatibility_score)  # 0–100
print(result.passed)               # True / False

# Raise automatically in CI/CD
result.raise_if_failed()  # DataCompatibilityError on failure

# Load and re-validate
contract = puredata.DataContract.load("contract.json")
result = puredata.check(new_df, contract, mode="strict")

Comparison

Why puredata?

Compared to the most popular data quality tools in the Python ecosystem.

CAPABILITY	puredata	pandas	pyjanitor	great_expectations	evidently
Auto null imputation (adaptive)	✓	—	—	—	—
Ensemble outlier detection	✓	—	—	—	—
Fuzzy category normalisation	✓	—	~	—	—
Mixed unit normalisation	✓	—	—	—	—
Encoding repair (BOM, ZWS)	✓	—	—	—	—
Dual-gate drift detection	✓	—	—	—	✓
JSON-persistent data contract	✓	—	—	~	~
One-line API	✓	—	~	—	—
MendScore + repair report	✓	—	—	—	—
HTML / JSON / CSV reports	✓	—	—	✓	✓
sklearn-compatible pipeline	✓	—	—	—	—
MLflow / W&B / DVC integration	✓	—	—	~	~
CLI (clean / watch / check)	✓	—	—	—	—
Plugin entry-point system	✓	—	—	—	—

Performance

Benchmark Numbers

Measured on a 2024 MacBook Pro M3, single-threaded, pandas 2.x.

CLEAN — 10K ROWS × 20 COLS

0.8s

Full 9-stage pipeline including outlier detection

CLEAN — 100K ROWS × 20 COLS

4.2s

KNN imputation dominates at this scale

WATCH FIT — 50K ROWS

0.3s

Profile once, validate forever

WATCH CHECK — 10K ROWS

0.1s

All 7 checks including KS test and PSI

Integrations

Works With Your Entire Stack

Native integrations for experiment tracking and pipeline tools. puredata slots into your workflow without ceremony.

🔶

MLflow

Log MendScore, fix counts, and the full report JSON as artifacts automatically on every experiment run.

🟣

Weights & Biases

Stream cleaning metrics to W&B runs. Visualise data health alongside model accuracy and loss.

🔵

DVC

Write cleaning metrics to dvc.json for pipeline tracking and experiment comparison across commits.

🟢

Polars

Pass a Polars DataFrame directly — puredata converts, cleans, and returns pandas or polars as needed.

🟡

scikit-learn

MendPipeline wraps AutoClean + DataWatch in a sklearn-compatible fit_transform / transform interface.

📁

File I/O

Native support for CSV, Excel, Parquet, and JSON via both Python API and CLI — no extra setup.

      PYTHON — INTEGRATIONS
      
import mlflow
import puredata
from puredata import MendPipeline
from puredata.integrations.mlflow import log_clean_report

with mlflow.start_run():
    clean_df, report = puredata.clean(df)
    log_clean_report(report)  # metrics + artifact logged

# sklearn-compatible pipeline
pipeline = MendPipeline(watch_mode="strict")
X_clean = pipeline.fit_transform(X_train)
X_prod = pipeline.transform(X_prod)  # raises if drift detected

Command Line

Full CLI — No Python Required

Every feature is accessible from the terminal. Clean CSVs, fit contracts, validate production data — all without writing a single line of Python.

      TERMINAL
      
# Clean a CSV and save the report
$ puredata clean data.csv -o clean.csv --report-html report.html

# Fit a contract on training data
$ puredata watch train.csv --contract contract.json

# Validate production batch (exit code 1 if failures)
$ puredata check prod.csv contract.json --strict

# Get the health score of any file
$ puredata score mydata.csv
MendScore: 87/100

# Open the interactive dashboard
$ puredata dashboard mydata.csv

clean

Full 9-stage pipeline on any CSV, Excel, Parquet, or JSON file. Saves output and optional HTML report.

watch + check

Fit a contract once, validate forever. Exit code 1 on failure for easy CI/CD integration.

score + dashboard

Instant health score and self-contained HTML dashboard — no server needed.

Roadmap

What's Next

Where puredata is going — shipped, in progress, and planned.

✓

AutoClean — 9-stage pipeline

Encoding, whitespace, types, dates, duplicates, categories, units, nulls, outliers — all shipped.

Live in v0.2

✓

DataWatch — 7 silent checks

Schema, dtype, null rate, range, drift (dual-gate), cardinality, custom rules — all shipped.

Live in v0.2

✓

CLI, Dashboard, Plugin System

Full terminal interface, self-contained HTML dashboard, entry-point plugin registry.

Live in v0.2

✓

MLflow, W&B, DVC, sklearn integrations

Native connectors for the most popular ML infrastructure tools.

Live in v0.2

→

Streaming / chunked cleaning

Process datasets that don't fit in memory using chunked pandas reads or Polars lazy frames.

→

LLM-powered category clustering

Use local LLMs (via Ollama) to cluster semantically similar categories beyond fuzzy string matching.

Planned

○

Spark / Dask backend

Scale the full pipeline to distributed dataframes without changing the one-line API.

Future

○

Visual contract editor

A web UI to inspect, edit, and version data contracts alongside the existing JSON format.

Future
