puredata cleans any dirty dataset automatically — missing values, outliers, encoding bugs, mixed units, fuzzy categories. Then silently watches for data drift in production. Nine cleaning stages. Seven compatibility checks. Zero configuration.
Every dataset passes through nine intelligent stages in order. Encoding is fixed before whitespace, types before dates, duplicates before category normalisation. Deterministic, reproducible, auditable.
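Why the order matters is easiest to see in a small sketch. The code below is plain Python with no dependencies, and it is not puredata's implementation: `fix_encoding`, `strip_whitespace`, `coerce_types`, `drop_duplicates`, `STAGES`, and `run_pipeline` are hypothetical names, and only four of the nine stages are shown.

```python
# Illustrative sketch only; all names here are assumptions, not puredata's API.

JUNK_CHARS = {"\ufeff", "\u200b"}  # byte-order mark, zero-width space


def _map_cells(rows, fn):
    """Apply fn to every string cell, leaving other values untouched."""
    return [
        {k: fn(v) if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def fix_encoding(rows):
    """Remove BOM and zero-width characters from string cells."""
    return _map_cells(rows, lambda s: "".join(c for c in s if c not in JUNK_CHARS))


def strip_whitespace(rows):
    """Trim leading and trailing whitespace from string cells."""
    return _map_cells(rows, str.strip)


def coerce_types(rows):
    """Turn numeric-looking strings into floats; keep everything else."""
    def convert(s):
        try:
            return float(s)
        except ValueError:
            return s
    return _map_cells(rows, convert)


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Fixed order: encoding, then whitespace, then types, then duplicates.
STAGES = [fix_encoding, strip_whitespace, coerce_types, drop_duplicates]


def run_pipeline(rows, stages=STAGES):
    """Run the stages in order and record row counts for an audit trail."""
    audit = []
    for stage in stages:
        rows_in = len(rows)
        rows = stage(rows)
        audit.append((stage.__name__, rows_in, len(rows)))
    return rows, audit


raw = [
    {"name": "\ufeffAlice ", "age": " 30"},
    {"name": "Alice", "age": "30\u200b"},
    {"name": "Bob", "age": "n/a"},
]
cleaned, audit = run_pipeline(raw)
```

The two Alice rows only collapse into one because encoding and whitespace are repaired before duplicates are checked. Moving `drop_duplicates` to the front of the list would leave both rows in the output.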
Fit a contract on your training data. Validate any production batch silently. DataWatch catches the silent killers — schema mutations, null spikes, distribution drift — before they corrupt predictions. Every check returns pass / warn / fail with an exact message.
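The fit-then-validate idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not DataWatch itself: `fit_contract`, `check_batch`, the `null_tol` and `drift_z` thresholds, and the z-score test on the column mean are all assumptions chosen for illustration.

```python
# Illustrative sketch only; names and thresholds are assumptions.
import json
from statistics import mean, pstdev


def _column_stats(rows, col):
    """Return (null_rate, numeric_values) for one column."""
    values = [row.get(col) for row in rows]
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return 1 - len(present) / len(values), numeric


def fit_contract(rows):
    """Record per-column null rate, mean and std from training rows."""
    contract = {"columns": {}}
    for col in sorted({k for row in rows for k in row}):
        null_rate, numeric = _column_stats(rows, col)
        contract["columns"][col] = {
            "null_rate": null_rate,
            "mean": mean(numeric) if numeric else None,
            "std": pstdev(numeric) if numeric else None,
        }
    return contract


def check_batch(contract, rows, null_tol=0.10, drift_z=3.0):
    """Compare a batch to the contract; return (status, message) pairs."""
    results = []
    expected = set(contract["columns"])
    seen = {k for row in rows for k in row}
    if seen != expected:
        results.append((
            "fail",
            f"schema: missing={sorted(expected - seen)} "
            f"extra={sorted(seen - expected)}",
        ))
    for col, stats in contract["columns"].items():
        if col not in seen:
            continue
        null_rate, numeric = _column_stats(rows, col)
        if null_rate - stats["null_rate"] > null_tol:
            results.append((
                "warn",
                f"{col}: null rate {null_rate:.2f}, "
                f"contract {stats['null_rate']:.2f}",
            ))
        if numeric and stats["std"]:
            z = abs(mean(numeric) - stats["mean"]) / stats["std"]
            if z > drift_z:
                results.append(("fail", f"{col}: mean shifted by {z:.1f} std"))
    return results or [("pass", "all checks passed")]


train = [
    {"price": 10.0, "qty": 1},
    {"price": 12.0, "qty": 2},
    {"price": 11.0, "qty": 3},
    {"price": 9.0, "qty": 2},
]
# Round-trip through JSON to show the contract is persistable.
contract = json.loads(json.dumps(fit_contract(train)))

drifted = [{"price": 100.0, "qty": 1}, {"price": 120.0, "qty": 2}]
report = check_batch(contract, drifted)
```

Each result carries a status and a message naming the column and the observed value, so a failing batch can be traced to a specific check. Because the contract is an ordinary dictionary of numbers, it survives the JSON round trip unchanged.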
A rule can be written as a callable such as `lambda df, col: (df[col] > 0).all()`. Rules are stored in the JSON contract and re-evaluated on every check call.

How puredata compares to the most popular data quality tools in the Python ecosystem.
| CAPABILITY | puredata | pandas | pyjanitor | great_expectations | evidently |
|---|---|---|---|---|---|
| Auto null imputation (adaptive) | ✓ | — | — | — | — |
| Ensemble outlier detection | ✓ | — | — | — | — |
| Fuzzy category normalisation | ✓ | — | ~ | — | — |
| Mixed unit normalisation | ✓ | — | — | — | — |
| Encoding repair (BOM, ZWS) | ✓ | — | — | — | — |
| Dual-gate drift detection | ✓ | — | — | — | ✓ |
| JSON-persistent data contract | ✓ | — | — | ~ | ~ |
| One-line API | ✓ | — | ~ | — | — |
| MendScore + repair report | ✓ | — | — | — | — |
| HTML / JSON / CSV reports | ✓ | — | — | ✓ | ✓ |
| sklearn-compatible pipeline | ✓ | — | — | — | — |
| MLflow / W&B / DVC integration | ✓ | — | — | ~ | ~ |
| CLI (clean / watch / check) | ✓ | — | — | — | — |
| Plugin entry-point system | ✓ | — | — | — | — |
Measured on a 2024 MacBook Pro M3, single-threaded, pandas 2.x.
Native integrations for experiment tracking and pipeline tools. puredata slots into your workflow without ceremony.
Every feature is accessible from the terminal. Clean CSVs, fit contracts, validate production data — all without writing a single line of Python.
Where puredata is going — shipped, in progress, and planned.
Install puredata, clean your first dataset, and see the MendScore — all before your coffee gets cold.