Data Cleaning Skill
Clean CSV and Excel data with reproducible, documented steps and a full audit trail. Ideal for analytics prep, database imports, and report generation.
Dirty data is the silent killer of analytics projects. A CRM export with inconsistent date formats, a supplier spreadsheet mixing UTF-8 and Latin-1 encoded characters, or a sales report where “New York”, “new york”, and “NY” all refer to the same city — these problems don’t announce themselves. They quietly corrupt aggregations, break database imports, and produce reports that look plausible but are wrong.
The Data Cleaning skill applies a structured, reproducible sequence of transformations to CSV and Excel files, producing a cleaned output alongside a step-by-step audit trail that documents every change made. This matters for compliance, for debugging, and for the next person who inherits your dataset.
What it does
- Encoding detection and normalization: Identifies whether a file is UTF-8, Latin-1 (ISO-8859-1), Windows-1252, or another encoding, then converts to a consistent target encoding. Flags characters that couldn’t be converted rather than silently dropping them.
- Type inference and coercion: Detects columns that contain dates stored as strings (e.g., `"03/15/2026"`, `"15-Mar-26"`, `"2026.03.15"`) and standardizes them to ISO 8601 (`2026-03-15`). Similarly normalizes numeric columns that contain currency symbols, commas, or percentage signs.
- Deduplication with configurable logic: Removes duplicate rows based on a primary key column or a composite key you specify. When duplicates exist, you choose the resolution strategy: keep first occurrence, keep last, keep the row with the most populated fields, or flag all duplicates for manual review.
- Null and placeholder handling: Distinguishes between genuinely empty cells and placeholder values like `"N/A"`, `"#N/A"`, `"-"`, `"NULL"`, or `"n/a"` — normalizing all of them to true nulls (or a value you specify) so downstream tools handle them consistently.
- Column name standardization: Converts headers to a consistent naming convention (snake_case, camelCase, or Title Case), strips leading/trailing whitespace, and removes special characters that break SQL imports or pandas DataFrames.
- Audit trail generation: Produces a separate log file listing every transformation applied, the row and column affected, the original value, and the new value. This makes the cleaning process fully reproducible and reviewable.
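The transformations above can be approximated in a few lines of pandas. This is a minimal sketch, not the skill’s actual implementation — the column names and placeholder list are illustrative assumptions borrowed from the sales example later in this page:

```python
import io
import pandas as pd

# Toy in-memory CSV standing in for a real export file.
raw = io.StringIO(
    "Close Date,Revenue,region\n"
    '03/15/2026,"$12,450.00",US-West\n'
    "03/15/2026,N/A,us west\n"
)
df = pd.read_csv(raw)

# Column name standardization: snake_case, no stray whitespace.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Placeholder handling: map common junk values to true nulls.
placeholders = ["N/A", "#N/A", "-", "NULL", "n/a"]
df = df.replace(placeholders, pd.NA)

# Date normalization to ISO 8601.
df["close_date"] = pd.to_datetime(
    df["close_date"], format="%m/%d/%Y"
).dt.strftime("%Y-%m-%d")

# Numeric coercion: strip currency symbols and thousands separators.
df["revenue"] = pd.to_numeric(
    df["revenue"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Categorical normalization: lowercase, underscores for spaces/hyphens.
df["region"] = df["region"].str.lower().str.replace(r"[\s-]+", "_", regex=True)

print(df)
```

The sketch omits the audit trail; a pattern for capturing one is shown further down.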
Best for
- Pre-import database preparation: Cleaning a CSV before loading it into PostgreSQL, MySQL, or BigQuery, where type mismatches and encoding errors cause import failures.
- Analytics pipeline input: Ensuring that the data feeding a dashboard or model is consistent across monthly exports, even when the source system changes its export format.
- Report generation from external sources: Combining data from multiple vendors or departments where each team uses different conventions for dates, currencies, and categorical values.
- Compliance and audit requirements: Situations where you need to demonstrate exactly what changed in a dataset and why — for GDPR data processing records, financial audits, or data governance reviews.
This skill is rated Medium risk because data transformations are destructive by nature — the original values are replaced. Always work on a copy of your source file, and review the audit trail before treating the cleaned output as authoritative.
How to use (example)
Scenario: Cleaning a monthly sales export before loading into a data warehouse
Input file: sales_march_2026.csv (14,832 rows, exported from Salesforce)
Problems identified on inspection:
- `close_date` column has three different formats: `MM/DD/YYYY`, `YYYY-MM-DD`, and `Month DD, YYYY`
- `revenue` column contains values like `"$12,450.00"`, `"12450"`, and `"#VALUE!"`
- `region` column has `"US-West"`, `"us west"`, `"USWest"` — all meaning the same thing
- 847 duplicate rows (same `opportunity_id`, different `updated_at` timestamps)
- File is Windows-1252 encoded; warehouse expects UTF-8
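Problems like these can be confirmed programmatically before committing to a cleaning run. A minimal pandas inspection pass over a toy version of the data (the column names mirror the scenario; the three rows are invented):

```python
import io
import pandas as pd

raw = io.StringIO(
    "opportunity_id,close_date,revenue,region\n"
    '1,03/15/2026,"$12,450.00",US-West\n'
    "1,2026-03-16,12450,us west\n"
    '2,"March 17, 2026",#VALUE!,USWest\n'
)
df = pd.read_csv(raw)

# How many distinct spellings does each categorical column have?
print(df["region"].value_counts())

# Which rows share a primary key?  These are dedup candidates.
dupes = df[df.duplicated("opportunity_id", keep=False)]
print(f"{len(dupes)} rows share an opportunity_id")

# Which revenue values fail numeric parsing?  Likely Excel error strings.
bad = df[pd.to_numeric(
    df["revenue"].str.replace(r"[$,]", "", regex=True), errors="coerce"
).isna()]
print(bad[["opportunity_id", "revenue"]])
```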
Prompt to the skill:
File: sales_march_2026.csv
Encoding: auto-detect, convert to UTF-8
Date columns: close_date — standardize to YYYY-MM-DD
Numeric columns: revenue — strip currency symbols and commas, convert to float, replace "#VALUE!" with null
Categorical normalization: region — lowercase, replace spaces and hyphens with underscore
Deduplication: key=opportunity_id, keep=last (most recent updated_at)
Output: cleaned_sales_march_2026.csv + audit_trail.csv
Output summary:
CLEANING REPORT — sales_march_2026.csv
Rows in: 14,832
Rows out: 13,985 (847 duplicates removed, kept last by updated_at)
Encoding: Windows-1252 → UTF-8 (3 characters substituted, see audit rows 4421, 7803, 11204)
Date fixes: 2,341 rows in close_date normalized to YYYY-MM-DD
Revenue fixes: 14,832 rows stripped of "$" and ","; 12 "#VALUE!" → null
Region fixes: 9 distinct variants → 3 canonical values (us_west, us_east, emea)
Column renames: CloseDate → close_date, Revenue__c → revenue, Region__c → region
Audit trail written to: audit_trail.csv (14,832 rows × 6 columns)
The audit trail CSV has columns: row_number, column, original_value, cleaned_value, transformation, timestamp. Every change is traceable.
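The six-column trail layout is easy to reproduce in your own scripts if you want the same traceability outside the skill. A sketch, where `clean_with_audit` is a hypothetical helper, not the skill’s API:

```python
import datetime
import pandas as pd

def clean_with_audit(df, column, transform, name):
    """Apply `transform` to one column, logging every changed cell."""
    trail = []
    for idx, original in df[column].items():
        cleaned = transform(original)
        if cleaned != original:
            trail.append({
                "row_number": idx,
                "column": column,
                "original_value": original,
                "cleaned_value": cleaned,
                "transformation": name,
                "timestamp": datetime.datetime.now(
                    datetime.timezone.utc
                ).isoformat(),
            })
            df.at[idx, column] = cleaned
    return pd.DataFrame(trail)

df = pd.DataFrame({"region": ["US-West", "us_west", "us west"]})
audit = clean_with_audit(
    df, "region",
    lambda v: v.lower().replace("-", "_").replace(" ", "_"),
    "normalize_region",
)
print(audit[["row_number", "original_value", "cleaned_value"]])
```

Note that unchanged cells produce no trail row, so the trail doubles as a summary of how dirty each column actually was.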
Permissions & Risks
Required permissions: Files
Risk level: Medium
Key risks to understand:
- Destructive transformations: Cleaning replaces original values. If your dedup logic is wrong (e.g., you keep “first” when you should keep “last”), you lose data. Always keep the original file untouched and write output to a new file.
- Encoding conversion loss: Converting to UTF-8 from a misidentified source encoding can produce substitution characters (`?` or `\ufffd`) for bytes that don’t decode cleanly. The skill flags these, but you need to review them — especially for names with accented characters.
- Dedup logic failures: Composite key deduplication can silently drop valid rows if your key columns contain nulls. Specify `null_key_behavior: keep` to preserve rows where the key is incomplete rather than treating them as duplicates of each other.
- Type coercion surprises: A column that looks numeric might contain meaningful text codes (e.g., product codes like `"007"` that should stay as strings). Always specify which columns to coerce rather than letting the skill infer all types automatically.
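The type-coercion risk is easy to demonstrate with pandas. Reading everything as strings first, then coercing only the columns you name, preserves codes like `"007"` (the column names here are invented for illustration):

```python
import io
import pandas as pd

raw = io.StringIO("product_code,revenue\n007,12450\n042,9800\n")

inferred = pd.read_csv(raw)             # pandas infers product_code as integer
raw.seek(0)
explicit = pd.read_csv(raw, dtype=str)  # everything stays a string
explicit["revenue"] = pd.to_numeric(explicit["revenue"])  # coerce only revenue

print(inferred["product_code"].tolist())   # [7, 42] — leading zeros lost
print(explicit["product_code"].tolist())   # ['007', '042'] — preserved
```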
Recommended guardrails:
- Always output to a new file — never overwrite the source.
- Review the audit trail before promoting cleaned data to production.
- Spot-check 20–30 rows from the cleaned output against the original, especially around the edge cases the skill flagged.
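The spot-check can be semi-automated: sample rows from the cleaned output and line them up against the originals by a stable key. A sketch (the key and column names are illustrative):

```python
import pandas as pd

original = pd.DataFrame({
    "opportunity_id": [1, 2, 3],
    "revenue": ["$100.00", "$250.50", "#VALUE!"],
})
cleaned = pd.DataFrame({
    "opportunity_id": [1, 2, 3],
    "revenue": [100.0, 250.5, None],
})

# Sample a handful of cleaned rows and join back to the source for review.
sample = cleaned.sample(n=2, random_state=0)
check = sample.merge(
    original, on="opportunity_id", suffixes=("_cleaned", "_original")
)
print(check)
```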
Troubleshooting
- Encoding detection is wrong — output has garbled characters
  Auto-detection works well for pure ASCII and UTF-8 but can misidentify Latin-1 and Windows-1252 (they’re similar). If you know the source encoding, specify it explicitly: `encoding: windows-1252`. If you don’t know, open the file in a hex editor and look for bytes in the `0x80`–`0x9F` range — those are Windows-1252 specific.
- Deduplication removed rows it shouldn’t have
  Check whether your key column contains nulls. Two rows with a null primary key are treated as duplicates of each other by default. Set `null_key_behavior: keep` to preserve them. Also verify the `keep` strategy — `keep: last` requires a sortable timestamp column; if that column is also null, the result is undefined.
- Date normalization produced wrong dates
  Ambiguous dates like `03/04/2026` could be March 4 or April 3 depending on locale. Specify `date_locale: en-US` (MM/DD/YYYY) or `date_locale: en-GB` (DD/MM/YYYY) to resolve ambiguity. The skill will flag dates it couldn’t parse unambiguously rather than guessing.
- Column name standardization broke a downstream script
  If a downstream SQL query or Python script references the original column names, renaming them will break it. Use `rename_columns: false` to skip header normalization, or provide an explicit mapping: `rename: {"CloseDate": "close_date"}` to control exactly which columns change.
- File is too large — skill times out
  For files over 100,000 rows, process in chunks. Split by row count (`split -l 50000 file.csv chunk_`) or by a categorical column (e.g., one file per region). Clean each chunk separately, then concatenate the cleaned outputs.
- “#VALUE!” and other Excel error strings not being caught
  Excel exports can contain a range of error strings: `#VALUE!`, `#REF!`, `#DIV/0!`, `#N/A`, `#NAME?`. Specify `excel_errors: null` to replace all of them with true nulls, or `excel_errors: keep` if you want to preserve them as strings for manual review.
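If you drop down to pandas for the chunked workflow, the same pattern fits in a few lines: stream the CSV in fixed-size chunks, clean each, and append to the output so memory stays bounded. A sketch using a tiny in-memory file (the 4-row chunk size and toy data are illustrative; the Excel error list matches the one above):

```python
import io
import pandas as pd

EXCEL_ERRORS = ["#VALUE!", "#REF!", "#DIV/0!", "#N/A", "#NAME?"]

# Toy 9-row CSV where every third revenue cell is an Excel error string.
raw = io.StringIO("id,revenue\n" + "\n".join(
    f"{i},{'#REF!' if i % 3 == 0 else i * 10}" for i in range(9)
))

out = io.StringIO()
for i, chunk in enumerate(pd.read_csv(raw, chunksize=4, na_values=EXCEL_ERRORS)):
    # ...per-chunk cleaning steps would go here...
    chunk.to_csv(out, index=False, header=(i == 0))  # header on first chunk only

out.seek(0)
cleaned = pd.read_csv(out)
print(cleaned["revenue"].isna().sum())  # rows where an Excel error became null
```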
Alternatives
- OpenRefine: Open-source desktop tool with a visual interface for exploring and transforming messy data. Excellent for one-off cleaning tasks where you want to interactively inspect the data before committing to transformations. Slower for large files and doesn’t integrate into automated pipelines.
- pandas scripts: Writing a Python script with pandas gives you maximum control and is fully reproducible via version control. The tradeoff is that it requires Python knowledge and setup time. Best for teams with engineering resources who need cleaning logic embedded in a data pipeline.
- Trifacta Wrangler (now Alteryx Designer Cloud): Enterprise-grade data preparation platform with a visual recipe builder and ML-assisted transformation suggestions. Handles very large datasets and integrates with cloud data warehouses. Significant cost and onboarding overhead — overkill for ad-hoc cleaning tasks.
The Data Cleaning skill sits between OpenRefine (interactive, visual) and pandas (code-first, flexible). It’s best when you need reproducible, documented cleaning without writing code.
Source
See provider documentation for installation and configuration details.
Related
Skills:
- Spreadsheet Formulas — generate formulas to validate or transform data within a spreadsheet
- Log Analyzer — parse and clean structured log exports
Guides:
- Best Skills for Data Work — overview of the full data preparation toolkit
- Safe Skill Workflows — how to handle file-based transformations without data loss