Cleaning Dirty Data: Solving Real-World Problems in Pandas

When it comes to real-world data, nothing is ever clean. Whether it’s malformed email addresses, inconsistent phone formats, or production logs riddled with gaps — your job as a data analyst is to tame the chaos. In this post, I share two sets of practical data cleaning and preparation problems I tackled using Python and Pandas, designed to simulate the challenges found in sales operations and manufacturing QA environments.

GitHub Repo: Pandas-Data-Cleaning-and-Preparation

Set 1: Filament Sales Data (Fila Heat April 2025)

This dataset mimics customer records, purchases, and material specifications from a filament sales business. The problems revolved around:

  • Customer cleanup: Removing duplicates, normalizing zip codes, validating emails.
  • Sales anomalies: Catching tax calculation errors, rounding inconsistencies, and duplicate store entries.
  • Material trends: Parsing complex product codes, detecting high-cost combinations, and calculating revenue quartiles.
  • Product segmentation: Using statistical thresholds and outlier detection to categorize products and customer spending behavior.
  • A/B test simulation: Randomly assigning 20% of customers to a test group using reproducible sampling.

These problems help build confidence in string manipulation, regex, rounding logic, groupby tricks, and custom categorization strategies.

Set 2: Spool Manufacturing Batch Log

This one’s closer to my heart: 30 days of batch production logs that reflect shift schedules, QA outcomes, and operator activities. Key tasks included:

  • Contact validation: Parsing phone numbers with extensions and extracting country codes, while validating emails using regex and logical rules.
  • Operator analytics: Grouping performance by shifts, finding overworked inspectors, and calculating standardized scrap rates via z-scores.
  • Production cleanup: Detecting duplicate machine-lot-material combinations and checking barcode formats.
  • Shift imbalance detection: Pivoting weekday-shift matrices to uncover scheduling holes and batch shortages.
  • Experimental design: Assigning batch groups for QA experiments and comparing results using visualizations.

This set hones deeper on data validation, missing value logic, outlier detection, and cross-dimensional grouping.

Why These Exercises Matter

Too often, practice datasets are either too clean or too theoretical. These problem sets mimic real mess:

You’ll work with inconsistent formats, unexpected edge cases, and missing values.
You’ll apply a variety of pandas techniques in one workflow: groupby, merge, pivot, sample, and apply.
You’ll get used to interpreting messy logs and building up your QA mindset — a vital soft skill in any data role.

Explore the GitHub Repo

The entire source code, problem breakdowns, datasets, and analysis are available on GitHub:

https://github.com/michellealzola/Pandas-Data-Cleaning-and-Preparation

Let’s Connect!

🎥 YouTube
👩‍💻 GitHub
💼 LinkedIn
📱 Instagram
📘 Facebook

Thanks so much for dropping by.