When it comes to real-world data, nothing is ever clean. Whether it’s malformed email addresses, inconsistent phone formats, or production logs riddled with gaps — your job as a data analyst is to tame the chaos. In this post, I share two sets of practical data cleaning and preparation problems I tackled using Python and Pandas, designed to simulate the challenges found in sales operations and manufacturing QA environments.

GitHub Repo: Pandas-Data-Cleaning-and-Preparation
Set 1: Filament Sales Data (Fila Heat April 2025)
This dataset mimics customer records, purchases, and material specifications from a filament sales business. The problems revolved around:
- Customer cleanup: Removing duplicates, normalizing zip codes, validating emails.
- Sales anomalies: Catching tax calculation errors, rounding inconsistencies, and duplicate store entries.
- Material trends: Parsing complex product codes, detecting high-cost combinations, and calculating revenue quartiles.
- Product segmentation: Using statistical thresholds and outlier detection to categorize products and customer spending behavior.
- A/B test simulation: Randomly assigning 20% of customers to a test group using reproducible sampling.
These problems help build confidence in string manipulation, regex, rounding logic, groupby tricks, and custom categorization strategies.
Set 2: Spool Manufacturing Batch Log
This one’s closer to my heart: 30 days of batch production logs that reflect shift schedules, QA outcomes, and operator activities. Key tasks included:
- Contact validation: Parsing phone numbers with extensions and extracting country codes, while validating emails using regex and logical rules.
- Operator analytics: Grouping performance by shifts, finding overworked inspectors, and calculating standardized scrap rates via z-scores.
- Production cleanup: Detecting duplicate machine-lot-material combinations and checking barcode formats.
- Shift imbalance detection: Pivoting weekday-shift matrices to uncover scheduling holes and batch shortages.
- Experimental design: Assigning batch groups for QA experiments and comparing results using visualizations.
This set hones deeper on data validation, missing value logic, outlier detection, and cross-dimensional grouping.
Why These Exercises Matter
Too often, practice datasets are either too clean or too theoretical. These problem sets mimic real mess:
You’ll work with inconsistent formats, unexpected edge cases, and missing values.
You’ll apply a variety of pandas techniques in one workflow: groupby, merge, pivot, sample, and apply.
You’ll get used to interpreting messy logs and building up your QA mindset — a vital soft skill in any data role.
Explore the GitHub Repo
The entire source code, problem breakdowns, datasets, and analysis are available on GitHub:
https://github.com/michellealzola/Pandas-Data-Cleaning-and-Preparation
Let’s Connect!

🎥 YouTube
👩💻 GitHub
💼 LinkedIn
📱 Instagram
📘 Facebook
Thanks so much for dropping by.


