Exploring Process Trends with Pandas: Descriptive Statistics and Data Insights from a Manufacturing Dataset

Overview

This blog dives into exploratory data analysis (EDA) using Pandas on a real-world manufacturing dataset from a continuous casting process. Covering descriptive statistics, correlation and covariance, ranking, and rolling analysis, this post is designed to help analysts, engineers, and data scientists understand how to use Pandas tools to generate meaningful process insights.

Dataset: ccm_rul_dataset.csv
Industry: Steel Manufacturing (Continuous Casting)
Focus: RUL (Remaining Useful Life), composition percentages, temperature, speed, and weight.

What You’ll Learn

How to profile large manufacturing datasets using summary statistics.
Ways to detect outliers, patterns, and process drifts.
How correlation and covariance reveal variable relationships.
Why ranking, percent changes, and rolling means help in production monitoring.
How missing data impacts analysis—and what to do about it.

Descriptive Statistics That Reveal the Hidden Story

Steel Weight (steel_weight, tonn)

Mean ≈ 163.94
Median ≈ 163.30
IQR ≈ 6.22
Suggests a symmetric and consistent distribution.

RUL Range

Over 8.47 million → Huge spread in lifespan values.
May indicate outliers or long historical tracking.

Technical Trim (technical_trim, tonn)

Only 76 valid records (0.43%)
Sparse column; unreliable for modeling, but great for triggering binary quality control events.

Outliers and Extremes

Max Temperature (temperature_measurement1)

Occurred on 2020-01-05
Can indicate critical process moments or thermal stress.

High C, % (Carbon Content)

Top 25% → 4,358 records
Might reflect specialty steels, or risk non-compliance if unintentional.

Outlier Detection with IQR

Manganese (Mn, %) IQR ≈ 0.48
Shows tight control, ideal for consistent steel quality.

Value Counts and Membership

steel_type

12 unique types
Arm500, St3sp, and 1015 dominate → Ideal for focused analysis.

alloy_type

99% are open systems
Only 1% close alloying → used in precision-critical products.

doc_requirement == 'T1'

Not found — possibly a mistaken label or unused spec.

Ranking for Insights

RUL Rankings

Top performers: Arm500, 1015
Repeated ties → resolved using "average" and "first" ranking methods.

Operational Insight

Batches with top RUL likely underwent optimal processing conditions, like better casting speed, uniform temperatures, or effective alloying.

Changes Over Time: Percent Change, Differences & Rolling Windows

Percent Change in steel_weight, tonn

Detects large deviations (>10%) → suggests process shifts, manual changes, or batch transitions.

Biggest Drop in RUL

Nearly -100% drops in May–June 2020
Indicates sudden degradation or critical failure, likely linked to equipment or sensor faults.

Rolling and Expanding Means

5-point rolling mean of RUL: Smooths noise; ideal for trend detection.
Expanding mean of temperature: Tracks stabilization or process drift over time.

Correlation and Covariance

Pearson Correlation

RUL vs. Temperature → ~0.005
No linear relationship → explore nonlinear interactions.

Chemical Composition Correlations

Ti, % and Ce, %: strong (≈ 0.81)
S, %: weak correlations → likely an independent variable.

Covariance

RUL ⬆ with alloy_speed
RUL ⬇ with swing_frequency
Useful for identifying variable influence direction in modeling.

Dealing with Missing Data

Key Observations

technical_trim and residuals_grab2 → over 90% missing
steel_weight and alloy_speed → complete

Strategies

Drop highly incomplete columns.
Impute with mean(), median(), or fillna().
Use flags like was_trimmed = df['technical_trim, tonn'].notna().astype(int)

Group-Based Summary Stats

Group by steel_type → avg RUL

1015: highest RUL
Arm240: lowest
Use this to optimize material selection and predict reliability.

Group by num_crystallizer → std of alloy speed

Crystallizer 3 & 5 → higher variation → investigate

Final Takeaways

Insight Type	Meaning
Symmetric distribution in steel weight	Stable process
High carbon in 25% of data	Specialized alloys or process drift
Sudden RUL drops	Possible failures or batch issues
Open vs Close alloying	Process control vs precision
Ranking RUL	Highlights optimal configurations
Rolling stats	Reveal trends, reduce noise
Missing data	Must be handled before modeling

Ready to Learn From Code?

Access the full notebook and source code here:
🔗 GitHub Repo

Let’s Connect!

If you enjoyed this and want more tutorials like it, follow me:

🎥 YouTube
👩‍💻 GitHub
💼 LinkedIn
📱 Instagram
📘 Facebook

Thanks so much for dropping by.

Exploring Process Trends with Pandas: Descriptive Statistics and Data Insights from a Manufacturing Dataset

Overview

What You’ll Learn

Descriptive Statistics That Reveal the Hidden Story

Outliers and Extremes

Value Counts and Membership

Ranking for Insights

Changes Over Time: Percent Change, Differences & Rolling Windows

Correlation and Covariance

Dealing with Missing Data

Group-Based Summary Stats

Final Takeaways

Ready to Learn From Code?

Let’s Connect!

Author: MICHELLE.ALZOLA

Leave a Reply Cancel reply

Overview

What You’ll Learn

Descriptive Statistics That Reveal the Hidden Story

Outliers and Extremes

Value Counts and Membership

Ranking for Insights

Changes Over Time: Percent Change, Differences & Rolling Windows

Correlation and Covariance

Dealing with Missing Data

Group-Based Summary Stats

Final Takeaways

Ready to Learn From Code?

Let’s Connect!

Author: MICHELLE.ALZOLA

Related Posts

Leave a Reply Cancel reply