Exploring Process Trends with Pandas: Descriptive Statistics and Data Insights from a Manufacturing Dataset

Overview

This post walks through exploratory data analysis (EDA) with Pandas on a real-world manufacturing dataset from a continuous casting process. It covers descriptive statistics, outlier checks, ranking, percent change and rolling analysis, correlation and covariance, and missing data. It is aimed at analysts, engineers, and data scientists who want to turn Pandas output into meaningful process insights.

Dataset: ccm_rul_dataset.csv
Industry: Steel Manufacturing (Continuous Casting)
Focus: RUL (Remaining Useful Life), composition percentages, temperature, speed, and weight.

What You’ll Learn

  • How to profile large manufacturing datasets using summary statistics.
  • Ways to detect outliers, patterns, and process drifts.
  • How correlation and covariance reveal variable relationships.
  • Why ranking, percent changes, and rolling means help in production monitoring.
  • How missing data affects analysis, and what to do about it.

Descriptive Statistics That Reveal the Hidden Story

Steel Weight (steel_weight, tonn)

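  • Mean ≈ 163.94
  • Median ≈ 163.30
  • IQR ≈ 6.22
    The mean and median nearly coincide, which suggests a roughly symmetric distribution, and the narrow IQR relative to the mean suggests consistent batch weights.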
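
The figures above can be reproduced with a few Series methods. This is a minimal sketch, assuming the column is named `steel_weight, tonn` as in the dataset; the six sample values are made up for illustration and will not give the numbers quoted above.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from ccm_rul_dataset.csv.
df = pd.DataFrame({"steel_weight, tonn": [158.0, 161.5, 163.3, 164.0, 166.2, 170.6]})

weight = df["steel_weight, tonn"]
mean = weight.mean()
median = weight.median()

# IQR = 75th percentile minus 25th percentile.
q1, q3 = weight.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"mean={mean:.2f} median={median:.2f} IQR={iqr:.2f}")

# describe() reports count, mean, std, min, quartiles, and max in one call.
print(weight.describe())
```

A mean close to the median is only a quick symmetry check; `weight.skew()` gives a more direct measure if you need one.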

RUL Range

  • Range (max minus min) of over 8.47 million → huge spread in lifespan values.
    May indicate outliers or long historical tracking.

Technical Trim (technical_trim, tonn)

  • Only 76 valid records (0.43%)
    The column is too sparse to model as a numeric feature, but it can still serve as a binary flag for quality-control events.

Outliers and Extremes

Max Temperature (temperature_measurement1)

  • Occurred on 2020-01-05
  • Can indicate critical process moments or thermal stress.
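
One way to locate the row behind a maximum is `idxmax()`. The sketch below assumes a timestamp column called `date`, which is a hypothetical name, and uses invented temperature readings.

```python
import pandas as pd

# Hypothetical rows; "date" is an assumed column name.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06"]),
    "temperature_measurement1": [1548.0, 1552.0, 1571.0, 1550.0],
})

# idxmax() returns the index label of the row holding the maximum value.
hottest_idx = df["temperature_measurement1"].idxmax()
hottest_row = df.loc[hottest_idx]

print(hottest_row["date"].date(), hottest_row["temperature_measurement1"])
```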

High C, % (Carbon Content)

  • Top 25% (above the 75th percentile) → 4,358 records
    These may reflect specialty steels, or a non-compliance risk if the high carbon is unintentional.
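
Selecting the top quartile comes down to a quantile threshold and a boolean mask. This is a sketch with invented carbon values; whether the original analysis used `>` or `>=` at the threshold is an assumption here.

```python
import pandas as pd

# Hypothetical carbon percentages.
df = pd.DataFrame({"C, %": [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.35, 0.40]})

# 75th percentile of carbon content.
threshold = df["C, %"].quantile(0.75)

# Keep only rows strictly above the threshold (assumed comparison).
high_carbon = df[df["C, %"] > threshold]

print(f"threshold={threshold:.4f}, rows above it: {len(high_carbon)}")
```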

Outlier Detection with IQR

  • Manganese (Mn, %) IQR ≈ 0.48
    Shows tight control, ideal for consistent steel quality.
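
The IQR also gives a common rule of thumb for flagging outliers: anything more than 1.5 × IQR outside the quartiles. The 1.5 multiplier is the conventional choice, not something confirmed by the original notebook, and the manganese values below are invented.

```python
import pandas as pd

# Hypothetical manganese percentages with one obvious outlier.
df = pd.DataFrame({"Mn, %": [0.45, 0.50, 0.52, 0.55, 0.58, 0.60, 0.62, 1.90]})

mn = df["Mn, %"]
q1, q3 = mn.quantile(0.25), mn.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(mn < lower) | (mn > upper)]

print(outliers)
```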

Value Counts and Membership

steel_type

  • 12 unique types
  • Arm500, St3sp, and 1015 dominate → Ideal for focused analysis.

alloy_type

  • 99% are open systems
  • Only 1% close alloying → used in precision-critical products.

doc_requirement == 'T1'

  • Not found; possibly a mistaken label or an unused spec.
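
Counts, shares, and membership checks like the ones above map onto `value_counts()`, `eq()`, and `isin()`. The rows below are invented, and the category labels (`open`, `close`, `T2`, `T3`) are assumptions about how the dataset encodes these columns.

```python
import pandas as pd

# Hypothetical rows; category labels are assumed.
df = pd.DataFrame({
    "steel_type": ["Arm500", "St3sp", "Arm500", "1015", "Arm500", "St3sp"],
    "alloy_type": ["open", "open", "open", "open", "close", "open"],
    "doc_requirement": ["T2", "T2", "T3", "T2", "T3", "T2"],
})

# Frequency of each steel type, most common first.
type_counts = df["steel_type"].value_counts()

# normalize=True returns shares instead of counts.
alloy_share = df["alloy_type"].value_counts(normalize=True)

# Does any row carry the 'T1' requirement?
has_t1 = bool(df["doc_requirement"].eq("T1").any())

# Boolean mask for rows belonging to the dominant steel types.
in_focus = df["steel_type"].isin(["Arm500", "St3sp", "1015"])

print(type_counts, alloy_share, has_t1, in_focus.sum(), sep="\n")
```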

Ranking for Insights

RUL Rankings

  • Top performers: Arm500, 1015
  • Repeated ties → resolved using "average" and "first" ranking methods.
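
The difference between the two tie-breaking methods is easiest to see side by side. A small sketch with invented RUL values:

```python
import pandas as pd

# Hypothetical RUL values with a tie at the top.
df = pd.DataFrame({
    "steel_type": ["Arm500", "1015", "St3sp", "Arm240"],
    "RUL": [900, 900, 400, 100],
})

# "average": tied rows share the mean of the ranks they would occupy.
df["rank_average"] = df["RUL"].rank(method="average", ascending=False)

# "first": ties are broken by order of appearance in the data.
df["rank_first"] = df["RUL"].rank(method="first", ascending=False)

print(df)
```

With `average`, both 900-valued rows get rank 1.5; with `first`, they get 1 and 2 in row order, which is convenient when you need a strict ordering.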

Operational Insight

Batches with top RUL likely underwent optimal processing conditions, like better casting speed, uniform temperatures, or effective alloying.

Changes Over Time: Percent Change, Differences & Rolling Windows

Percent Change in steel_weight, tonn

  • Detects large deviations (>10%) → suggests process shifts, manual changes, or batch transitions.
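
A 10% deviation check can be sketched with `pct_change()` and a mask on its absolute value. The weights are invented, and rows are assumed to be in time order.

```python
import pandas as pd

# Hypothetical consecutive batch weights, assumed to be in time order.
weights = pd.Series([160.0, 162.0, 180.0, 163.0, 164.0], name="steel_weight, tonn")

# Fractional change relative to the previous row; the first row is NaN.
pct = weights.pct_change()

# Keep only moves larger than 10% in either direction.
large_moves = pct[pct.abs() > 0.10]

print(large_moves)
```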

Biggest Drop in RUL

  • Nearly -100% drops in May–June 2020
    May indicate sudden degradation or a critical failure, possibly linked to equipment or sensor faults.
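
To find where the steepest drop happens, `diff()` and `pct_change()` can be combined with `idxmin()`. The sketch assumes RUL is indexed by date; the dates and values are invented.

```python
import pandas as pd

# Hypothetical RUL series indexed by date.
rul = pd.Series(
    [500.0, 480.0, 460.0, 5.0, 450.0],
    index=pd.to_datetime(
        ["2020-05-28", "2020-05-29", "2020-05-30", "2020-05-31", "2020-06-01"]
    ),
    name="RUL",
)

# Absolute and fractional change from the previous observation.
change = rul.diff()
pct = rul.pct_change()

# idxmin() gives the timestamp of the most negative percent change.
worst_day = pct.idxmin()

print(worst_day.date(), change.loc[worst_day], f"{pct.loc[worst_day]:.1%}")
```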

Rolling and Expanding Means

  • 5-point rolling mean of RUL: Smooths noise; ideal for trend detection.
  • Expanding mean of temperature: Tracks stabilization or process drift over time.
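
Both smoothers are one-liners in Pandas. A minimal sketch with invented values, assuming rows are already sorted in time order:

```python
import pandas as pd

# Hypothetical readings, assumed to be in time order.
df = pd.DataFrame({
    "RUL": [10.0, 12.0, 11.0, 13.0, 14.0, 30.0, 15.0],
    "temperature_measurement1": [1550.0, 1552.0, 1549.0, 1551.0, 1553.0, 1550.0, 1552.0],
})

# 5-point rolling mean: NaN until five observations are available.
df["rul_roll5"] = df["RUL"].rolling(window=5).mean()

# Expanding mean: average of everything seen so far.
df["temp_expanding"] = df["temperature_measurement1"].expanding().mean()

print(df)
```

Passing `min_periods=1` to `rolling()` would fill the first four rows with partial-window means instead of NaN, at the cost of noisier early values.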

Correlation and Covariance

Pearson Correlation

  • RUL vs. Temperature → ~0.005
    Essentially no linear relationship → explore nonlinear interactions.

Chemical Composition Correlations

  • Ti, % and Ce, %: strong (≈ 0.81)
  • S, %: weak correlations with the other elements → likely varies independently of them.
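
Pairwise and matrix correlations can both be computed with `corr()`, which uses Pearson by default. The values below are invented and will not reproduce the coefficients quoted above.

```python
import pandas as pd

# Hypothetical readings.
df = pd.DataFrame({
    "RUL": [100, 200, 300, 400, 500],
    "temperature_measurement1": [1550, 1548, 1553, 1549, 1551],
    "Ti, %": [0.01, 0.02, 0.03, 0.04, 0.05],
    "Ce, %": [0.011, 0.019, 0.033, 0.038, 0.052],
})

# Pearson correlation between two columns.
rul_temp = df["RUL"].corr(df["temperature_measurement1"])

# Correlation matrix for a subset of composition columns.
corr_matrix = df[["Ti, %", "Ce, %"]].corr()

print(f"RUL vs temperature: {rul_temp:.3f}")
print(corr_matrix)
```

For monotonic but nonlinear relationships, `corr(method="spearman")` is a reasonable next check.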

Covariance

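  • RUL ⬆ with alloy_speed (positive covariance)
  • RUL ⬇ with swing_frequency (negative covariance)
    The sign shows the direction in which two variables move together, which is useful in modeling. The magnitude depends on each variable's units, so use correlation to compare strength.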
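
The direction of each relationship can be read from the sign of the entries in `DataFrame.cov()`. A sketch with invented values chosen so that the signs match the bullets above:

```python
import pandas as pd

# Hypothetical values: RUL rises with alloy_speed and falls with swing_frequency.
df = pd.DataFrame({
    "RUL": [100.0, 200.0, 300.0, 400.0],
    "alloy_speed": [1.0, 1.2, 1.4, 1.6],
    "swing_frequency": [80.0, 70.0, 60.0, 50.0],
})

# Sample covariance matrix (normalized by N - 1).
cov_matrix = df.cov()

rul_speed = cov_matrix.loc["RUL", "alloy_speed"]
rul_swing = cov_matrix.loc["RUL", "swing_frequency"]

print(f"cov(RUL, alloy_speed)     = {rul_speed:.2f}")
print(f"cov(RUL, swing_frequency) = {rul_swing:.2f}")
```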

Dealing with Missing Data

Key Observations

  • technical_trim and residuals_grab2 → over 90% missing
  • steel_weight and alloy_speed → complete
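
A quick way to get the missing share per column is `isna().mean()`, since the mean of a boolean mask is the fraction of True values. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one sparse and one complete column.
df = pd.DataFrame({
    "technical_trim, tonn": [np.nan, np.nan, 1.2, np.nan],
    "steel_weight, tonn": [160.0, 162.0, 163.0, 165.0],
})

# Fraction of missing values per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)

print(missing_share)
```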

Strategies

  • Drop highly incomplete columns.
  • Impute with fillna(), passing the column's mean() or median() as the fill value.
  • Use flags like was_trimmed = df['technical_trim, tonn'].notna().astype(int)
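
The three strategies can be combined in one pass. In this sketch the 50% drop threshold and the choice of median imputation are assumptions for illustration, not settings taken from the original notebook, and the data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one sparse column, one with a single gap, one complete.
df = pd.DataFrame({
    "technical_trim, tonn": [np.nan, np.nan, np.nan, 1.2, np.nan],
    "temperature_measurement1": [1550.0, np.nan, 1552.0, 1549.0, 1551.0],
    "steel_weight, tonn": [160.0, 162.0, 163.0, 165.0, 161.0],
})

# 1. Capture the presence of a value as a flag before dropping the column.
df["was_trimmed"] = df["technical_trim, tonn"].notna().astype(int)

# 2. Drop columns that are more than 50% missing (assumed threshold).
missing_share = df.isna().mean()
too_sparse = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=too_sparse)

# 3. Impute the remaining gaps with the column median.
temp = cleaned["temperature_measurement1"]
cleaned["temperature_measurement1"] = temp.fillna(temp.median())

print(cleaned)
```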

Group-Based Summary Stats

Group by steel_type → avg RUL

  • 1015: highest RUL
  • Arm240: lowest
    Use this to optimize material selection and predict reliability.

Group by num_crystallizer → std of alloy speed

  • Crystallizer 3 & 5 → higher variation → investigate
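
Both group summaries follow the same `groupby(...)[column].agg()` pattern. The rows below are invented and are arranged only so the ordering resembles the bullets above.

```python
import pandas as pd

# Hypothetical rows.
df = pd.DataFrame({
    "steel_type": ["1015", "1015", "Arm240", "Arm240", "St3sp", "St3sp"],
    "num_crystallizer": [3, 5, 3, 5, 3, 5],
    "RUL": [900.0, 1100.0, 100.0, 300.0, 500.0, 700.0],
    "alloy_speed": [1.0, 1.5, 1.4, 1.5, 1.2, 1.5],
})

# Average RUL per steel type, highest first.
avg_rul = df.groupby("steel_type")["RUL"].mean().sort_values(ascending=False)

# Spread of alloy speed per crystallizer (sample standard deviation).
speed_std = df.groupby("num_crystallizer")["alloy_speed"].std()

print(avg_rul)
print(speed_std)
```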

Final Takeaways

Insight Type | Meaning
Symmetric distribution in steel weight | Stable process
High carbon in 25% of data | Specialized alloys or process drift
Sudden RUL drops | Possible failures or batch issues
Open vs Close alloying | Process control vs precision
Ranking RUL | Highlights optimal configurations
Rolling stats | Reveal trends, reduce noise
Missing data | Must be handled before modeling

Ready to Learn From Code?

Access the full notebook and source code here:
🔗 GitHub Repo
