Friday, April 18, 2025

Data Cleaning Pipeline using Pandas

Machine learning algorithms are not required for every dataset, but every dataset does require some data cleaning. Building a data cleaning pipeline using Pandas is therefore one of those bread-and-butter tasks every data analyst should master early on. In this article, I’ll take you through a tutorial on building a data cleaning pipeline using Pandas.

Data Cleaning Pipeline using Pandas

The dataset I will be using for this tutorial is based on predicting loan approvals. You can download this dataset from here.

Let’s get started with this tutorial by importing the dataset:
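A minimal sketch of loading the data. The file name `loan_approval_dataset.csv` is an assumption (use whatever name you saved the download under); for a self-contained example, two illustrative rows matching the dataset’s columns are inlined:

```python
import io
import pandas as pd

# In practice, read the downloaded CSV (file name is an assumption):
# df = pd.read_csv("loan_approval_dataset.csv")
# For a self-contained sketch, two illustrative rows are inlined here:
raw_csv = io.StringIO(
    "Loan_ID,Gender,Married,Dependents,Education,Self_Employed,"
    "ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,"
    "Credit_History,Property_Area,Loan_Status\n"
    "LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N\n"
    "LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y\n"
)
df = pd.read_csv(raw_csv)
print(df.head())  # first rows of the loaded dataset
```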

The output above is truncated; the dataset contains more columns.

Step 1: Understand the Data

Let’s explore the dataset to understand what needs fixing:
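A sketch of the exploration calls that produce output like the listing below, assuming `df` holds the loaded dataset (a tiny stand-in frame is used here so the snippet runs on its own):

```python
import pandas as pd

# Tiny stand-in for the loaded dataset (illustrative rows only).
df = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003"],
    "Gender": ["Male", None],
    "LoanAmount": [None, 128.0],
})

print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")  # dataset shape
df.info()                          # dtypes and non-null counts
print(df.describe(include="all"))  # stats for numeric and object columns
```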

Rows: 614, Columns: 13
<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            614 non-null    object
 1   Gender             601 non-null    object
 2   Married            611 non-null    object
 3   Dependents         599 non-null    object
 4   Education          614 non-null    object
 5   Self_Employed      582 non-null    object
 6   ApplicantIncome    614 non-null    int64
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object
 12  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 67.2+ KB
         Loan_ID Gender Married Dependents Education Self_Employed  \
count        614    601     611        599       614           582
unique       614      2       2          4         2             2
top     LP002990   Male     Yes          0  Graduate            No
freq           1    489     398        345       480           500
mean         NaN    NaN     NaN        NaN       NaN           NaN
std          NaN    NaN     NaN        NaN       NaN           NaN
min          NaN    NaN     NaN        NaN       NaN           NaN
25%          NaN    NaN     NaN        NaN       NaN           NaN
50%          NaN    NaN     NaN        NaN       NaN           NaN
75%          NaN    NaN     NaN        NaN       NaN           NaN
max          NaN    NaN     NaN        NaN       NaN           NaN

        ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count        614.000000         614.000000  592.000000         600.00000
unique              NaN                NaN         NaN               NaN
top                 NaN                NaN         NaN               NaN
freq                NaN                NaN         NaN               NaN
mean        5403.459283        1621.245798  146.412162         342.00000
std         6109.041673        2926.248369   85.587325          65.12041
min          150.000000           0.000000    9.000000          12.00000
25%         2877.500000           0.000000  100.000000         360.00000
50%         3812.500000        1188.500000  128.000000         360.00000
75%         5795.000000        2297.250000  168.000000         360.00000
max        81000.000000       41667.000000  700.000000         480.00000

        Credit_History Property_Area Loan_Status
count       564.000000           614         614
unique             NaN             3           2
top                NaN     Semiurban           Y
freq               NaN           233         422
mean          0.842199           NaN         NaN
std           0.364878           NaN         NaN
min           0.000000           NaN         NaN
25%           1.000000           NaN         NaN
50%           1.000000           NaN         NaN
75%           1.000000           NaN         NaN
max           1.000000           NaN         NaN

In this step, always look for:

  1. Missing values
  2. Categorical variables
  3. Data types that might be wrong
  4. High cardinality in string columns
  5. Outliers or zero/negative values

Step 2: Handle Missing Values

Next, let’s summarize the missing data:
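A sketch of the call that produces the summary below, using a small illustrative frame so it runs on its own (in the tutorial you would call this on the loaded `df`):

```python
import pandas as pd

# Stand-in frame with some missing values (illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [120.0, None, 128.0],
    "Married": ["Yes", "No", "Yes"],
})

# Count missing values per column, largest first, non-zero only.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])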

Credit_History      50
Self_Employed       32
LoanAmount          22
Dependents          15
Loan_Amount_Term    14
Gender              13
Married              3
dtype: int64

Here’s how to deal with the missing values one by one:
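One plausible sketch of the imputation, using the column names from the dataset above (a small illustrative frame stands in for the full data): mode fills for the categorical-style columns, median fill for `LoanAmount`.

```python
import pandas as pd

# Illustrative frame mirroring the columns with missing values.
df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "Married": ["Yes", "Yes", None, "No"],
    "Dependents": ["0", "1", "0", None],
    "Self_Employed": ["No", None, "No", "Yes"],
    "LoanAmount": [100.0, None, 128.0, 150.0],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0],
    "Credit_History": [1.0, None, 1.0, 0.0],
})

# Categorical-style columns: fill with the most frequent value (mode).
for col in ["Gender", "Married", "Dependents", "Self_Employed",
            "Loan_Amount_Term", "Credit_History"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numeric column: fill with the median, which is robust to outliers.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

print(df.isnull().sum().sum())  # → 0
```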

The median is robust to outliers, and the mode is a natural fill for the majority category in categorical columns.


Step 3: Fix Data Types

Our dataset doesn’t require this step, but sometimes numeric fields are stored as objects, or categorical fields are not cast properly. Here’s how to fix data types when needed:
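A minimal sketch of type fixes on an illustrative frame (the column names follow the dataset above, but these particular type problems are hypothetical):

```python
import pandas as pd

# Illustrative frame where a numeric field arrived as strings.
df = pd.DataFrame({
    "LoanAmount": ["120", "128", "150"],            # numbers stored as text
    "Property_Area": ["Urban", "Rural", "Urban"],
})

# Coerce text numbers to a numeric dtype (invalid entries become NaN).
df["LoanAmount"] = pd.to_numeric(df["LoanAmount"], errors="coerce")

# Cast low-cardinality strings to 'category' to save memory.
df["Property_Area"] = df["Property_Area"].astype("category")

print(df.dtypes)
```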

This helps with memory optimization and model compatibility.


Step 4: Standardize Text Columns

Our dataset doesn’t require this step either, but it is necessary when text columns contain extra spaces or inconsistent casing:
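A sketch of text standardization on an illustrative frame (the messy values here are hypothetical):

```python
import pandas as pd

# Illustrative frame with messy whitespace and casing.
df = pd.DataFrame({"Property_Area": ["  urban", "RURAL ", "semiurban"]})

# Strip surrounding spaces and normalize casing in every text column.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip().str.title()

print(df["Property_Area"].tolist())  # → ['Urban', 'Rural', 'Semiurban']
```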

Step 5: Handle Outliers

This step is optional in the data cleaning part. You can handle outliers after exploring the data in detail, or remove them up front if the problem you are solving is sensitive to outliers.

For numerical columns like LoanAmount or ApplicantIncome, we can cap/floor them based on percentiles:
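A sketch of percentile capping with `clip` on an illustrative income column (the 1st/99th percentile bounds are one common choice, not the only one):

```python
import pandas as pd

# Illustrative column with one extreme value.
df = pd.DataFrame({"ApplicantIncome": [1500, 3000, 4000, 5000, 81000]})

# Cap values outside the 1st-99th percentile range.
low, high = df["ApplicantIncome"].quantile([0.01, 0.99])
df["ApplicantIncome"] = df["ApplicantIncome"].clip(lower=low, upper=high)

print(df["ApplicantIncome"].max())  # the extreme 81000 is capped
```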

Once cleaned, you should save it for further steps:
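Saving is a one-liner with `to_csv`; the output file name here is an assumption:

```python
import pandas as pd

# Illustrative cleaned frame.
df = pd.DataFrame({"Loan_ID": ["LP001002"], "Loan_Status": ["Y"]})

# Persist the cleaned data without the index column (file name is an assumption).
df.to_csv("loan_data_cleaned.csv", index=False)
print(pd.read_csv("loan_data_cleaned.csv").shape)  # round-trip check
```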

Final Pipeline Function

Here’s how to modularize everything into a reusable function:

Here’s how to use this function:
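The steps above can be consolidated into one function. This is a sketch under assumptions: the function name `clean_loan_data` and the file names in the usage comment are mine, and the fill/cap choices generalize the per-column decisions made earlier.

```python
import pandas as pd

def clean_loan_data(df: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning steps above on a raw loan-approval frame."""
    df = df.copy()

    # 1. Fill missing categoricals with the mode, numerics with the median.
    for col in df.select_dtypes(include="object"):
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].mode()[0])
    for col in df.select_dtypes(include="number"):
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())

    # 2. Standardize text columns.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()

    # 3. Cap numeric outliers at the 1st/99th percentiles.
    for col in df.select_dtypes(include="number"):
        low, high = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lower=low, upper=high)

    return df

# Usage (file names are assumptions):
# raw = pd.read_csv("loan_approval_dataset.csv")
# cleaned = clean_loan_data(raw)
# cleaned.to_csv("loan_data_cleaned.csv", index=False)

# Quick check on a tiny illustrative frame:
sample = pd.DataFrame({"Gender": ["Male", None], "LoanAmount": [100.0, None]})
print(clean_loan_data(sample).isnull().sum().sum())  # → 0
```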

So, this is how to build a data cleaning pipeline using Pandas.

Summary

So, even though not all datasets require machine learning, every dataset does demand proper cleaning, and this pipeline sets the foundation for accurate, reliable analysis and modelling. I hope you liked this article on building a data cleaning pipeline using Pandas.
