Friday, April 18, 2025

Data Cleaning Pipeline using Pandas

Machine learning algorithms are not required for every dataset, but every dataset does require some data cleaning. Building a data cleaning pipeline using Pandas is therefore one of those bread-and-butter tasks every data analyst should master early on. In this article, I’ll take you through a tutorial on building a data cleaning pipeline using Pandas.

Data Cleaning Pipeline using Pandas

The dataset I will be using for this tutorial is based on predicting loan approvals. You can download this dataset from here.

Let’s get started with this tutorial by importing the dataset:
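A minimal sketch of loading the data. The file name `loan_approval_dataset.csv` is an assumption (use whatever name you saved the download under); for a self-contained example, two illustrative rows matching the dataset’s columns are inlined:

```python
import io
import pandas as pd

# In practice, read the downloaded CSV (file name is an assumption):
# df = pd.read_csv("loan_approval_dataset.csv")
# For a self-contained sketch, two illustrative rows are inlined here:
raw_csv = io.StringIO(
    "Loan_ID,Gender,Married,Dependents,Education,Self_Employed,"
    "ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,"
    "Credit_History,Property_Area,Loan_Status\n"
    "LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N\n"
    "LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y\n"
)
df = pd.read_csv(raw_csv)
print(df.head())  # first rows of the loaded dataset
```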

The output above is truncated; the dataset contains more columns.

Step 1: Understand the Data

Let’s explore the dataset to understand what needs fixing:
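A sketch of the exploration calls that produce output like the listing below, assuming `df` holds the loaded dataset (a tiny stand-in frame is used here so the snippet runs on its own):

```python
import pandas as pd

# Tiny stand-in for the loaded dataset (illustrative rows only).
df = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003"],
    "Gender": ["Male", None],
    "LoanAmount": [None, 128.0],
})

print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")  # dataset shape
df.info()                          # dtypes and non-null counts
print(df.describe(include="all"))  # stats for numeric and object columns
```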

Rows: 614, Columns: 13
<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            614 non-null    object
 1   Gender             601 non-null    object
 2   Married            611 non-null    object
 3   Dependents         599 non-null    object
 4   Education          614 non-null    object
 5   Self_Employed      582 non-null    object
 6   ApplicantIncome    614 non-null    int64
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object
 12  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 67.2+ KB
         Loan_ID Gender Married Dependents Education Self_Employed  \
count        614    601     611        599       614           582
unique       614      2       2          4         2             2
top     LP002990   Male     Yes          0  Graduate            No
freq           1    489     398        345       480           500
mean         NaN    NaN     NaN        NaN       NaN           NaN
std          NaN    NaN     NaN        NaN       NaN           NaN
min          NaN    NaN     NaN        NaN       NaN           NaN
25%          NaN    NaN     NaN        NaN       NaN           NaN
50%          NaN    NaN     NaN        NaN       NaN           NaN
75%          NaN    NaN     NaN        NaN       NaN           NaN
max          NaN    NaN     NaN        NaN       NaN           NaN

        ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count        614.000000         614.000000  592.000000         600.00000
unique              NaN                NaN         NaN               NaN
top                 NaN                NaN         NaN               NaN
freq                NaN                NaN         NaN               NaN
mean        5403.459283        1621.245798  146.412162         342.00000
std         6109.041673        2926.248369   85.587325          65.12041
min          150.000000           0.000000    9.000000          12.00000
25%         2877.500000           0.000000  100.000000         360.00000
50%         3812.500000        1188.500000  128.000000         360.00000
75%         5795.000000        2297.250000  168.000000         360.00000
max        81000.000000       41667.000000  700.000000         480.00000

        Credit_History Property_Area Loan_Status
count       564.000000           614         614
unique             NaN             3           2
top                NaN     Semiurban           Y
freq               NaN           233         422
mean          0.842199           NaN         NaN
std           0.364878           NaN         NaN
min           0.000000           NaN         NaN
25%           1.000000           NaN         NaN
50%           1.000000           NaN         NaN
75%           1.000000           NaN         NaN
max           1.000000           NaN         NaN

In this step, always look for:

  1. Missing values
  2. Categorical variables
  3. Data types that might be wrong
  4. High cardinality in string columns
  5. Outliers or zero/negative values

Step 2: Handle Missing Values

Next, let’s summarize the missing data:
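A sketch of the call that produces the summary below, using a small illustrative frame so it runs on its own (in the tutorial you would call this on the loaded `df`):

```python
import pandas as pd

# Stand-in frame with some missing values (illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [120.0, None, 128.0],
    "Married": ["Yes", "No", "Yes"],
})

# Count missing values per column, largest first, non-zero only.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])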

Credit_History      50
Self_Employed       32
LoanAmount          22
Dependents          15
Loan_Amount_Term    14
Gender              13
Married              3
dtype: int64

Here’s how to deal with the missing values one by one:
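One plausible sketch of the imputation, using the column names from the dataset above (a small illustrative frame stands in for the full data): mode fills for the categorical-style columns, median fill for `LoanAmount`.

```python
import pandas as pd

# Illustrative frame mirroring the columns with missing values.
df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "Married": ["Yes", "Yes", None, "No"],
    "Dependents": ["0", "1", "0", None],
    "Self_Employed": ["No", None, "No", "Yes"],
    "LoanAmount": [100.0, None, 128.0, 150.0],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0],
    "Credit_History": [1.0, None, 1.0, 0.0],
})

# Categorical-style columns: fill with the most frequent value (mode).
for col in ["Gender", "Married", "Dependents", "Self_Employed",
            "Loan_Amount_Term", "Credit_History"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numeric column: fill with the median, which is robust to outliers.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

print(df.isnull().sum().sum())  # → 0
```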

The median is robust to outliers, and the mode is a natural fill for the majority category in categorical columns.


Step 3: Fix Data Types

Our dataset doesn’t require this step, but sometimes numeric fields are stored as objects, or categorical fields are not cast properly. Here’s how to fix data types when needed:
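A minimal sketch of type fixes on an illustrative frame (the column names follow the dataset above, but these particular type problems are hypothetical):

```python
import pandas as pd

# Illustrative frame where a numeric field arrived as strings.
df = pd.DataFrame({
    "LoanAmount": ["120", "128", "150"],            # numbers stored as text
    "Property_Area": ["Urban", "Rural", "Urban"],
})

# Coerce text numbers to a numeric dtype (invalid entries become NaN).
df["LoanAmount"] = pd.to_numeric(df["LoanAmount"], errors="coerce")

# Cast low-cardinality strings to 'category' to save memory.
df["Property_Area"] = df["Property_Area"].astype("category")

print(df.dtypes)
```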

This helps with memory optimization and model compatibility.


Step 4: Standardize Text Columns

Our dataset doesn’t require this step either, but it is necessary when text columns contain extra spaces or inconsistent casing:
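A sketch of text standardization on an illustrative frame (the messy values here are hypothetical):

```python
import pandas as pd

# Illustrative frame with messy whitespace and casing.
df = pd.DataFrame({"Property_Area": ["  urban", "RURAL ", "semiurban"]})

# Strip surrounding spaces and normalize casing in every text column.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip().str.title()

print(df["Property_Area"].tolist())  # → ['Urban', 'Rural', 'Semiurban']
```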

Step 5: Handle Outliers

This step is optional in the data cleaning part. You can handle outliers after exploring the data in detail, or remove them up front if the problem you are solving is sensitive to outliers.

For numerical columns like LoanAmount or ApplicantIncome, we can cap/floor them based on percentiles:
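A sketch of percentile capping with `clip` on an illustrative income column (the 1st/99th percentile bounds are one common choice, not the only one):

```python
import pandas as pd

# Illustrative column with one extreme value.
df = pd.DataFrame({"ApplicantIncome": [1500, 3000, 4000, 5000, 81000]})

# Cap values outside the 1st-99th percentile range.
low, high = df["ApplicantIncome"].quantile([0.01, 0.99])
df["ApplicantIncome"] = df["ApplicantIncome"].clip(lower=low, upper=high)

print(df["ApplicantIncome"].max())  # the extreme 81000 is capped
```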

Once cleaned, you should save it for further steps:
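Saving is a one-liner with `to_csv`; the output file name here is an assumption:

```python
import pandas as pd

# Illustrative cleaned frame.
df = pd.DataFrame({"Loan_ID": ["LP001002"], "Loan_Status": ["Y"]})

# Persist the cleaned data without the index column (file name is an assumption).
df.to_csv("loan_data_cleaned.csv", index=False)
print(pd.read_csv("loan_data_cleaned.csv").shape)  # round-trip check
```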

Final Pipeline Function

Here’s how to modularize everything into a reusable function:

Here’s how to use this function:
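The steps above can be consolidated into one function. This is a sketch under assumptions: the function name `clean_loan_data` and the file names in the usage comment are mine, and the fill/cap choices generalize the per-column decisions made earlier.

```python
import pandas as pd

def clean_loan_data(df: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning steps above on a raw loan-approval frame."""
    df = df.copy()

    # 1. Fill missing categoricals with the mode, numerics with the median.
    for col in df.select_dtypes(include="object"):
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].mode()[0])
    for col in df.select_dtypes(include="number"):
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())

    # 2. Standardize text columns.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()

    # 3. Cap numeric outliers at the 1st/99th percentiles.
    for col in df.select_dtypes(include="number"):
        low, high = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lower=low, upper=high)

    return df

# Usage (file names are assumptions):
# raw = pd.read_csv("loan_approval_dataset.csv")
# cleaned = clean_loan_data(raw)
# cleaned.to_csv("loan_data_cleaned.csv", index=False)

# Quick check on a tiny illustrative frame:
sample = pd.DataFrame({"Gender": ["Male", None], "LoanAmount": [100.0, None]})
print(clean_loan_data(sample).isnull().sum().sum())  # → 0
```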

So, this is how to build a data cleaning pipeline using Pandas.

Summary

So, even though not all datasets require machine learning, every dataset does demand proper cleaning, and this pipeline sets the foundation for accurate, reliable analysis and modelling. I hope you liked this article on building a data cleaning pipeline using Pandas.
