Data Cleaning and Preparation for Machine Learning

The importance of cleaning data for machine learning, and how to do it yourself

Data cleaning and preparation is a vital, important first step when starting a machine learning project. Scientists spend much of their time cleaning data as opposed to messing around with algorithms and models.

In this article we’ll walk you through data cleaning in Python, examining data sets, selecting columns for features, exploring data visually, and encoding the features for machine learning. Utilizing a Jupyter Notebook in this tutorial is suggested.

For more about data cleaning, check out these courses:

With so much data, you may get overwhelmed. Don’t worry- there are ways to make data cleaning easier.

Understanding the Data

Prior to cleaning data for machine learning, you must understand what your data is and what you want to accomplish with it

Loans that are approved are listed on their website. When an investor decides to fund a loan, the borrower pays monthly installments to Lending Club, where Lending Club then redistributes these to investors. Investors don’t have to wait until the full amount is paid to start seeing returns- a win for everyone!

Some borrowers, however, default on their loan. We’ll address this problem as we clean some data from Lending Club for the purposes of this tutorial on cleaning for machine learning. Pretend we have been asked to build a model to predict whether borrowers are likely to pay off or default their loans.

How to Clean your Data for Machine Learning

Step 1: Examine the Data Set

Lending Club releases data for it’s approved and declined loans on their website. To make sure we are using the same data sets, we have mirrored the data for this tutorial on data.world.

On the Lending Club website you have the option to choose different year ranges to download data sets for all types of loans. There’s a data dictionary on the bottom of the Lending Club page as well, which has information on each different column names. This dictionary is useful for knowing what each column represents. It contains two sheets:

LoanStats sheet: describes the approved loans dataset

RejectStats sheet: describes the rejected loans dataset

We will use the LoanStats sheet for the purpose of this tutorial. The approved loans data has information on current, completed, and defaulted loans. We’ll use the approved loans data from 2007 to 2011, but similar steps would be needed to clean any of the data on their site.

First we import the libraries we’ll be using and set parameters for ease of reading. We’ll assume a solid grasp of the fundamentals of working with data in Python, so if you need a refresh on those skills you can browse our course listings.

import pandas as pd

import numpy as np

pd.set_option('max_columns', 120)

pd.set_option('max_colwidth', 5000)

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

plt.rcParams['figure.figsize'] = (12,8)

Now we’ll load the data into pandas. We’ve named our downloaded data set lending_club_loans.csv. We need to load it into pandas DataFrame in order to explore it. We’ll do some easy cleaning tasks to remove info we don’t need that will slow our process down.

We will:

Remove the first line, as it’s extraneous.
Remove the ‘desc’ column, also extraneous.
Remove the ‘url’ column, that contains a link that can only be accessed with an investor account.
Remove all columns with more than 50% missing values, allowing us to work faster.

Name the filtered data set loans_2007, and at the end of this section save it as loans_2007.csv to separate it from the raw data. This is a good practice to follow, as it keeps your original data in case you need to retrieve something later.

Perform these steps:

# skip row 1 so pandas can parse the data properly.

loans_2007 = pd.read_csv('data/lending_club_loans.csv', skiprows=1, low_memory=False)

half_count = len(loans_2007) / 2

loans_2007 = loans_2007.dropna(thresh=half_count,axis=1) # Drop any column with more than 50% missing values

loans_2007 = loans_2007.drop(['url','desc'],axis=1) # These columns are not useful for our purposes

Use the pandas head() method to display the first three rows of the loans_2007 DataFrame.

loans_2007.head(3)

idmember_idloan_amntfunded_amntfunded_amnt_invtermint_rateinstallmentgradesub_gradeemp_titleemp_lengthhome_ownershipannual_incverification_statusissue_dloan_statuspymnt_planpurposetitlezip_codeaddr_statedtidelinq_2yrsearliest_cr_linefico_range_lowfico_range_highinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_accinitial_list_statusout_prncpout_prncp_invtotal_pymnttotal_pymnt_invtotal_rec_prncptotal_rec_inttotal_rec_late_feerecoveriescollection_recovery_feelast_pymnt_dlast_pymnt_amntlast_credit_pull_dlast_fico_range_highlast_fico_range_lowcollections_12_mths_ex_medpolicy_codeapplication_typeacc_now_delinqchargeoff_within_12_mthsdelinq_amntpub_rec_bankruptciestax_liens010775011296599.05000.05000.04975.036 months10.65%162.87BB2NaN10+ yearsRENT24000.0VerifiedDec-2011Fully Paidncredit_cardComputer860xxAZ27.650.0Jan-1985735.0739.01.03.00.013648.083.7%9.0f0.00.05863.1551875833.845000.00863.160.00.000.00Jan-2015171.62Sep-2016744.0740.00.01.0INDIVIDUAL0.00.00.00.00.0110774301314167.02500.02500.02500.060 months15.27%59.83CC4Ryder< 1 yearRENT30000.0Source VerifiedDec-2011Charged Offncarbike309xxGA1.000.0Apr-1999740.0744.05.03.00.01687.09.4%4.0f0.00.01008.7100001008.71456.46435.170.0117.081.11Apr-2013119.66Sep-2016499.00.00.01.0INDIVIDUAL0.00.00.00.00.0210771751313524.02400.02400.02400.036 months15.96%84.33CC5NaN10+ yearsRENT12252.0Not VerifiedDec-2011Fully Paidnsmall_businessreal estate business606xxIL8.720.0Nov-2001735.0739.02.02.00.02956.098.5%10.0f0.00.03005.6668443005.672400.00605.670.00.000.00Jun-2014649.91Sep-2016719.0715.00.01.0INDIVIDUAL0.00.00.00.00.0

Use pandas .shape attribute to view the number of samples and features we’re dealing with at this stage:

loans_2007.shape(42538, 56)

Step 2: Narrow Down Columns for Cleaning

Our data is now all set up, so it’s time to explore and understand what feature each column represents. Having a bad understanding of these features could cause us to make mistakes down the road, which would impede on our data analysis and modeling process.

We’ll use the LendingClub data dictionary. Create a DataFrame to contain the names of the columns, data type, first row’s values, and description from the dictionary. We’ve pre-converted the data dictionary from Excel format to a CSV. Load it and take a look.

data_dictionary = pd.read_csv('LCDataDictionary.csv') # Loading in the data dictionary print(data_dictionary.shape[0])

print(data_dictionary.columns.tolist())117

['LoanStatNew', 'Description']data_dictionary.head()

data_dictionary = data_dictionary.rename(columns={'LoanStatNew': 'name', 'Description': 'description'})

LoanStatNewDescription0acc_now_delinqThe number of accounts on which the borrower is now delinquent.1acc_open_past_24mthsNumber of trades opened in past 24 months.2addr_stateThe state provided by the borrower in the loan application3all_utilBalance to credit limit on all trades4annual_incThe self-reported annual income provided by the borrower during registration.

Join the first row of loans_2007 to the data_dictionary DataFrame to give a preview DataFrame with the following columns:

name — contains the column names of loans_2007.
dtypes — contains the data types of the loans_2007 columns.
first value — contains the values of loans_2007 first row.
description — explains what each column in loans_2007 represents.

loans_2007_dtypes = pd.DataFrame(loans_2007.dtypes,columns=['dtypes'])

loans_2007_dtypes = loans_2007_dtypes.reset_index()

loans_2007_dtypes['name'] = loans_2007_dtypes['index']

loans_2007_dtypes = loans_2007_dtypes[['name','dtypes']]

loans_2007_dtypes['first value'] = loans_2007.loc[0].values

preview = loans_2007_dtypes.merge(data_dictionary, on='name',how='left')preview.head()

When we printed the shape of loans_2007, it had 56 columns, so we know that this preview DataFrame has 56 rows, one explaining each column in loans_2007.

It can be a lot to try to explore all the rows at once. We will break it into three parts and look at smaller parts each time. We want to pay attention to any column that:

Leaks info from the future, after the loan has been funded
Doesn’t affect the borrower’s ability to pay back the loan
Is formatted poorly
Requires more data or a lot of pre-processing to be useful
Contains redundant info

We need to focus on data leakage, which can make the model overfit. The model would be learning from features that won’t be available when we make future loan predictions, so we need to make sure our model is trained using only available data from the point of a loan application.

Now we’ll display the first 19 rows and analyze them.

preview[:19]

namedtypesfirst valuedescription0idobject1077501A unique LC assigned ID for the loan listing.1member_idfloat641.2966e+06A unique LC assigned Id for the borrower member.2loan_amntfloat645000The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.3funded_amntfloat645000The total amount committed to that loan at that point in time.4funded_amnt_invfloat644975The total amount committed by investors for that loan at that point in time.5termobject36 monthsThe number of payments on the loan. Values are in months and can be either 36 or 60.6int_rateobject10.65%Interest Rate on the loan7installmentfloat64162.87The monthly payment owed by the borrower if the loan originates.8gradeobjectBLC assigned loan grade9sub_gradeobjectB2LC assigned loan subgrade10emp_titleobjectNaNThe job title supplied by the Borrower when applying for the loan.*11emp_lengthobject10+ yearsEmployment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.12home_ownershipobjectRENTThe home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.13annual_incfloat6424000The self-reported annual income provided by the borrower during registration.14verification_statusobjectVerifiedIndicates if income was verified by LC, not verified, or if the income source was verified15issue_dobjectDec-2011The month which the loan was funded16loan_statusobjectFully PaidCurrent status of the loan17pymnt_planobjectnIndicates if a payment plan has been put in place for the loan18purposeobjectcredit_cardA category provided by the borrower for the loan request.

We can conclude from analyzing this that the following things must be removed:

id — randomly generated field by LendingClub for unique identification purposes only.
member_id — randomly generated field by LendingClub for identification purposes only.
funded_amnt — leaks information from the future
funded_amnt_inv — leaks data from the future.
sub_grade — contains redundant information that is already in the grade column
int_rate — also included within the grade column.
emp_title — requires other data and a lot of processing to become potentially useful
issued_d — leaks data from the future.

Lending Club uses a borrower’s grade and payment term to assess interest rates. This causes variations in interest rate within a grade.

For our model it’s useful to focus on clusters of borrowers over individuals. That’s what grading does- it segments borrowers based on behaviors, which is why we’ll keep the grade column and drop interest in int_rate and sub_grade. Drop these from the DatFrame before moving to the next group of columns.

The next 19 columns:

preview[19:38]

namedtypesfirst valuedescription19titleobjectComputerThe loan title provided by the borrower20zip_codeobject860xxThe first 3 numbers of the zip code provided by the borrower in the loan application.21addr_stateobjectAZThe state provided by the borrower in the loan application22dtifloat6427.65A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.23delinq_2yrsfloat640The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years24earliest_cr_lineobjectJan-1985The month the borrower’s earliest reported credit line was opened25fico_range_lowfloat64735The lower boundary range the borrower’s FICO at loan origination belongs to.26fico_range_highfloat64739The upper boundary range the borrower’s FICO at loan origination belongs to.27inq_last_6mthsfloat641The number of inquiries in past 6 months (excluding auto and mortgage inquiries)28open_accfloat643The number of open credit lines in the borrower’s credit file.29pub_recfloat640Number of derogatory public records30revol_balfloat6413648Total credit revolving balance31revol_utilobject83.7%Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.32total_accfloat649The total number of credit lines currently in the borrower’s credit file33initial_list_statusobjectfThe initial listing status of the loan. Possible values are – W, F34out_prncpfloat640Remaining outstanding principal for total amount funded35out_prncp_invfloat640Remaining outstanding principal for portion of total amount funded by investors36total_pymntfloat645863.16Payments received to date for total amount funded37total_pymnt_invfloat645833.84Payments received to date for portion of total amount funded by investors

Take note of the fico_range_low and fico_range_high columns. We’ll talk more about them after looking at the last column group. If you’re working with newer data, it may not include FICO score data.

For this second group, we can refine by dropping these columns:

zip_code – mostly redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible.
out_prncp – leaks data from the future.
out_prncp_inv – also leaks data from the future.
total_pymnt – also leaks data from the future.
total_pymnt_inv – also leaks data from the future.

Remove these 5 columns from the DataFrame:

drop_cols = [ 'zip_code','out_prncp','out_prncp_inv',

'total_pymnt','total_pymnt_inv']

loans_2007 = loans_2007.drop(drop_cols, axis=1)

Now onto the third group.

preview[38:]

namedtypesfirst valuedescription38total_rec_prncpfloat645000Principal received to date39total_rec_intfloat64863.16Interest received to date40total_rec_late_feefloat640Late fees received to date41recoveriesfloat640post charge off gross recovery42collection_recovery_feefloat640post charge off collection fee43last_pymnt_dobjectJan-2015Last month payment was received44last_pymnt_amntfloat64171.62Last total payment amount received45last_credit_pull_dobjectSep-2016The most recent month LC pulled credit for this loan46last_fico_range_highfloat64744The upper boundary range the borrower’s last FICO pulled belongs to.47last_fico_range_lowfloat64740The lower boundary range the borrower’s last FICO pulled belongs to.48collections_12_mths_ex_medfloat640Number of collections in 12 months excluding medical collections49policy_codefloat641publicly available policy_code=1\nnew products not publicly available policy_code=250application_typeobjectINDIVIDUALIndicates whether the loan is an individual application or a joint application with two co-borrowers51acc_now_delinqfloat640The number of accounts on which the borrower is now delinquent.52chargeoff_within_12_mthsfloat640Number of charge-offs within 12 months53delinq_amntfloat640The past-due amount owed for the accounts on which the borrower is now delinquent.54pub_rec_bankruptciesfloat640Number of public record bankruptcies55tax_liensfloat640Number of tax liens

In this last group of columns, we need to drop the following, all of which leak data from the future:

total_rec_prncp
total_rec_int
total_rec_late_fee
recoveries
collection_recovery_fee
last_pymnt_d
last_pymnt_amnt

Drop our last group of columns:

drop_cols = ['total_rec_prncp','total_rec_int',

'total_rec_late_fee','recoveries',

'collection_recovery_fee', 'last_pymnt_d'

'last_pymnt_amnt']

loans_2007 = loans_2007.drop(drop_cols, axis=1)

Now we have a cleaner, much more useful data set for the building of our model. All irrelevant data has been cleansed out of the system, which will allow it to run smoother.

Investigating FICO Score Columns

We will now discuss the fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns so you can better understand them.

FICO scores are credit scores, a number used by banks and credit cards to represent how credit-worthy someone is. The FICO score is the most used credit score in the US. When a borrower applies for a loan, LendingClub gets the borrower’s credit score from FICO and they store those values as fico_range_low, fico_range_high. After that, any updates to the borrowers score are recorded as last_fico_range_low, and last_fico_range_high.

Understanding the data you’re using is so important. The original author of this article found a Stanford University project in 2014 while researching this data set. In the project, the group listed the current credit score among late fees and recovery fees as fields they added to the features by mistake. They claimed they later learned these columns leak future information.

Another group from Stanford used our same Lending Club dataset. Using the FICO columns, the group dropped only last_fico_range_low in their model. The second group’s findings reported that the last_fico_range_high was one of the more important features in finding accurate results.

So that leads us to the question- do the FICO credit scores leak information from the future? A column leaks information when the data it holds isn’t available when using the model to predict. In this specific case, that would be when we use the model on future loan applications to try to predict borrower behavior.

For more in-depth information on FICO scores for LendingClub loans, check out this article. The post notes that while noticing the trend of FICO scores is a good predictor of whether a loan will default, LendingClub continues to update FICO scores after a loan is funded. We can use the first FICO scores (fico_range_low and fico_range_high) but we cannot use last_fico_range_low and last_fico_range_high, because LendingClub may update these after the borrower submits their application.

These are the values in the two columns we can use:

print(loans_2007['fico_range_low'].unique())

print(loans_2007['fico_range_high'].unique())[ 735. 740. 690. 695. 730. 660. 675. 725. 710. 705. 720. 665. 670. 760. 685. 755. 680. 700. 790. 750. 715. 765. 745. 770. 780. 775. 795. 810. 800. 815. 785. 805. 825. 820. 630. 625. nan 650. 655. 645. 640. 635. 610. 620. 615.]

[ 739. 744. 694. 699. 734. 664. 679. 729. 714. 709. 724. 669. 674. 764. 689. 759. 684. 704. 794. 754. 719. 769. 749. 774. 784. 779. 799. 814. 804. 819. 789. 809. 829. 824. 634. 629. nan 654. 659. 649. 644. 639. 614. 624. 619.]

Let’s get rid of the missing values.

fico_columns = ['fico_range_high','fico_range_low']

print(loans_2007.shape[0])

loans_2007.dropna(subset=fico_columns,inplace=True)

print(loans_2007.shape[0])

loans_2007[fico_columns].plot.hist(alpha=0.5,bins=20);42538

42535

Now create a column for the average of fico_range_low and fico_range_high columns. Name it fico_average. This is not the average FICO score for each borrower, but rather an average of the high and low range that we know the borrower is in.

loans_2007['fico_average'] = (loans_2007['fico_range_high'] + loans_2007['fico_range_low']) / 2

Let’s check what we just did.

cols = ['fico_range_low','fico_range_high','fico_average']

loans_2007[cols].head()

fico_range_lowfico_range_highfico_average0735.0739.0737.01740.0744.0742.02735.0739.0737.03690.0694.0692.04695.0699.0697.0

Good! We got the mean calculations and everything right. Now, we can go ahead and drop fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns.

drop_cols = ['fico_range_low','fico_range_high','last_fico_range_low', 'last_fico_range_high']

loans_2007 = loans_2007.drop(drop_cols, axis=1)

loans_2007.shape(42535, 33)

By analyzing this data and the columns, we have reduced the number of columns without losing any important data for our model. This is why data cleaning is so important.

Pick a Target Column

We must decide on the right column to use as a target column. Our goal is to predict who will pay off their loan and who will default, and we need to find a column that reflects this. We learned from the description of columns in the preview DataFrame that loan_status is the only field in the main data set that describes a loan status, so we’ll use this column as the target column.

preview[preview.name == 'loan_status']

namedtypesfirst valuedescription16loan_statusobjectFully PaidCurrent status of the loan

This column contains text values that need to be converted to numerical values to be able to use for training a model. We’ll use the DataFrame method value_counts() to return the frequency of the unique values in the loan_status column.

loans_2007["loan_status"].value_counts()Fully Paid 33586

Charged Off 5653

Does not meet the credit policy. Status:Fully Paid 1988

Does not meet the credit policy. Status:Charged Off 761

Current 513

In Grace Period 16

Late (31-120 days) 12

Late (16-30 days) 5

Default 1

Name: loan_status, dtype: int64

The loan status has nine different possible values. We can read about most of the different loan statuses on the LendingClub website as well as posts on the Lend Academy and Orchard forums.

We’ll pull that data together in a table below so we can see the unique values, their frequency in the data set, and get a clearer idea of what each means:

meaning = [

"Loan has been fully paid off.",

"Loan for which there is no longer a reasonable expectation of further payments.",

"While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",

"While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",

"Loan is up to date on current payments.",

"The loan is past due but still in the grace period of 15 days.",

"Loan hasn't been paid in 31 to 120 days (late on the current payment).",

"Loan hasn't been paid in 16 to 30 days (late on the current payment).",

"Loan is defaulted on and no payment has been made for more than 121 days."]

status, count = loans_2007["loan_status"].value_counts().index, loans_2007["loan_status"].value_counts().values

loan_statuses_explanation = pd.DataFrame({'Loan Status': status,'Count': count,'Meaning': meaning})[['Loan Status','Count','Meaning']]

loan_statuses_explanation

Loan StatusCountMeaning0Fully Paid33586Loan has been fully paid off.1Charged Off5653Loan for which there is no longer a reasonable expectation of further payments.2Does not meet the credit policy. Status:Fully Paid1988While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.3Does not meet the credit policy. Status:Charged Off761While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.4Current513Loan is up to date on current payments.5In Grace Period16The loan is past due but still in the grace period of 15 days.6Late (31-120 days)12Loan hasn’t been paid in 31 to 120 days (late on the current payment).7Late (16-30 days)5Loan hasn’t been paid in 16 to 30 days (late on the current payment).8Default1Loan is defaulted on and no payment has been made for more than 121 days.

Our goal is to create a machine learning model that can learn from past loans and predict what loans will be paid off or not. In the above table, solely the Fully Paid and Charged Off values describe the final loan outcome. The other values describe still ongoing loans.

The Default status resembles the Charged Off status but to LendingClub, loans that are charged off have no chance of being repaid. Defaulted loans have a smaller chance. We should use only samples where the loan_status column is 'Fully Paid' or 'Charged Off'.

We want to be able to predict which of 'Fully Paid' or 'Charged Off' a loan will fall under so we can treat this as binary classification. Remove all the loans that don’t contain 'Fully Paid' or 'Charged Off' as the loan’s status and change the 'Fully Paid' values to 1 for positive cases and the 'Charged Off' values to 0 for negative cases.

Out of the ~42,000 rows we’ll thus remove just over 3,000.

There are multiple ways to transform all of the values in a column. We’ll use the DataFrame method replace().

loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |

(loans_2007["loan_status"] == "Charged Off")]

mapping_dictionary = {"loan_status":{ "Fully Paid": 1, "Charged Off": 0}}

loans_2007 = loans_2007.replace(mapping_dictionary)

Visualize the Target Column Outcomes

fig, axs = plt.subplots(1,2,figsize=(14,7))

sns.countplot(x='loan_status',data=filtered_loans,ax=axs[0])

axs[0].set_title("Frequency of each Loan Status")

filtered_loans.loan_status.value_counts().plot(x=None,y=None, kind='pie', ax=axs[1],autopct='%1.2f%%')

axs[1].set_title("Percentage of each Loan status")

plt.show()

These plots indicate that a significant number of borrowers in our data set paid off their loan. We are more interested in identifying these “defaulters,” as in this model we’re trying to maximize investment returns.

Not lending to people would aid in increasing returns, so we’ll clean our data while focusing on building a model that can identify likely defaulters at the application point.

Remove the Columns with Only One Value

Finally, let’s find columns that have only one unique value and remove them. These columns won’t be useful, as they don’t add any information. This will make the next steps easier as well.

The pandas Series method nunique() returns the number of unique values, excluding any null values. We can use this method across the data set to remove these columns in one step.

loans_2007 = loans_2007.loc[:,loans_2007.apply(pd.Series.nunique) != 1]

There may be some columns with more than one unique value, but one value that has insignificant frequency in the data set. Find and drop any columns with unique values that show up less than four times:

for col in loans_2007.columns:

if (len(loans_2007[col].unique()) < 4):

print(loans_2007[col].value_counts())

print()36 months 29096

60 months 10143

Name: term, dtype: int64

Not Verified 16845

Verified 12526

Source Verified 9868

Name: verification_status, dtype: int64

1 33586

0 5653

Name: loan_status, dtype: int64

n 39238

y 1

Name: pymnt_plan, dtype: int64

The payment plan column (pymnt_plan) has two unique values, 'y' and 'n', with 'y' occurring only once. Let’s drop this column:

print(loans_2007.shape[1])

loans_2007 = loans_2007.drop('pymnt_plan', axis=1)

print("We've been able to reduce the features to => {}".format(loans_2007.shape[1]))25

We've been able to reduce the features to => 24

Lastly, let’s use pandas to save our freshly-cleaned DataFrame as a CSV file:

loans_2007.to_csv("processed_data/filtered_loans_2007.csv",index=False)

Our data set is now much cleaner and easier to work with. This isn’t the final step though!

Step 3: Preparing the Features for Machine Learning

Now we’ll prepare the filtered_loans_2006.csv data for machine learning. We will handle missing values, convert categorical columns to numeric columns, and remove any other columns we don’t need.

Before feeding our data into a machine learning algorithm, we must handle missing values and categorical features. The mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression.

Here’s what we’ll be doing in this step:

Handle Missing Values
Investigate Categorical Columns
Convert Categorical Columns To Numeric Features
Map Ordinal Values To Integers
Encode Nominal Values As Dummy Variables

First we load in our data from last section’s final output:

filtered_loans = pd.read_csv('processed_data/filtered_loans_2007.csv')

print(filtered_loans.shape)

filtered_loans.head()(39239, 24)

loan_amntterminstallmentgradeemp_lengthhome_ownershipannual_incverification_statusloan_statuspurposetitleaddr_statedtidelinq_2yrsearliest_cr_lineinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_acclast_credit_pull_dpub_rec_bankruptciesfico_average05000.036 months162.87B10+ yearsRENT24000.0Verified1credit_cardComputerAZ27.650.0Jan-19851.03.00.013648.083.7%9.0Sep-20160.0737.012500.060 months59.83C< 1 yearRENT30000.0Source Verified0carbikeGA1.000.0Apr-19995.03.00.01687.09.4%4.0Sep-20160.0742.022400.036 months84.33C10+ yearsRENT12252.0Not Verified1small_businessreal estate businessIL8.720.0Nov-20012.02.00.02956.098.5%10.0Sep-20160.0737.0310000.036 months339.31C10+ yearsRENT49200.0Source Verified1otherpersonelCA20.000.0Feb-19961.010.00.05598.021%37.0Apr-20160.0692.045000.036 months156.46A3 yearsRENT36000.0Source Verified1weddingMy wedding loan I promise to pay backAZ11.200.0Nov-20043.09.00.07963.028.3%12.0Jan-20160.0732.0

Handle Missing Values

We will compute the number of missing values and determine how to handle them. We can return the number of missing values across the DataFrame by doing this:

Use the Pandas DataFrame method isnull() to return a DataFrame containing Boolean values:
True if the original value is null
False if the original value isn’t null
Use the Pandas DataFrame method sum() to calculate the number of null values in each column.

null_counts = filtered_loans.isnull().sum()

print("Number of null values in each column:\n{}".format(null_counts))Number of null values in each column:

loan_amnt 0

term 0

installment 0

grade 0

emp_length 0

home_ownership 0

annual_inc 0

verification_status 0

loan_status 0

purpose 0

title 10

addr_state 0

dti 0

delinq_2yrs 0

earliest_cr_line 0

inq_last_6mths 0

open_acc 0

pub_rec 0

revol_bal 0

revol_util 50

total_acc 0

last_credit_pull_d 2

pub_rec_bankruptcies 697

fico_average 0

dtype: int64

While most of the columns have 0 missing values, title has 9 missing values, revol_util has 48, and pub_rec_bankruptcies contains 675 rows with missing values.

Remove columns entirely where more than 1% (392) of the rows for that column contain a null value. We’ll also remove the remaining rows containing null values. We’ll lose some data by doing this but we’ll be able to keep extra features to use for prediction (as we won’t have to drop those columns).

Keep the title and revol_util columns, removing rows containing missing values. Drop the pub_rec_bankruptcies column since more than 1% of the rows have a missing value for this column.

Here’s what to do:

Use the drop method to remove the pub_rec_bankruptcies column from filtered_loans.
Use the dropna method to remove all rows from filtered_loans containing any missing values.

In code, that looks like:

filtered_loans = filtered_loans.drop("pub_rec_bankruptcies",axis=1)

filtered_loans = filtered_loans.dropna()

There are, of course, many ways to deal with missing values. This is a vital step in data cleaning for machine learning. For our purposes in this tutorial we are set with this step, but if you want to learn more about handling missing values when cleaning data try our Data Cleaning Advanced course for Python, linked at the top of this article.

Investigate Categorical Columns

We want to end here with a data set ready for machine learning. This data must have no missing values and all values in columns must be numeric. We’ve already handled the missing values, so now we must find the number of columns that are of the object data type so we can make those values numeric.

print("Data types and their frequency\n{}".format(filtered_loans.dtypes.value_counts()))Data types and their frequency

float64 11

object 11

int64 1

dtype: int64

We have 11 object columns that contain text that must be converted into numeric features. Select solely the object columns using the DataFrame method select_dtype. Display a sample row to get a better sense of how the values in each column are formatted.

object_columns_df = filtered_loans.select_dtypes(include=['object'])

print(object_columns_df.iloc[0])term 36 months

grade B

emp_length 10+ years

home_ownership RENT

verification_status Verified

purpose credit_card

title Computer

addr_state AZ

earliest_cr_line Jan-1985

revol_util 83.7%

last_credit_pull_d Sep-2016

Name: 0, dtype: object

Note that the revol_util column contains numeric values, but is formatted as an object column. We learned from the description of columns in the preview DataFrame that revol_util is a “revolving line utilization rate or the amount of credit the borrower is using relative to all available credit” (read more here). We must format revol_util as a numeric value. Here’s what to do:

Use the str.rstrip() string method to strip the right trailing percent sign (%).
On the resulting Series object, use the astype() method to convert to the type float.
Assign the new Series of float values back to the revol_util column in the filtered_loans.

filtered_loans['revol_util'] = filtered_loans['revol_util'].str.rstrip('%').astype('float')

Moving on, these columns seem to represent categorical values:

home_ownership — home ownership status, can only be 1 of 4 categorical values according to the data dictionary.
verification_status — indicates if income was verified by LendingClub.
emp_length — number of years the borrower was employed upon time of application.
term — number of payments on the loan, either 36 or 60.
addr_state — borrower’s state of residence.
grade — LC assigned loan grade based on credit score.
purpose — a category provided by the borrower for the loan request.
title — loan title provided the borrower.

To check, confirm by checking the number of unique values in each of them. Based on the first row’s values for purpose and title, it seems these two columns show the same information. We must explore their unique value counts to prove if this is true.

Note the first row’s values for both earliest_cr_line and last_credit_pull_d columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

earliest_cr_line — The month the borrower’s earliest reported credit line was opened.
last_credit_pull_d — The most recent month LendingClub pulled credit for this loan.

For some analyses, doing this feature engineering might be worth the time, but for the purposes of this tutorial we’ll remove these date columns from the DataFrame.

First, let’s explore the unique value counts of the six columns that seem to contain categorical values:

cols = ['home_ownership', 'grade','verification_status', 'emp_length', 'term', 'addr_state']

for name in cols:

print(name,':')

print(object_columns_df[name].value_counts(),'\n')home_ownership :

RENT 18677

MORTGAGE 17381

OWN 3020

OTHER 96

NONE 3

Name: home_ownership, dtype: int64

grade :

B 11873

A 10062

C 7970

D 5194

E 2760

F 1009

G 309

Name: grade, dtype: int64

verification_status :

Not Verified 16809

Verified 12515

Source Verified 9853

Name: verification_status, dtype: int64

emp_length :

10+ years 8715

< 1 year 4542

2 years 4344

3 years 4050

4 years 3385

5 years 3243

1 year 3207

6 years 2198

7 years 1738

8 years 1457

9 years 1245

n/a 1053

Name: emp_length, dtype: int64

term :

36 months 29041

60 months 10136

Name: term, dtype: int64

addr_state :

CA 7019

NY 3757

FL 2831

TX 2693

NJ 1825

IL 1513

PA 1493

VA 1388

GA 1381

MA 1322

OH 1197

MD 1039

AZ 863

WA 830

CO 777

NC 772

CT 738

MI 718

MO 677

MN 608

NV 488

SC 469

WI 447

OR 441

AL 441

LA 432

KY 319

OK 294

KS 264

UT 255

AR 241

DC 211

RI 197

NM 187

WV 174

HI 170

NH 169

DE 113

MT 84

WY 83

AK 79

SD 61

VT 53

MS 19

TN 17

IN 9

ID 6

IA 5

NE 5

ME 3

Name: addr_state, dtype: int64

Most of these columns contain discrete categorical values we can encode as dummy variables and keep. The addr_state column, in contrast, has too many unique values, so it’s better for our purposes to lose it.

Next let’s look at the unique value counts for the purpose and title columns to decide which columns we should keep.

for name in ['purpose','title']:

print("Unique Values in column: {}\n".format(name))

print(filtered_loans[name].value_counts(),'\n')Unique Values in column: purpose

debt_consolidation 18355

credit_card 5073

other 3921

home_improvement 2944

major_purchase 2178

small_business 1792

car 1534

wedding 940

medical 688

moving 580

vacation 377

house 372

educational 320

renewable_energy 103

Name: purpose, dtype: int64

Unique Values in column: title

Debt Consolidation 2142

Debt Consolidation Loan 1670

Personal Loan 650

Consolidation 501

debt consolidation 495

Credit Card Consolidation 354

Home Improvement 350

Debt consolidation 331

Small Business Loan 317

Credit Card Loan 310

Personal 306

Consolidation Loan 255

Home Improvement Loan 243

personal loan 231

personal 217

Loan 210

Wedding Loan 206

Car Loan 198

consolidation 197

Other Loan 187

Credit Card Payoff 153

Wedding 152

Major Purchase Loan 144

Credit Card Refinance 143

Consolidate 126

Medical 120

Credit Card 115

home improvement 109

My Loan 94

Credit Cards 92

...

toddandkim4ever 1

Remainder of down payment 1

Building a Financial Future 1

Higher interest payoff 1

Chase Home Improvement Loan 1

Sprinter Purchase 1

Refi credit card-great payment record 1

Karen's Freedom Loan 1

Business relocation and partner buyout 1

Update My New House 1

tito 1

florida vacation 1

Back to 0 1

Bye Bye credit card 1

britschool 1

Consolidation 16X60 1

Last Call 1

Want to be debt free in "3" 1

for excellent credit 1

loaney 1

jamal's loan 1

Refying Lending Club-I LOVE THIS PLACE! 1

Consoliation Loan 1

Personal/ Consolidation 1

Pauls Car 1

Road to freedom loan 1

Pay it off FINALLY! 1

MASH consolidation 1

Destination Wedding 1

Store Charge Card 1

Name: title, dtype: int64

The purpose and title columns contain overlapping information, but the purpose column contains fewer discrete values and is cleaner, so we’ll keep that column and drop the title column.

Drop the columns we’ve decided not to keep so far:

drop_cols = ['last_credit_pull_d','addr_state','title','earliest_cr_line']

filtered_loans = filtered_loans.drop(drop_cols,axis=1)

Convert Categorical Columns to Numeric Features

First we need to understand the two types of categorical features in our dataset and how we can convert them to numerical features.

Ordinal values: these categorical values are in natural order. We can sort and/or order them in increasing or decreasing order. For example, we learned earlier that LendingClub grades loan applicants from A to , and assigns each applicant a corresponding interest rate. Grade A is least risky, grade B is riskier than A, and so on:

A < B < C < D < E < F < G ; where < means less risky than

Nominal Values: these are regular categorical values. You can’t order nominal values. For instance, while we can order loan applicants in the employment length column (emp_length) based on years spent in the workforce:

Year 1 < year 2 < year 3 . . . < year N

We cannot do that with the column purpose. It wouldn’t make sense to say:

Car < wedding < education < moving < house

Now in our dataset we have these columns:

Ordinal Values
grade
emp_length
Nominal Values _ home_ownership
verification_status
purpose
Term

There are different ways to handle these two types. To map the ordinal values to integers, we can use the pandas DataFrame method replace() to map both grade and emp_length to appropriate numeric values:

mapping_dict = {

"emp_length": {

"10+ years": 10,

"9 years": 9,

"8 years": 8,

"7 years": 7,

"6 years": 6,

"5 years": 5,

"4 years": 4,

"3 years": 3,

"2 years": 2,

"1 year": 1,

"< 1 year": 0,

"n/a": 0

"grade":{

"A": 1,

"B": 2,

"C": 3,

"D": 4,

"E": 5,

"F": 6,

"G": 7

}

filtered_loans = filtered_loans.replace(mapping_dict)

filtered_loans[['emp_length','grade']].head()

Emp_lengthgrade010210321033103431

Now let’s move on to the Nominal Values. Converting nominal features into numerical features requires encoding them as dummy variables. This will look like:

Use pandas’ get_dummies() method to return a new DataFrame containing a new column for each dummy variable.
Use the concat() method to add these dummy columns back to the original DataFrame.
Drop the original columns entirely using the drop method.

Now encode the nominal columns that we have in our data set:

nominal_columns = ["home_ownership", "verification_status", "purpose", "term"]

dummy_df = pd.get_dummies(filtered_loans[nominal_columns])

filtered_loans = pd.concat([filtered_loans, dummy_df], axis=1)

filtered_loans = filtered_loans.drop(nominal_columns, axis=1)filtered_loans.head()

loan_amntinstallmentgradeemp_lengthannual_incloan_statusdtidelinq_2yrsinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_accfico_averagehome_ownership_MORTGAGEhome_ownership_NONEhome_ownership_OTHERhome_ownership_OWNhome_ownership_RENTverification_status_Not Verifiedverification_status_Source Verifiedverification_status_Verifiedpurpose_carpurpose_credit_cardpurpose_debt_consolidationpurpose_educationalpurpose_home_improvementpurpose_housepurpose_major_purchasepurpose_medicalpurpose_movingpurpose_otherpurpose_renewable_energypurpose_small_businesspurpose_vacationpurpose_weddingterm_ 36 monthsterm_ 60 months05000.0162.8721024000.0127.650.01.03.00.013648.083.79.0737.000001001010000000000001012500.059.833030000.001.000.05.03.00.01687.09.44.0742.000001010100000000000000122400.084.3331012252.018.720.02.02.00.02956.098.510.0737.0000011000000000000010010310000.0339.3131049200.0120.000.01.010.00.05598.021.037.0692.000001010000000000100001045000.0156.461336000.0111.200.03.09.00.07963.028.312.0732.0000010100000000000000110

Finally, let’s look at our final output from this step to make sure all the features are the same length, contain no null value, and are numerical. We’ll use pandas’s info method to inspect the filtered_loans DataFrame:

filtered_loans.info()<class 'pandas.core.frame.dataframe'="">

Int64Index: 39177 entries, 0 to 39238

Data columns (total 39 columns):

loan_amnt 39177 non-null float64

installment 39177 non-null float64

grade 39177 non-null int64

emp_length 39177 non-null int64

annual_inc 39177 non-null float64

loan_status 39177 non-null int64

dti 39177 non-null float64

delinq_2yrs 39177 non-null float64

inq_last_6mths 39177 non-null float64

open_acc 39177 non-null float64

pub_rec 39177 non-null float64

revol_bal 39177 non-null float64

revol_util 39177 non-null float64

total_acc 39177 non-null float64

fico_average 39177 non-null float64

home_ownership_MORTGAGE 39177 non-null uint8

home_ownership_NONE 39177 non-null uint8

home_ownership_OTHER 39177 non-null uint8

home_ownership_OWN 39177 non-null uint8

home_ownership_RENT 39177 non-null uint8

verification_status_Not Verified 39177 non-null uint8

verification_status_Source Verified 39177 non-null uint8

verification_status_Verified 39177 non-null uint8

purpose_car 39177 non-null uint8

purpose_credit_card 39177 non-null uint8

purpose_debt_consolidation 39177 non-null uint8

purpose_educational 39177 non-null uint8

purpose_home_improvement 39177 non-null uint8

purpose_house 39177 non-null uint8

purpose_major_purchase 39177 non-null uint8

purpose_medical 39177 non-null uint8

purpose_moving 39177 non-null uint8

purpose_other 39177 non-null uint8

purpose_renewable_energy 39177 non-null uint8

purpose_small_business 39177 non-null uint8

purpose_vacation 39177 non-null uint8

purpose_wedding 39177 non-null uint8

term_ 36 months 39177 non-null uint8

term_ 60 months 39177 non-null uint8

dtypes: float64(12), int64(3), uint8(24)

memory usage: 5.7 MB

</class>

We’re all set! Congrats, you’ve just cleaned a huge data set for machine learning and learned some data skills in the process. There’s still one more step we need to be done, though.

Save to CSV

Save, save, save! It’s always smart to store the final output of each part of your workflow in a separate csv file. This allows you to make changes in the processing flow without having to recalculate everything we’ve already done.

As we did before, we can store our DataFrame as a CSV using the handy pandas to_csv() function.

filtered_loans.to_csv("processed_data/cleaned_loans_2007.csv",index=False)

Now What? Next Steps

Here we have covered the steps you need to work through a large data set and clean and prep the data for a machine learning project. But there’s always more to learn, and many places to learn from.

If you are happy with your data cleaning skills and you want to work more with this data set, check out our interactive machine learning walkthrough course that covers the next steps in working with this Lending Club data.

For more information and learning opportunities about data cleaning, check out these courses below. Happy learning!

User Experience

5 Industries Leading The Charge In Smart Data Interpretation And Strategic Advantage

November 28, 2024

Data Cleaning and Preparation for Machine Learning

Understanding the Data

Prior to cleaning data for machine learning, you must understand what your data is and what you want to accomplish with it

How to Clean your Data for Machine Learning

Step 1: Examine the Data Set

Step 2: Narrow Down Columns for Cleaning

Investigating FICO Score Columns

Pick a Target Column

Visualize the Target Column Outcomes

Remove the Columns with Only One Value

Step 3: Preparing the Features for Machine Learning

Handle Missing Values

Investigate Categorical Columns

Convert Categorical Columns to Numeric Features

Save to CSV

Now What? Next Steps

5 Industries Leading The Charge In Smart Data Interpretation And Strategic Advantage

Newsletter