Data science | Data preprocessing
Data preprocessing is a data mining technique that turns raw data gathered from diverse sources into cleaner information that is more suitable for analysis. In other words, it is a preliminary step that takes all of the available information and organizes, sorts, and merges it.
There are many preprocessing methods, but we will mainly focus on the following:
(1) Encoding the Data
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
Dataset Description
The Iris dataset contains the petal and sepal width and length for three flower species: Iris-setosa, Iris-versicolor, and Iris-virginica.
Here, we can observe that the dataset contains 5 numeric columns and 1 column (Species) of object type. Among the numeric columns, the first is an Id and the other 4 hold the width and length of the sepal and petal.
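A minimal sketch of loading and inspecting the data with pandas, assuming the Kaggle CSV export (named Iris.csv here), which includes the Id column described above:

```python
import pandas as pd

# "Iris.csv" is an assumed filename matching the Kaggle export:
# an Id column, four measurement columns, and a Species column.
df = pd.read_csv("Iris.csv")

df.info()        # 5 numeric columns (Id + 4 measurements), 1 object column (Species)
print(df.head())
```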
Encoding the Data
Encoding is the conversion of categorical features to numeric values, since machine learning models cannot handle text data directly. The performance of many machine learning algorithms varies depending on how the categorical data is encoded. Two popular techniques for converting categorical values to numeric values are:
- Label Encoding
- One Hot Encoding
Label Encoding
Label encoding refers to converting the labels into numeric form so that they become machine-readable. Machine learning algorithms can then operate on those labels more effectively. It is an important preprocessing step for structured datasets in supervised learning.
As you can see, the ‘Species’ column has 3 categories of flower. After applying a label encoder, each category is replaced by an integer label.
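A quick sketch using scikit-learn’s LabelEncoder, assuming the DataFrame loaded earlier:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Species"] = le.fit_transform(df["Species"])

# Classes are assigned integers in sorted order: 0, 1, 2
print(le.classes_)             # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(df["Species"].unique())  # [0 1 2]
```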
One Hot Encoding
Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. This ordering issue is addressed by a common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and that column is assigned a 1 or 0 (notation for true/false) value.
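A sketch of one-hot encoding with pandas’ get_dummies, applied to the original string-valued Species column (not the label-encoded one):

```python
import pandas as pd

# One new column per category; each row gets a 1 in its own species
# column and 0 in the others.
encoded = pd.get_dummies(df, columns=["Species"], prefix="Species", dtype=int)
print(encoded.head())
```

scikit-learn’s OneHotEncoder achieves the same result and is the usual choice inside a model pipeline; get_dummies is the more convenient option for quick DataFrame work.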
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
Here’s the formula for normalization:

X' = (X - Xmin) / (Xmax - Xmin)

Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.
- When the value of X is the minimum value in the column, the numerator will be 0, and hence X’ is 0
- On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator and thus the value of X’ is 1
- If the value of X is between the minimum and the maximum value, then the value of X’ is between 0 and 1
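A sketch with scikit-learn’s MinMaxScaler; the column names below follow the Kaggle CSV and may need adjusting for other copies of the dataset:

```python
from sklearn.preprocessing import MinMaxScaler

feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

scaler = MinMaxScaler()  # implements X' = (X - Xmin) / (Xmax - Xmin)
df[feature_cols] = scaler.fit_transform(df[feature_cols])

print(df[feature_cols].min())  # 0.0 for every feature
print(df[feature_cols].max())  # 1.0 for every feature
```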
Standardization
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Here’s the formula for standardization:

X' = (X - μ) / σ

Here, μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.
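A sketch with scikit-learn’s StandardScaler, under the same column-name assumption as above:

```python
from sklearn.preprocessing import StandardScaler

feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

scaler = StandardScaler()  # implements X' = (X - mu) / sigma
scaled = scaler.fit_transform(df[feature_cols])

print(scaled.mean(axis=0))  # approximately 0 for every feature
print(scaled.std(axis=0))   # approximately 1 for every feature
```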
Imputing the Missing Values
Missing data are values that are not recorded in a dataset. A single value may be missing from a single cell, or an entire observation (row) may be missing. Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).
We can handle missing values in two ways:
- Remove the data (whole row) which have missing values.
- Fill in the values using a strategy such as the mean, median, or most frequent value (e.g. with scikit-learn’s SimpleImputer), as shown in the sketch below.
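A sketch of both options using pandas and scikit-learn’s SimpleImputer; note that the Iris dataset itself has no missing values, so this is purely illustrative:

```python
from sklearn.impute import SimpleImputer

feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Option 1: drop every row that contains a missing value
df_dropped = df.dropna()

# Option 2: fill missing values with a column statistic
imputer = SimpleImputer(strategy="mean")  # "median" / "most_frequent" also work
df[feature_cols] = imputer.fit_transform(df[feature_cols])
```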
Discretization
Data discretization refers to converting a large number of data values into a smaller set so that the evaluation and management of the data become easier. Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes.
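A sketch of equal-width binning with pandas’ cut; the bin labels and the PetalLengthBin column name are made up for illustration:

```python
import pandas as pd

# Bin petal length into three equal-width intervals; the new column holds
# nominal categories instead of continuous values.
df["PetalLengthBin"] = pd.cut(df["PetalLengthCm"], bins=3,
                              labels=["short", "medium", "long"])
print(df["PetalLengthBin"].value_counts())
```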
You can find the full code at the link below: