Day 2: Data Pre-processing

Preprocessing

Preprocessing is the process of converting data into a suitable format, ready to be used for modeling. It takes raw data as input and transforms it into a format that can be understood and analyzed by computers and machine learning models.

Preprocessing is the most crucial step in data science, as it decides the quality of the data.

“Garbage in, garbage out” is a commonly used phrase in this context. It means that if you use bad or “dirty” data to train your model, you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis.

In contrast, the phrase that should describe good preprocessing is “quality in, quality out”, and that quality of the output is decided at the preprocessing stage itself.

Importance of Preprocessing

Powerful data makes your model powerful

Good, preprocessed data is more important than the most powerful algorithms, to the point that machine learning models trained with bad data could actually be harmful to the analysis you’re trying to do – giving you “garbage” results.

Steps involved in preprocessing

In order to ensure quality input, the following steps are to be followed as a part of preprocessing:

Step 1: Data quality assessment

Step 2: Data cleaning

Step 3: Data transformation

Step 4: Data reduction

Step 1: Data quality assessment

Data quality assessment deals with assessing the basic quality of the data by checking factors such as the count of missing values, uniformity in column values, and uniformity in column names.

When data is collected from various sources, or when different users are involved in data collection/curation, there may be mismatches in the rows and columns of the data. Data quality assessment helps us identify such mismatches, if any, and build good-quality data.

Data quality assessment includes finding duplicate instances and checking whether any NaN values exist. It also involves checking the uniformity of the data: for numeric data, for example, the number of digits after the decimal point; for categorical data, using the same notation to denote the values. If gender is recorded, it should be specified uniformly as Male and Female throughout the dataset, and not sometimes as he/she or M and F. Other checks include maintaining uniformity in column names, finding redundant columns and dropping them, and using common physical scales of measurement to specify values (Kg, Rs, Km, Mbps, etc.).
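As a small illustration, the checks above can be sketched with pandas on a hypothetical toy dataset (the column names and values here are assumptions made purely for the example):

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", "Meena"],
    "Gender": ["Female", "M", "M", "she"],
    "Salary": [52000.50, 48000.25, 48000.25, None],
})

# Count duplicate instances.
print("Duplicate rows:", df.duplicated().sum())

# Count missing (NaN) values per column.
print(df.isna().sum())

# Inspect the notations used in a categorical column
# (here Gender mixes "Female", "M" and "she" and should be made uniform).
print(df["Gender"].unique())

# Check uniformity of column names (stray spaces, mixed case, redundancy).
print(df.columns.tolist())
```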

Step 2: Data cleaning

If any mismatches are found in the quality assessment step, they are handled in the data cleaning step.

The data cleaning step involves data scaling, handling noise through filtering, dealing with outliers, imputing missing values etc.

An outlier may occur due to variability in the data, or due to experimental or human error. Outliers affect the mean of the sample, so it is important to handle them.

Outliers can sometimes be dealt with by scaling. However, the three main ways of handling outliers are trimming/removing the outlier; quantile-based flooring and capping (data points lower than the 10th percentile are replaced with the 10th-percentile value, and data points greater than the 90th percentile are replaced with the 90th-percentile value); and mean/median imputation.
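A minimal sketch of quantile-based flooring and capping (and, alternatively, median imputation) with pandas, using an assumed toy series:

```python
import pandas as pd

# Hypothetical numeric column with extreme values, for illustration only.
s = pd.Series([12, 15, 14, 13, 16, 120, 11, 14, 15, 2])

# Quantile-based flooring and capping: values below the 10th percentile
# are raised to it, values above the 90th percentile are lowered to it.
low, high = s.quantile(0.10), s.quantile(0.90)
capped = s.clip(lower=low, upper=high)
print(capped.tolist())

# Alternative: replace the outliers with the median instead of capping them.
imputed = s.where(s.between(low, high), s.median())
print(imputed.tolist())
```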

Missing values can be imputed by whichever of the following processes suits the data best: replace by the mean, replace by the previous/next value, replace by the most frequent value, or drop the instances if they are minimal.
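A brief sketch of these imputation options with pandas, on an assumed toy column:

```python
import pandas as pd

# Hypothetical column with missing values, for illustration only.
df = pd.DataFrame({"Age": [25.0, None, 31.0, 29.0, None, 40.0]})

# Replace by the mean.
df["Age_mean"] = df["Age"].fillna(df["Age"].mean())

# Replace by the previous value (forward fill) or the next value (backward fill).
df["Age_ffill"] = df["Age"].ffill()
df["Age_bfill"] = df["Age"].bfill()

# Replace by the most frequent value (mode).
df["Age_mode"] = df["Age"].fillna(df["Age"].mode()[0])

# Or simply drop the rows with missing values if they are minimal.
df_dropped = df.dropna(subset=["Age"])
print(df)
```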

Step 3: Data transformation

Data transformation is the process of changing the shape of data from one form to another.

Techniques involved in data transformation include:

Aggregation: Data aggregation deals with combining all the data collected from various sources into a uniform format.

Normalization: Data normalization deals with scaling the data into a regular range so that values can be compared more accurately. It is not meaningful to compare a two-digit age directly with a six-digit salary, so the numeric values are scaled into a certain range. For example: 0 to 1 scaling, min-max scaling, -1 to 1 scaling, etc.
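For example, min-max (0 to 1) scaling can be sketched directly with pandas; the two-column toy frame below is an assumption made for the example:

```python
import pandas as pd

# Hypothetical two-digit age and six-digit salary, for illustration only.
df = pd.DataFrame({"Age": [22, 35, 58, 41],
                   "Salary": [250000, 600000, 120000, 480000]})

# Min-max scaling brings every column into the 0-to-1 range,
# so the columns become directly comparable.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```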

Feature selection: Feature selection is the process of identifying strong and relevant features that can improve the performance of the model.

Discretization: Discretization is the mechanism of segregating the data into various intervals. When the data is discretized into only two groups, it is called binarization. Discretization can be achieved through entropy-based splitting, histogram analysis, binning, cluster analysis, etc.
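A short sketch of binning-based discretization (and binarization) with pandas, on an assumed age column:

```python
import pandas as pd

# Hypothetical age values, for illustration only.
age = pd.Series([5, 17, 23, 34, 45, 61, 72])

# Equal-width binning into three intervals.
print(pd.cut(age, bins=3, labels=["young", "middle", "old"]))

# Binarization: discretizing into exactly two groups around a threshold.
print((age >= 18).astype(int))
```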

Continuization: Continuization is the opposite of discretization. Here the categorical variables are assigned numeric values. Eg: one-hot encoding, ordinal encoding, dummy variable encoding.

One-hot encoding is performed by first sorting the categories in alphabetical order and then representing each category with a combination of zeros and ones.

Eg: Let us say the gender feature has 3 categories (Male, Female, Transgender). One-hot encoding is achieved in the following way:

Step 1: Sort the categories of the feature: Female - Male - Transgender

Step 2: Represent each sorted category in binary notation: Female —> [1 0 0], Male —> [0 1 0], Transgender —> [0 0 1]
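The same result can be sketched with pandas, whose get_dummies orders the output columns alphabetically (Female, Male, Transgender), matching the sorted-category step above:

```python
import pandas as pd

# Hypothetical gender column, for illustration only.
gender = pd.Series(["Male", "Female", "Transgender", "Female"], name="Gender")

# One binary column per category, columns sorted alphabetically.
one_hot = pd.get_dummies(gender)
print(one_hot.astype(int))
```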

Ordinal encoding is numbering the categories of a variable that follow some order. Eg: An income column has 3 categories, Low, Mid, and High, which share an ordered association; hence Low is represented as 1, Mid as 2, and High as 3, following ascending order.
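A minimal sketch of ordinal encoding via an explicit mapping (the income values are assumed for the example):

```python
import pandas as pd

# Hypothetical income column with an ordered association, for illustration only.
income = pd.Series(["Low", "High", "Mid", "Low", "Mid"])

# An explicit mapping preserves the order Low < Mid < High.
order = {"Low": 1, "Mid": 2, "High": 3}
print(income.map(order))
```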

Dummy Variable Encoding: The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “Female” and [0, 1, 0] represents “Male”, we don’t need another binary variable to represent “Transgender”; instead we could use 0 values for both “Female” and “Male”, e.g. [0, 0]. In other words, if the gender is neither Female nor Male, then it is Transgender.

This is called dummy variable encoding, and it always represents C categories with C-1 binary variables. This means three categories can be represented using a 2-bit combination.

Using dummy encoding, the categories are represented as: Female —> [0 1], Male —> [1 0], Transgender —> [0 0]
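A quick sketch of dummy variable encoding with pandas; note that get_dummies(drop_first=True) drops the alphabetically first category (Female here), so the exact column layout differs from the hand-worked example above, but the C categories are still represented with C-1 binary columns:

```python
import pandas as pd

# Hypothetical gender column, for illustration only.
gender = pd.Series(["Male", "Female", "Transgender", "Female"], name="Gender")

# Dropping the first category leaves C-1 = 2 binary columns.
dummies = pd.get_dummies(gender, drop_first=True)
print(dummies.astype(int))
```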

Step 4: Data reduction

Data reduction may involve Numerosity reduction, feature selection, and feature extraction.

Numerosity reduction deals with selecting only those rows and columns that can contribute positively to modeling. Usually, regression methods are used for performing numerosity reduction.

Dimensionality reduction: Dimensionality reduction is the process of reducing the columns of the dataset by following some selection criteria or by undergoing some transformation. Dimensionality reduction is achieved either through feature selection or feature extraction.

Feature selection: Based upon the significance value (rank) of the features, top N contributing features (subset) are selected as input to the model instead of giving the entire feature set as input.

The Relief method, correlation-based methods, chi-square, random forest-based methods, etc. can be used for selecting the features.
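As one illustration, chi-square based feature selection can be sketched with scikit-learn's SelectKBest; the feature matrix, labels, and k value below are assumptions made purely for the example:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative feature matrix and class labels, for illustration only.
X = np.array([[1, 20, 3],
              [2, 22, 1],
              [3, 19, 7],
              [4, 25, 2],
              [5, 21, 9]])
y = np.array([0, 0, 1, 1, 1])

# Keep the top 2 features ranked by the chi-square statistic.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # significance score of each original feature
print(X_selected.shape)   # reduced feature subset
```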

Feature extraction: Feature extraction is a type of dimensionality reduction where we obtain a reduced set of columns by projecting the data into another domain, that is, by applying a data transformation.
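As an illustration, principal component analysis (PCA) is one common feature extraction technique; the toy matrix and component count below are assumed for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 4-column dataset, for illustration only.
X = np.array([[2.5, 2.4, 0.5, 1.1],
              [0.5, 0.7, 2.2, 0.9],
              [2.2, 2.9, 0.8, 1.0],
              [1.9, 2.2, 1.1, 1.3],
              [3.1, 3.0, 0.4, 0.8]])

# Project the data onto 2 new components; the original columns are
# transformed, not merely selected, so some information may be lost.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```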

The main difference between feature selection and feature extraction is that in feature selection the original features do not undergo any changes; their values remain the same. In feature extraction, however, the data gets transformed and some loss of information occurs.