The steps below walk through two common data preprocessing operations in R, handling missing data and applying min-max normalization, using the built-in mtcars dataset as an example:
First, load the dataset into R.
data <- mtcars
head(data)
Let’s simulate missing data and then handle it.
mtcars has no missing values of its own, so introduce some for demonstration: set five randomly chosen rows of the mpg column to NA.
set.seed(123)
data[sample(1:nrow(data), 5), "mpg"] <- NA
Data after introducing the NA values in the dataset.
data
Check for missing data. summary() adds an NA's row for any column that contains missing values:
summary(data)
     mpg             cyl             disp
Min.   :10.40   Min.   :4.000   Min.   : 71.1
1st Qu.:16.10   1st Qu.:4.000   1st Qu.:120.8
Median :19.20   Median :6.000   Median :196.3
Mean   :20.34   Mean   :6.188   Mean   :230.7
3rd Qu.:22.15   3rd Qu.:8.000   3rd Qu.:326.0
Max.   :33.90   Max.   :8.000   Max.   :472.0
NA's   :5

      hp             drat             wt
Min.   : 52.0   Min.   :2.760   Min.   :1.513
1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581
Median :123.0   Median :3.695   Median :3.325
Mean   :146.7   Mean   :3.597   Mean   :3.217
3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610
Max.   :335.0   Max.   :4.930   Max.   :5.424

     qsec             vs               am
Min.   :14.50   Min.   :0.0000   Min.   :0.0000
1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000
Median :17.71   Median :0.0000   Median :0.0000
Mean   :17.85   Mean   :0.4375   Mean   :0.4062
3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000
Max.   :22.90   Max.   :1.0000   Max.   :1.0000

     gear            carb
Min.   :3.000   Min.   :1.000
1st Qu.:3.000   1st Qu.:2.000
Median :4.000   Median :2.000
Mean   :3.688   Mean   :2.812
3rd Qu.:4.000   3rd Qu.:4.000
Max.   :5.000   Max.   :8.000
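For a wide dataset, reading the NA's row out of summary() can be tedious. A more direct check, sketched here on a fresh copy of the same example (the name na_counts is only illustrative), counts the missing values per column with base R:

```r
# Self-contained copy of the example: mtcars with 5 NAs injected into mpg
data <- mtcars
set.seed(123)
data[sample(1:nrow(data), 5), "mpg"] <- NA

# Number of missing values in each column
na_counts <- colSums(is.na(data))

# Show only the columns that actually contain NAs
na_counts[na_counts > 0]

# Number of rows that contain at least one NA
sum(!complete.cases(data))
```

Both calls should report 5 here, since the five NAs sit in five different rows of a single column.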
There are several ways to handle missing data, such as removing rows with missing data, replacing with the mean/median/mode, or using more advanced methods.
Remove Rows with Missing Data: Remove rows with any missing values
data_cleaned <- na.omit(data)
Display the cleaned data after removing missing values.
data_cleaned
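Row removal is simple, but it discards every other value in the affected rows as well. The sketch below, which rebuilds the example so it can run on its own (rows_dropped and dropped_cars are illustrative names), shows how much data is lost:

```r
# Rebuild the example: mtcars with 5 NAs injected into mpg
data <- mtcars
set.seed(123)
data[sample(1:nrow(data), 5), "mpg"] <- NA

# na.omit() drops every row that has at least one NA
data_cleaned <- na.omit(data)

# How many rows were lost, and which cars they were
rows_dropped <- nrow(data) - nrow(data_cleaned)
dropped_cars <- setdiff(rownames(data), rownames(data_cleaned))
rows_dropped
dropped_cars
```

With 32 rows to start from, losing 5 removes roughly 16% of the observations, which is one reason imputation is often preferred on small datasets.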
Replace Missing Values with Mean: Replace NA with mean of the column
data$mpg[is.na(data$mpg)] <- mean(data$mpg, na.rm = TRUE)
Display the cleaned data after replacing NA with mean of the column.
data
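The median is mentioned above as an alternative to the mean, and it tends to be less sensitive to outliers. A minimal sketch of median imputation, working on a separate copy (data_median, na_rows and mpg_median are illustrative names, not part of the original walkthrough):

```r
# Separate copy of the example so the mean-imputed data is left untouched
data_median <- mtcars
set.seed(123)
na_rows <- sample(1:nrow(data_median), 5)
data_median[na_rows, "mpg"] <- NA

# Median of the observed (non-missing) mpg values
mpg_median <- median(data_median$mpg, na.rm = TRUE)

# Fill every NA in mpg with that median
data_median$mpg[is.na(data_median$mpg)] <- mpg_median

# The previously missing rows now all hold the median
data_median[na_rows, "mpg"]
```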
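Min-max normalization rescales each numeric column to the range 0 to 1 by mapping every value x to (x - min) / (max - min), where min and max are that column's smallest and largest values.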
Different ranges: Features in a dataset may have different units or scales (e.g., age might range from 0 to 100, while income could range from thousands to millions). This inconsistency can lead to problems, especially for algorithms that rely on distance calculations or assume a uniform scale for all features.
Normalization rescales the features to a consistent range, typically between 0 and 1, or sometimes -1 to 1, ensuring that no single feature dominates the learning process because of its scale.
Imagine a dataset with two features: age (ranging from 18 to 80) and income (ranging from $10,000 to $100,000). Without normalization, algorithms might prioritize income simply because its scale is much larger, leading to biased and potentially inaccurate predictions. Min-max normalization scales both features to a range of 0 to 1, treating them more equally during the model training process.
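The age and income scenario can be worked through by hand. The three rows below are made-up values chosen only to match the ranges in the text; the point is that a mid-range age and a mid-range income end up with the same normalized value:

```r
# Hypothetical values: the endpoints and midpoint of each range from the text
people <- data.frame(
  age    = c(18, 49, 80),
  income = c(10000, 55000, 100000)
)

# Apply (x - min) / (max - min) to each column
scaled <- data.frame(
  age    = (people$age - min(people$age)) /
           (max(people$age) - min(people$age)),
  income = (people$income - min(people$income)) /
           (max(people$income) - min(people$income))
)

scaled
# Both columns become 0.0, 0.5, 1.0
```

Before scaling, the income column differs from age by three orders of magnitude; afterwards the two are directly comparable.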
Define the min-max normalization function
min_max_normalize <- function(x) {
  # Rescale x to [0, 1]; NAs are ignored when finding the range
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
Apply min-max normalization to all numeric columns
# Copy first so that row names (the car models) are preserved
data_normalized <- data
data_normalized[] <- lapply(data, function(x) {
  if (is.numeric(x)) min_max_normalize(x) else x
})
Display the first few rows of the normalized data
head(data_normalized)
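The text above notes that features are sometimes rescaled to -1 to 1 instead of 0 to 1. One possible generalization is sketched below. The helper rescale_to_range and its new_min / new_max parameters are hypothetical, and sending a constant column to new_min is an assumption made here to avoid dividing by zero, not a standard rule:

```r
# Hypothetical helper: rescale x to [new_min, new_max]
rescale_to_range <- function(x, new_min = 0, new_max = 1) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Constant column: max - min is 0, so map every value to new_min
    # (x - rng[1] is 0 for observed values and stays NA for missing ones)
    return(x - rng[1] + new_min)
  }
  new_min + (x - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min)
}

# Rescale every column of mtcars to [-1, 1], keeping the row names
scaled <- mtcars
scaled[] <- lapply(mtcars, rescale_to_range, new_min = -1, new_max = 1)

# Sanity check: each column's range should now be -1 to 1
col_ranges <- sapply(scaled, range)
col_ranges
```

Running sapply(data_normalized, range) on the 0 to 1 result gives the same kind of check for the main example.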
To recap, the workflow has three steps:
Load the dataset.
Identify and handle missing data by either removing the affected rows or imputing the missing values.
Apply min-max normalization to scale the numeric columns.