1. Descriptive Statistics: Measures of Central Tendency

Measures of central tendency summarize a dataset by identifying the center of the data distribution.

a. Mean

The mean, or arithmetic average, is one of the most common measures of central tendency. It is calculated by summing all data points and dividing by the total number of points. The formula is: \[ \text{Mean} = \frac{\sum X}{n} \] where \(X\) represents individual data points, and \(n\) is the total number of data points.

Advantages:

  • Easy to calculate and understand.

  • Uses all data points, giving a comprehensive view of the dataset.

Disadvantages:

  • Highly sensitive to outliers.

  • In skewed distributions it is pulled toward the long tail, so it may not represent a typical value.

Applications:

  • Commonly used in fields like economics, engineering, and education to provide an average value, e.g., calculating the average income in a region.

Example:

# Sample data
data <- c(5, 8, 10, 6, 7, 8, 9)

# Calculate mean
mean_value <- mean(data)
mean_value  # Output: 7.571429
[1] 7.571429
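
As a quick check of the formula, the mean can also be computed by hand and compared with mean(). The sketch below additionally appends a hypothetical outlier (the value 100, which is not part of the original data) to illustrate how strongly a single extreme value moves the result.

```r
# Sample data from the example above
data <- c(5, 8, 10, 6, 7, 8, 9)

# Mean from the formula: sum of the values divided by their count
manual_mean <- sum(data) / length(data)
all.equal(manual_mean, mean(data))  # TRUE

# Hypothetical outlier, added for illustration only
data_outlier <- c(data, 100)
mean(data_outlier)  # 19.125
```

One extra value more than doubles the mean here, which is the sensitivity to outliers listed under the disadvantages.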

b. Median

The median is the middle value in a sorted dataset. If the dataset has an odd number of values, the median is the center value. For an even number of values, the median is the average of the two middle numbers.

Advantages:

  • Less affected by outliers compared to the mean.

  • Provides a better central value for skewed distributions.

Disadvantages:

  • Uses only the order of the values rather than their magnitudes, so it can be less precise than the mean when the data are roughly symmetric.

Applications:

  • Often used for income data, where a few very high values skew the distribution and the median gives a more representative central value than the mean.

Example:

# Calculate median
median_value <- median(data)
median_value  # Output: 8
[1] 8
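
The odd and even cases described above can be traced by hand. This is a minimal sketch; the extra value 100 is a hypothetical outlier appended only to create an even-sized dataset.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)

# Odd number of values: the middle element of the sorted data
sorted <- sort(data)
sorted[(length(sorted) + 1) / 2]  # 8

# Even number of values (hypothetical extra point 100 appended):
# the median is the average of the two middle values
data_even <- c(data, 100)
sorted_even <- sort(data_even)  # 5 6 7 8 8 9 10 100
n <- length(sorted_even)
manual_median <- (sorted_even[n / 2] + sorted_even[n / 2 + 1]) / 2
manual_median      # 8
median(data_even)  # 8, unchanged by the outlier
```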

c. Mode

The mode is the most frequently occurring value in a dataset. A dataset may have one mode (unimodal), more than one mode (bimodal, multimodal), or no mode at all.

Advantages:

  • Useful for categorical data.

  • Can easily highlight the most common value in a dataset.

Disadvantages:

  • May not exist (when no value repeats) or may not be unique (when several values tie for the highest frequency).

Applications:

  • Commonly used in marketing to determine the most frequent consumer preferences.

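R has no built-in function for the statistical mode (the base mode() function returns an object's storage type instead), so we need to create one. When several values tie, the function below returns only the first one that reaches the highest frequency.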

Example:

# Custom function to calculate mode
get_mode <- function(v) {
  uniq_v <- unique(v)
  uniq_v[which.max(tabulate(match(v, uniq_v)))]
}

# Calculate mode
mode_value <- get_mode(data)
mode_value  # Output: 8
[1] 8
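
Because get_mode() reports a single value, it hides bimodal or multimodal data. One possible variant, sketched here with a hypothetical helper named get_modes() (not part of base R), returns every value tied for the highest frequency.

```r
# Hypothetical helper: return all values tied for the highest frequency
get_modes <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

get_modes(c(5, 8, 10, 6, 7, 8, 9))  # 8
get_modes(c(1, 1, 2, 2, 3))         # 1 2 (bimodal)
get_modes(c("a", "b", "b", "c"))    # "b"
```

Note that when no value repeats, this sketch returns every value, since all of them share the same frequency of one.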

2. Measures of Dispersion

These measures describe the spread or variability within a dataset.

a. Standard Deviation

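The standard deviation measures the amount of variation in a dataset. It is the square root of the variance. For a sample, which is what R's sd() function computes, the formula is: \[ \text{Standard Deviation} = \sqrt{\frac{\sum (X - \text{Mean})^2}{n - 1}} \] Dividing by \(n\) instead of \(n - 1\) gives the population standard deviation.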

Advantages:

  • Provides insight into how spread out the values are from the mean.

Disadvantages:

  • Can be difficult to interpret without context, especially for skewed distributions.

Applications:

  • Widely used in finance to measure market volatility.

Example:

# Calculate standard deviation
sd_value <- sd(data)
sd_value  # Output: 1.718249
[1] 1.718249
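
R's sd() uses the sample formula, which divides by n - 1. The sketch below reproduces that result by hand and compares it with the population version, which divides by n.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)
n <- length(data)
squared_dev <- (data - mean(data))^2

# Sample standard deviation (divides by n - 1); this is what sd() returns
sample_sd <- sqrt(sum(squared_dev) / (n - 1))
all.equal(sample_sd, sd(data))  # TRUE

# Population standard deviation (divides by n) is slightly smaller
population_sd <- sqrt(sum(squared_dev) / n)
population_sd < sample_sd  # TRUE
```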

b. Variance

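Variance is the average of the squared differences from the mean, so it measures dispersion in squared units. For a sample, the sum of squared differences is divided by \(n - 1\) rather than \(n\), which is what R's var() function computes: \[ \text{Variance} = \frac{\sum (X - \text{Mean})^2}{n - 1} \]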

Advantages:

  • Uses every data point and has convenient mathematical properties; for example, the variances of independent variables add.

Disadvantages:

  • It is expressed in squared units of the data, which makes it harder to interpret than the standard deviation.

Applications:

  • Used in various statistical models, including machine learning algorithms.

Example:

# Calculate variance
variance_value <- var(data)
variance_value  # Output: 2.952381
[1] 2.952381
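
The link between variance and standard deviation, and the effect of squared units, can be checked directly. In this sketch the factor of 10 is an arbitrary rescaling chosen for illustration, standing in for a change of units.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)

# Variance is the square of the standard deviation
all.equal(var(data), sd(data)^2)  # TRUE

# Rescaling the data by 10 multiplies the standard deviation by 10
# but the variance by 100
all.equal(sd(data * 10), 10 * sd(data))     # TRUE
all.equal(var(data * 10), 100 * var(data))  # TRUE
```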

c. Range

The range is the difference between the maximum and minimum values in the dataset.

Advantages:

  • Simple to calculate and understand.

Disadvantages:

  • Depends only on the two most extreme values, so it is highly sensitive to outliers and ignores how the data are distributed between them.

Applications:

  • Used in quick assessments of variability, such as stock price analysis.

Example:

# Calculate range
range_value <- range(data)
range_value_diff <- diff(range_value)
range_value_diff  # Output: 5
[1] 5

3. Interquartile Range (IQR)

The Interquartile Range (IQR) measures the spread of the middle 50% of data points. It is calculated as the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile): \[ \text{IQR} = Q3 - Q1 \] The IQR is a robust measure of spread, particularly for skewed distributions, since it is largely unaffected by outliers.

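Advantages:

  • Robust to outliers, because it depends only on the middle 50% of the data.

Disadvantages:

  • Ignores the values below Q1 and above Q3, so it says nothing about the tails of the distribution.

Applications:

  • Used to draw box plots and to flag potential outliers.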

Example:

# Calculate IQR
iqr_value <- IQR(data)
iqr_value  # Output: 2
[1] 2
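
To see where this value comes from, the quartiles can be computed explicitly with quantile(), which uses R's default interpolation method (type 7). The sketch also appends a hypothetical outlier (100, not part of the original data) to compare how the range and the IQR react.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)

# Quartiles using R's default quantile method (type 7)
q <- quantile(data, probs = c(0.25, 0.75))
q                    # 25% = 6.5, 75% = 8.5
unname(q[2] - q[1])  # 2, the same as IQR(data)

# Hypothetical outlier, added for illustration only
data_outlier <- c(data, 100)
diff(range(data_outlier))  # 95 (was 5)
IQR(data_outlier)          # 2.5 (was 2)
```

Other quantile definitions (the type argument of quantile() and IQR()) can give slightly different quartiles for small datasets.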

4. Sampling Distribution

A sampling distribution is the probability distribution of a statistic computed from random samples of a fixed size drawn from a population. The sampling distribution of the sample mean, for example, describes how the sample mean varies when many samples are drawn from the same population.

Advantages:

Disadvantages:

Applications:

Example:

# Simulate a sampling distribution of the mean
set.seed(123)
sample_means <- replicate(1000, mean(sample(data, size = 5, replace = TRUE)))

# Plot the sampling distribution
hist(sample_means, main = "Sampling Distribution of the Mean", xlab = "Mean")
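
The simulation can be compared with theory. This is a sketch under the assumption that the seven values in data are treated as the entire population: the sample means should then centre on the population mean, and their standard deviation should be close to the standard error, the population standard deviation divided by the square root of the sample size. The names sample_size, pop_sd, and theoretical_se, and the choice of 10,000 replications, are introduced here for illustration.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)
set.seed(123)
sample_size <- 5
sample_means <- replicate(
  10000,
  mean(sample(data, size = sample_size, replace = TRUE))
)

# Treating `data` as the whole population, its standard deviation divides by n
pop_sd <- sqrt(mean((data - mean(data))^2))

# Theoretical standard error of the mean for samples of this size
theoretical_se <- pop_sd / sqrt(sample_size)

# Compare simulation with theory
c(population_mean = mean(data), mean_of_sample_means = mean(sample_means))
c(theoretical_se = theoretical_se, simulated_se = sd(sample_means))
```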

5. Probability Distributions

A probability distribution describes all possible values of a random variable and the probabilities (or, for continuous variables, probability densities) associated with those values.

a. Normal Distribution

The Normal Distribution, also known as the Gaussian distribution, is characterized by a bell-shaped curve. Most values cluster around the mean, with symmetrical tails extending in both directions. The probability density function for a normal distribution is: \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \] where \(\mu\) is the mean, and \(\sigma\) is the standard deviation.

Advantages:

  • Many real-world phenomena are approximately normally distributed, making this distribution widely applicable.

Disadvantages:

  • Assumes data are symmetrically distributed, which may not always be the case.

Applications:

  • Used in hypothesis testing, quality control, and in finance for risk assessment.

Example:

# Generate a sample from a normal distribution
normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot the normal distribution
hist(normal_data, breaks = 30, probability = TRUE, main = "Normal Distribution")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "red", lwd = 2)
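
The density formula above can be written out directly and checked against R's dnorm(). The helper normal_pdf() is a hypothetical name used only for this sketch, which also uses pnorm() to compute the probability of falling within one, two, and three standard deviations of the mean.

```r
# Density from the formula, written out by hand (hypothetical helper)
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(-2, 0, 1.5)
all.equal(normal_pdf(x), dnorm(x))                # TRUE
all.equal(normal_pdf(x, 10, 3), dnorm(x, 10, 3))  # TRUE

# Probability of falling within 1, 2, and 3 standard deviations of the mean
k <- 1:3
round(pnorm(k) - pnorm(-k), 4)  # 0.6827 0.9545 0.9973
```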

b. Binomial Distribution

The Binomial Distribution represents the number of successes in a fixed number of independent trials, each with the same probability of success. Its probability mass function is: \[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \] where \(n\) is the number of trials, \(k\) is the number of successes, and \(p\) is the probability of success.

Advantages:

  • Useful for modeling counts of binary (success/failure) outcomes.

Disadvantages:

  • Assumes independent trials and constant probability, which may not always hold in practice.

Applications:

  • Used in quality control, genetics, and risk modeling.

Example:

# Generate a sample from a binomial distribution
binom_data <- rbinom(1000, size = 10, prob = 0.5)

# Plot the binomial distribution (one bin per integer outcome from 0 to 10)
hist(binom_data, breaks = seq(-0.5, 10.5, by = 1), probability = TRUE, main = "Binomial Distribution")
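
The probability mass function can also be evaluated exactly rather than simulated. The sketch below writes out the formula with choose() and compares it with R's dbinom(), using the same settings as the simulation (10 trials, success probability 0.5).

```r
n <- 10
p <- 0.5
k <- 0:n

# Probability mass function written out from the formula
manual_pmf <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(manual_pmf, dbinom(k, size = n, prob = p))  # TRUE
sum(manual_pmf)                                       # 1

# Probability of exactly 5 successes in 10 trials
dbinom(5, size = 10, prob = 0.5)  # 0.2460938
```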

Summary

These concepts are foundational in statistics and data analysis, helping us summarize and understand data effectively.

