1. Descriptive Statistics: Measures of Central Tendency
Measures of central tendency summarize a dataset by identifying the
center of the data distribution.
a. Mean
The mean, or arithmetic average, is one of the most
common measures of central tendency. It is calculated by summing all
data points and dividing by the total number of points. The formula is:
\[
\text{Mean} = \frac{\sum X}{n}
\] where \(X\) represents
individual data points, and \(n\) is
the total number of data points.
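Advantages:
- Uses every data point and is simple to compute.
Disadvantages:
- Sensitive to outliers and skewed data, which can pull it away from
the typical value.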
Applications:
- Commonly used in fields like economics, engineering, and education
to provide an average value, e.g., calculating the average income in a
region.
Example:
# Sample data
data <- c(5, 8, 10, 6, 7, 8, 9)
# Calculate mean
mean_value <- mean(data)
mean_value # Output: 7.571429
[1] 7.571429
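To connect the formula to the built-in function, the following minimal sketch computes the mean by hand and compares it with mean(). It also shows how a single extreme value shifts the result. The names manual_mean and data_with_outlier, and the added value 100, are introduced here purely for illustration.

```r
# Sample data (same values as in the example above)
data <- c(5, 8, 10, 6, 7, 8, 9)

# Apply the formula directly: sum of all points divided by their count
manual_mean <- sum(data) / length(data)

# Compare with the built-in function
all.equal(manual_mean, mean(data))  # TRUE

# Hypothetical extreme value appended to illustrate sensitivity to outliers
data_with_outlier <- c(data, 100)
mean(data_with_outlier)  # 19.125
```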
c. Mode
The mode is the most frequently occurring value in a
dataset. A dataset may have one mode (unimodal), more than one mode
(bimodal, multimodal), or no mode at all.
Advantages:
- Works for categorical data, where a mean cannot be computed.
Disadvantages:
- May not exist when no value repeats, and may not be unique when
several values tie for the highest frequency.
Applications:
- Commonly used in marketing to determine the most frequent consumer
preferences.
R has no built-in function for the statistical mode (the base function
mode() reports an object's storage type instead), so we need to create
one.
Example:
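# Custom function to calculate mode
get_mode <- function(v) {
  uniq_v <- unique(v)
  # Count occurrences of each unique value and return the most frequent one
  # (on ties, the value that appears first in v is returned)
  uniq_v[which.max(tabulate(match(v, uniq_v)))]
}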
# Calculate mode
mode_value <- get_mode(data)
mode_value # Output: 8
[1] 8
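Since a dataset can have more than one mode, a function that returns only a single value can hide ties. The sketch below is one possible way to return every value tied for the highest frequency. The helper name get_all_modes is hypothetical, not a base R function.

```r
# Hypothetical helper: return all values tied for the highest frequency
get_all_modes <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # table() stores values as character names, so convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

data <- c(5, 8, 10, 6, 7, 8, 9)
get_all_modes(data)              # 8
get_all_modes(c(1, 1, 2, 2, 3))  # 1 2
```

When no value repeats, every value is tied at a count of one, so this helper returns all of them. That result corresponds to the "no mode" case and should be interpreted accordingly.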
2. Measures of Dispersion
These measures describe the spread or variability within a
dataset.
a. Standard Deviation
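The standard deviation measures the amount of
variation in a dataset. It is the square root of the variance. For a
sample, which is what R's sd() computes, the formula is:
\[
\text{Standard Deviation} = \sqrt{\frac{\sum (X - \text{Mean})^2}{n - 1}}
\]
Dividing by \(n\) instead of \(n - 1\) gives the population standard
deviation.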
Advantages:
- Provides insight into how spread out the values are from the
mean.
Disadvantages:
- Can be difficult to interpret without context, especially for skewed
distributions.
Applications:
- Widely used in finance to measure market volatility.
Example:
# Calculate standard deviation
sd_value <- sd(data)
sd_value # Output: 1.718249
[1] 1.718249
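R's sd() divides the sum of squared deviations by n - 1 (the sample standard deviation) rather than by n (the population version). The following sketch computes both by hand so the difference is visible. The variable names are illustrative only.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)
n <- length(data)
squared_dev <- (data - mean(data))^2

# Sample standard deviation: divide by n - 1 (this is what sd() does)
sample_sd <- sqrt(sum(squared_dev) / (n - 1))

# Population standard deviation: divide by n
population_sd <- sqrt(sum(squared_dev) / n)

all.equal(sample_sd, sd(data))  # TRUE
sample_sd > population_sd       # TRUE
```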
b. Variance
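Variance is the average of the squared differences
from the mean. For a sample, the sum of squared differences is divided
by \(n - 1\) rather than \(n\), which is what R's var() computes. It
measures dispersion in squared units of the data.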
Advantages:
- A comprehensive measure of variability.
Disadvantages:
- Expressed in squared units, which makes it harder to interpret than
the standard deviation.
Applications:
- Used in various statistical models, including machine learning
algorithms.
Example:
# Calculate variance
variance_value <- var(data)
variance_value # Output: 2.952381
[1] 2.952381
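The effect of squared units can be checked directly. In this small sketch, the rescaling factor of 10 is an arbitrary choice for illustration: it multiplies the variance by 100 but the standard deviation only by 10.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)

# The square root of the variance is the standard deviation
all.equal(sqrt(var(data)), sd(data))  # TRUE

# Rescaling the data by 10 scales the variance by 10^2 and the sd by 10
variance_ratio <- var(data * 10) / var(data)
sd_ratio <- sd(data * 10) / sd(data)

all.equal(variance_ratio, 100)  # TRUE
all.equal(sd_ratio, 10)         # TRUE
```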
c. Range
The range is the difference between the maximum and
minimum values in the dataset.
Advantages:
- Simple to calculate and understand.
Disadvantages:
- Depends only on the two extreme values, so it is highly sensitive to
outliers and ignores how the data are distributed in between.
Applications:
- Used in quick assessments of variability, such as stock price
analysis.
Example:
# Calculate range
range_value <- range(data)
range_value_diff <- diff(range_value)
range_value_diff # Output: 5
[1] 5
3. Interquartile Range (IQR)
The Interquartile Range (IQR) measures the spread of
the middle 50% of data points. It is calculated as the difference
between the third quartile (Q3, the 75th percentile) and the first
quartile (Q1, the 25th percentile):
\[
\text{IQR} = Q3 - Q1
\]
The IQR is a robust measure of spread, particularly for skewed
distributions, since it is largely unaffected by outliers.
Advantages:
- Resistant to outliers, making it ideal for skewed datasets.
Disadvantages:
- Ignores extreme values, which might be important in some
analyses.
Applications:
- Commonly used in box plots to highlight data dispersion and detect
outliers in fields like data science and economics.
Example:
# Calculate IQR
iqr_value <- IQR(data)
iqr_value # Output: 2
[1] 2
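To show where the IQR comes from and how box plots use it to flag outliers, the sketch below computes the quartiles with quantile() and applies the conventional 1.5 × IQR fences. The 1.5 multiplier is the usual box plot convention and is an assumption here, not something stated above. The variable names are illustrative.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)

# Quartiles (quantile() uses R's default method, type = 7)
q1 <- quantile(data, 0.25, names = FALSE)  # 6.5
q3 <- quantile(data, 0.75, names = FALSE)  # 8.5
iqr_manual <- q3 - q1                      # 2, same as IQR(data)

# Conventional box plot fences: 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr_manual  # 3.5
upper_fence <- q3 + 1.5 * iqr_manual  # 11.5

# Values outside the fences would be flagged as potential outliers
outliers <- data[data < lower_fence | data > upper_fence]
length(outliers)  # 0: no value in this sample lies outside the fences
```

Other quantile definitions (the type argument of quantile()) can give slightly different quartiles for small samples.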
4. Sampling Distribution
A sampling distribution is the probability
distribution of a statistic computed from random samples of a fixed
size drawn from a population. The sampling distribution of the sample
mean, for example, describes how sample means vary when many samples
are drawn from the population.
Advantages:
- Allows for inferences about population parameters.
Disadvantages:
- Approximating it by simulation requires drawing a large number of
samples.
Applications:
- Fundamental in inferential statistics, enabling hypothesis testing
and confidence intervals.
Example:
# Simulate a sampling distribution of the mean
# (data is treated as the population; 1000 samples of size 5, drawn with replacement)
set.seed(123)
sample_means <- replicate(1000, mean(sample(data, size = 5, replace = TRUE)))
# Plot the sampling distribution
hist(sample_means, main = "Sampling Distribution of the Mean", xlab = "Mean")
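The spread of a sampling distribution of the mean is usually summarized by the standard error, sigma / sqrt(sample size). As a rough check, the sketch below repeats the simulation and compares its empirical spread with that theoretical value. It assumes data is treated as the whole population, and the names sample_size, theoretical_se, and empirical_se are illustrative.

```r
data <- c(5, 8, 10, 6, 7, 8, 9)
sample_size <- 5

set.seed(123)
sample_means <- replicate(1000, mean(sample(data, size = sample_size, replace = TRUE)))

# Treating data as the whole population, its standard deviation divides by n
population_sd <- sqrt(mean((data - mean(data))^2))

# Theoretical standard error of the mean: sigma / sqrt(sample size)
theoretical_se <- population_sd / sqrt(sample_size)

# Empirical spread of the simulated sample means
empirical_se <- sd(sample_means)

# The two values should be close, and the sample means should center on mean(data)
c(theoretical = theoretical_se, empirical = empirical_se)
c(population_mean = mean(data), mean_of_sample_means = mean(sample_means))
```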

5. Probability Distributions
A probability distribution shows all possible values of a random
variable and the probabilities associated with those values.
a. Normal Distribution
The Normal Distribution, also known as the Gaussian
distribution, is characterized by a bell-shaped curve. Most values
cluster around the mean, with symmetrical tails extending in both
directions. The probability density function for a normal distribution
is: \[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\] where \(\mu\) is the mean,
and \(\sigma\) is the standard
deviation.
Advantages:
- Many real-world phenomena are approximately normally distributed,
making this distribution widely applicable.
Disadvantages:
- Assumes data are symmetrically distributed, which may not always be
the case.
Applications:
- Used in hypothesis testing, quality control, and in finance for risk
assessment.
Example:
# Generate a sample from a normal distribution
normal_data <- rnorm(1000, mean = 0, sd = 1)
# Plot the normal distribution
hist(normal_data, breaks = 30, probability = TRUE, main = "Normal Distribution")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "red", lwd = 2)
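The density formula can be checked against R's dnorm(), and pnorm() gives the probability of falling within a given number of standard deviations of the mean. This is a minimal sketch: the helper names normal_pdf and within_k_sd are hypothetical, introduced only for illustration.

```r
# Hand-written density following the formula above
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Agrees with the built-in density function
all.equal(normal_pdf(1.5), dnorm(1.5))                                  # TRUE
all.equal(normal_pdf(12, mu = 10, sigma = 2), dnorm(12, mean = 10, sd = 2))  # TRUE

# Probability of falling within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

within_k_sd(1)  # about 0.683
within_k_sd(2)  # about 0.954
within_k_sd(3)  # about 0.997
```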

b. Binomial Distribution
The Binomial Distribution represents the number of
successes in a fixed number of independent trials, each with the same
probability of success. Its probability mass function is: \[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}
\] where \(n\) is the number of
trials, \(k\) is the number of
successes, and \(p\) is the probability
of success.
Advantages:
- Useful for modeling discrete events.
Disadvantages:
- Assumes independent trials and constant probability, which may not
always hold in practice.
Applications:
- Used in quality control, genetics, and risk modeling.
Example:
# Generate a sample from a binomial distribution
binom_data <- rbinom(1000, size = 10, prob = 0.5)
# Plot the binomial distribution (one bin per integer outcome from 0 to 10)
hist(binom_data, breaks = seq(-0.5, 10.5, by = 1), probability = TRUE, main = "Binomial Distribution")
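The probability mass function can also be evaluated exactly rather than estimated from simulated draws. The sketch below writes the formula out by hand and compares it with R's dbinom(). The helper name binom_pmf is hypothetical.

```r
# Hand-written PMF following the formula above
binom_pmf <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

k <- 0:10

# Agrees with the built-in function for every possible number of successes
all.equal(binom_pmf(k, n = 10, p = 0.5), dbinom(k, size = 10, prob = 0.5))  # TRUE

# The probabilities over all outcomes sum to 1
sum(binom_pmf(k, n = 10, p = 0.5))  # 1

# Probability of exactly 5 successes in 10 fair trials
binom_pmf(5, n = 10, p = 0.5)  # 0.2460938
```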

Summary
- Mean, Median, Mode: These are measures of central tendency, showing
the “center” of your data.
- Standard Deviation, Variance, Range, IQR: These are measures of
dispersion, showing how spread out the data is.
- Sampling Distribution: Shows the distribution of a sample statistic
(e.g., the mean) across many samples.
- Probability Distributions: Describe the likelihood of different
outcomes; examples include the normal and binomial distributions.
These concepts are foundational in statistics and data analysis,
helping us summarize and understand data effectively.
