M1 - Descriptive Statistics

In this module, we will cover:

Populations and Samples
Pictorial and Tabular Methods
Measures of Location
Measures of Variability

Objectives

By the end of this module, you will be able to:

Calculate some descriptive statistics (Mean, Median, Percentiles, Variance and Standard Deviation) and construct a histogram given a set of univariate data.
Understand the difference between measures of location and measures of variability.
Construct a scatter plot of bivariate data and a box plot of univariate data.

1.1 Populations, Samples, and Processes

1.2 Pictorial and Tabular Methods in Descriptive Statistics

1.3 Measures of Location

1.4 Measures of Variability

1.5 Excel

Mean

=AVERAGE(select all data points)

Variability

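For each data point, compute its deviation from the mean: x - x.bar
Tip: press F4 to lock (make absolute) the cell reference for x.bar, so it does not shift when the formula is filled down.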

If you add all the differences (the values of x - x.bar) together, the positive and negative values cancel and the total is zero. In Excel the result may display as a number very close to zero because of rounding.

=SUM(x-x.bar)
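The same check can be sketched outside Excel. The Python snippet below is only an illustration: the five data values are hypothetical, and `mean` stands in for =AVERAGE() while `sum` stands in for =SUM().

```python
from statistics import mean

# Hypothetical sample values, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]

x_bar = mean(data)                       # plays the role of =AVERAGE(...)
deviations = [x - x_bar for x in data]   # the "x - x.bar" column
total = sum(deviations)                  # plays the role of =SUM(...) over that column

print(x_bar, deviations, total)
```

With other data the total may come out as a tiny non-zero number, which is floating-point rounding rather than a real deviation from zero.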

Because the positive and negative differences cancel out, we square each difference (x - x.bar) so that every term is non-negative:

=(x - x.bar)^2

Final calculation: =SUM((x - x.bar)^2) / (COUNT(values) - 1)

Ex. =SUM(C2:C29)/(COUNT(C2:C29)-1)
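As a cross-check on the spreadsheet steps, here is a minimal Python sketch of the same calculation. The function name `sample_variance` and the data values are assumptions made for illustration; they are not part of the course material.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n                              # the mean
    squared_devs = [(x - x_bar) ** 2 for x in values]    # the (x - x.bar)^2 column
    return sum(squared_devs) / (n - 1)                   # SUM(...) / (COUNT(...) - 1)


# Hypothetical sample values, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]
s2 = sample_variance(data)
print(s2)
```

Each line mirrors one spreadsheet column or cell, so the intermediate lists can be compared against the worksheet if the numbers disagree.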

OR: =VAR(select all data points)

=VAR.P() for population
=VAR.S() for sample
- =VAR.S() gives the same value as =VAR()

These differ because the sample variance is computed around the sample mean, which is itself an estimate made from the same data. Deviations measured from the sample mean are, on average, slightly too small, so dividing by n - 1 instead of n corrects that bias. For a population, the mean is no longer an estimate; it is known (the true population mean), so there is no bias to correct and you divide by n.
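The difference between the two divisors can be seen in a short Python sketch. It uses `variance` and `pvariance` from Python's standard `statistics` module as stand-ins for =VAR.S() and =VAR.P(); the data values are hypothetical.

```python
from statistics import pvariance, variance

# Hypothetical values, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]
n = len(data)

sample_var = variance(data)        # divides by n - 1, like =VAR.S()
population_var = pvariance(data)   # divides by n, like =VAR.P()

# The two differ only by the factor n / (n - 1).
rescaled = population_var * n / (n - 1)

print(sample_var, population_var, rescaled)
```

The gap between the two shrinks as n grows, which is why the choice matters most for small samples.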

Note: 99% of the time, you will be working with sample statistics.

Standard Deviation

=SQRT(Variance)

OR =STDEV(select all data points)
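A quick way to confirm that the two routes agree is sketched below in Python. Here `stdev` from the standard `statistics` module is used as the sample standard deviation, and the data values are hypothetical.

```python
import math
from statistics import stdev, variance

# Hypothetical sample values, chosen only for illustration.
data = [4.0, 7.0, 9.0, 10.0, 15.0]

s_from_variance = math.sqrt(variance(data))   # like =SQRT(Variance)
s_direct = stdev(data)                        # sample standard deviation in one call

print(s_from_variance, s_direct)
```

Both values should match up to floating-point rounding, and the result is in the same units as the original data, unlike the variance.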

Summary

Fundamental Concepts

Data: Observations (measurements, counts, or categories) collected for study

Population: The entire collection of elements (individuals, objects, or measurements) under study

Sample: A subset of the population selected for analysis.

Variable: A characteristic of an element that can assume different values

Observation (Measurement): The value of a variable for a particular element.

Quantitative Data: Measurements expressed numerically (counts or amounts)

Qualitative (Categorical) Data: Measurements classified by labels or categories

Univariate Data: Data consisting of one variable measured on each element

Bivariate Data: Data consisting of two variables measured on each element

Multivariate Data: Data consisting of more than two variables measured on each element

Graphical Methods (Displaying Data)

Frequency Distribution: A tabular summary of data showing the number (frequency) of observations in each category or interval.

Relative Frequency: The proportion of the total number of observations falling into a class

\text{Relative Frequency} = \frac{\text{Frequency of Class}}{\text{Total Number of Observations}}
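To make the two definitions concrete, here is a small Python sketch that builds a frequency distribution and its relative frequencies. The category labels and observations are hypothetical, and `Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical categorical observations, chosen only for illustration.
observations = ["A", "B", "A", "C", "A", "B", "A", "C"]

frequency = Counter(observations)        # count of observations per class
total = sum(frequency.values())          # total number of observations

# Relative frequency = frequency of class / total number of observations
relative_frequency = {label: count / total for label, count in frequency.items()}

for label in sorted(frequency):
    print(label, frequency[label], relative_frequency[label])
```

The relative frequencies always add up to 1, which is a useful sanity check on any frequency table.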

Histogram: A bar-type graph showing frequencies (or relative frequencies) for classes of quantitative data; adjacent bars touch.
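The counting step behind a histogram can be sketched in a few lines of Python. This is only an illustration: the helper name `histogram_counts`, the equal-width classes of the form [start, start + width), and the data values are all assumptions, not a prescribed method.

```python
def histogram_counts(values, start, width, n_classes):
    """Count how many values fall in each equal-width class.

    Class i covers the interval [start + i*width, start + (i+1)*width).
    Values outside every class are ignored.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # which class the value belongs to
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts


# Hypothetical measurements, chosen only for illustration.
values = [1.2, 3.4, 2.2, 4.8, 0.5, 3.9, 2.7, 1.9]
counts = histogram_counts(values, start=0.0, width=1.0, n_classes=5)
print(counts)
```

Each count becomes the height of one bar; dividing every count by the number of observations gives the relative-frequency version of the same histogram.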

Stem-and-Leaf Display: A table where each observation is split into a “stem” (leading digits) and a “leaf” (final digit), showing the distribution while preserving data values.
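The stem/leaf split can be illustrated with a short Python sketch. It assumes non-negative whole-number data where the last digit is the leaf and the remaining digits are the stem; the function name `stem_and_leaf` and the scores are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group each value's final digit (leaf) under its leading digits (stem)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 58 -> stem 5, leaf 8
        groups[stem].append(leaf)
    return dict(groups)


# Hypothetical two-digit scores, chosen only for illustration.
scores = [52, 47, 61, 58, 45, 63, 59]
display = stem_and_leaf(scores)

for stem in sorted(display):
    leaves = "".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Because every leaf is kept, the original values can be read back from the display, which is what distinguishes it from a histogram.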

Dotplot: A simple display placing a dot above a number line for each data value, stacking dots for repeated values.

Boxplot (or Box-and-Whisker Plot): A graphical summary based on the five-number summary (min, Q1, median, Q3, max); shows spread, skewness, and outliers.
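The five-number summary behind a boxplot can be sketched in Python as follows. Quartile conventions differ between textbooks and software; this sketch uses the default ("exclusive") method of `statistics.quantiles`, which may not match the course's definition or Excel's output exactly. The function name and data values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using statistics.quantiles defaults."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4)   # three cut points for quartiles
    return (ordered[0], q1, median, q3, ordered[-1])


# Hypothetical measurements, chosen only for illustration.
data = [2, 4, 5, 7, 8, 10, 12, 15, 20]
summary = five_number_summary(data)
print(summary)
```

The box spans Q1 to Q3, the line inside it marks the median, and the distance Q3 - Q1 is the spread that outlier rules are usually based on.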

Scatterplot: A plot of paired bivariate data points on an xy-plane, useful for studying relationships between two variables.

