Monday, August 6, 2018

Statistics & Probability - Unit 3 - Summarizing Quantitative Data

Quantitative data can be summarized by finding the central tendency of the data, i.e. a single value that represents a typical data point.
There are various measures of central tendency. The common ones are:

Mean: The arithmetic mean of the data, i.e. the sum of the values divided by the number of values. It is sensitive to outliers, so a single very large or very small value can pull the mean toward it.

Median: The central value of the dataset when it is sorted from least to greatest. If there is an even number of values, it is the arithmetic mean of the two central values. The median is not affected much by outliers.

Mode: The mode is the value that occurs most often in the data set. If no value repeats, then there is no mode.

If an extreme value is removed, it usually changes the mean more than the median.
If an extreme value is changed but stays on the same side of the median, the median does not change, but the mean does.
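
To see the effect of an outlier on each measure, here is a minimal Python sketch using the standard library `statistics` module. The data values and the `modes` helper are made up for illustration; the helper assumes the convention used in this post, where data with no repeated value has no mode.

```python
from statistics import mean, median, multimode

def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    if len(set(values)) == len(values):
        return []  # every value occurs once: treated here as "no mode"
    return multimode(values)

data = [2, 3, 3, 5, 7]          # illustrative values
with_outlier = data + [100]     # same data plus one extreme value

print(mean(data), median(data), modes(data))        # 4 3 [3]
print(mean(with_outlier), median(with_outlier))     # 20 4.0
```

Adding the single value 100 moves the mean from 4 to 20, while the median only moves from 3 to 4.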

Next we look at some measures of the spread of the data, i.e. how far the data points are from each other and from the center.

Range: Range is the difference between the largest and the smallest value.

Interquartile Range:
IQR is the difference between the 3rd quartile and the 1st quartile. The median splits the sorted data into two halves: half of the values lie at or below the median and half at or above it.
The first quartile (Q1) is the median of the first half of the data.
The third quartile (Q3) is the median of the second half of the data.
So the data, together with its quartiles, can be shown as a 5 number summary:

Min   Q1    Median Q3  Max

These numbers divide the data into 4 quarters, each holding about a quarter of the values. The data between min and Q1 is the first quarter, between Q1 and the median is the 2nd quarter, between the median and Q3 is the third quarter, and between Q3 and max is the 4th quarter.

IQR is the difference between Q3 and Q1.
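
The 5 number summary can be sketched in Python as below. The function name and the data are made up for illustration, and it assumes one common convention: when the number of values is odd, the median itself is left out of both halves. Other conventions (and library functions) can give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least 2 values; for an odd count the middle value is
    excluded from both halves (an assumption, conventions differ).
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]     # first half of the sorted data
    upper = s[-half:]    # second half of the sorted data
    return s[0], median(lower), median(s), median(upper), s[-1]

data = [1, 3, 4, 6, 7, 8, 10, 12]   # illustrative values
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1
print(lo, q1, med, q3, hi)   # 1 3.5 6.5 9.0 12
print(iqr)                   # 5.5
```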

1.5 IQR Rule to find outliers:
This rule is used to identify outliers in the data. Values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers.

Outliers < Q1 - 1.5 IQR
Outliers > Q3 + 1.5 IQR
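
A minimal Python sketch of the rule follows. The function name and data are hypothetical, and the quartiles are computed with the median-of-halves method, leaving the middle value out of both halves when the count is odd (an assumption, since other quartile conventions exist).

```python
from statistics import median

def iqr_outliers(values):
    """Return the values flagged by the 1.5 IQR rule (needs >= 2 values)."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])    # median of the first half
    q3 = median(s[-half:])   # median of the second half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in s if v < low_fence or v > high_fence]

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]   # illustrative values
print(iqr_outliers(data))                # [40]
```

Here Q1 = 3.5 and Q3 = 11, so IQR = 7.5 and the fences are -7.75 and 22.25; only 40 falls outside them.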

Variance of a population:

Variance is defined as the mean of the squared distances of each point from the mean.

If there are 4 values v1, v2 ,v3, v4, then the mean u (greek mu) is

u = (v1 + v2 + v3 + v4) / 4, where 4 is the number of values

The variance of the population is

((v1 - u)^2 + (v2 - u)^2 + (v3 - u)^2 + (v4 - u)^2) / 4
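
As a worked example, the same calculation can be sketched in Python. The four values and the function name are made up for illustration; the result is compared against `statistics.pvariance` from the standard library.

```python
from statistics import pvariance

def population_variance(values):
    """Mean of the squared distances of each value from the mean."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

values = [2, 4, 4, 6]   # illustrative v1..v4, mean is 4
print(population_variance(values))   # 2.0
print(pvariance(values))             # 2
```

The squared distances are 4, 0, 0 and 4, so the variance is 8 / 4 = 2.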

If we don't have a way to collect all the data of the population, we work with a small representative sample from the population.

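If we apply the same formula to a sample, using the sample mean in place of the population mean, it tends to underestimate the population variance. Intuitively this makes sense: the sample mean sits at the center of the sample points, so they are closer to it than to the true population mean, which usually differs from the sample mean. Dividing by the sample size therefore gives what is called a biased estimate of the variance.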

The unbiased estimate of the variance from a sample divides by one less than the sample size:

(sum of the squared differences from the sample mean) / (n - 1), where n is the sample size
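
The bias can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions: a made-up population of three values, and samples of size 2 drawn with replacement, so that every ordered pair is equally likely. It averages both estimates over all possible samples using exact fractions.

```python
from fractions import Fraction
from itertools import product

def sum_squared_deviations(values):
    """Sum of squared distances of each value from the mean of `values`."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values)

def average(xs):
    return sum(xs, Fraction(0)) / len(xs)

population = [1, 2, 3]   # toy population, made up for illustration
population_variance = sum_squared_deviations(population) / len(population)

n = 2                                            # sample size
samples = list(product(population, repeat=n))    # all 9 ordered samples

# Average of each estimate over every possible sample
biased = average([sum_squared_deviations(s) / n for s in samples])
unbiased = average([sum_squared_deviations(s) / (n - 1) for s in samples])

print(population_variance, biased, unbiased)   # 2/3 1/3 2/3
```

On average, dividing by n gives 1/3, which is below the true variance of 2/3, while dividing by n - 1 recovers 2/3.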

Standard Deviation
Standard deviation (Greek sigma) is the square root of the variance (sigma squared), so it is expressed in the same units as the data.
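
A short Python sketch of the relationship, using the standard library `statistics` module on made-up values: `pstdev` and `pvariance` divide by n (population), while `stdev` and `variance` divide by n - 1 (sample).

```python
import math
from statistics import pstdev, pvariance, stdev, variance

values = [2, 4, 4, 6]   # illustrative values

sigma = pstdev(values)  # population standard deviation
s = stdev(values)       # sample standard deviation

# Each standard deviation is the square root of the matching variance
print(math.isclose(sigma, math.sqrt(pvariance(values))))   # True
print(math.isclose(s, math.sqrt(variance(values))))        # True
```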

Box and whisker plots display the 5 number summary of the data: the box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the min and the max.

Another way of measuring spread is the mean absolute deviation (MAD).
It is the mean of the absolute distances from the mean, instead of the mean of the squared distances.
The MAD for the above data with v1, v2, v3, v4 as values and u as the mean is

(|v1 - u| + |v2 - u| + |v3 - u| + |v4 - u|) / 4
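
The MAD calculation can be sketched in Python as follows. The function name and the four values are made up for illustration.

```python
def mean_absolute_deviation(values):
    """Mean of the absolute distances of each value from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(v - mu) for v in values) / len(values)

values = [2, 4, 4, 6]   # illustrative v1..v4, mean is 4
print(mean_absolute_deviation(values))   # 1.0
```

The absolute distances are 2, 0, 0 and 2, so the MAD is 4 / 4 = 1.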
