Monday, August 13, 2018

Statistics & Probability - Part 4 - Modelling data distributions



Modelling Data Distributions



Calculating Percentiles:


For any distribution we can calculate percentiles to express the percentage of data below a point.

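For example, for a dot plot with 14 points, let's calculate the percentile of the point whose value is 6.

Total number of points: 14
Number of points below 6: 7

Data that is below 6 = 7/14 = 50 percent
Data that is below or equal to 6 = 8/14 ≈ 57 percent

Depending on which definition is used, the percentile of 6 is therefore either 50 or about 57.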
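
The two definitions can be checked with a few lines of Python. This is only a sketch: the original dot plot is not available, so the `data` list below is a hypothetical data set chosen to have 14 points, 7 of them below 6 and one equal to 6. The function names are also made up for illustration.

```python
# Hypothetical data: 14 points, 7 below 6 and exactly one equal to 6.
data = [1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 8, 9, 10]


def percent_below(values, x):
    """Percentage of values strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)


def percent_at_or_below(values, x):
    """Percentage of values less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


print(percent_below(data, 6))                  # 50.0
print(round(percent_at_or_below(data, 6), 1))  # 57.1
```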






Cumulative Relative Frequency Graphs


If we have a cumulative relative frequency graph, with the cumulative percentage on the y-axis and the data values on the x-axis,
we can use it to estimate the percentile of any value and to read off the median (the 50th percentile) and the IQR (the 75th percentile minus the 25th percentile).
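
Reading values off such a graph amounts to interpolating between the plotted points. The sketch below assumes a handful of hypothetical points taken from a graph (they are not from any real data set) and uses straight-line interpolation between them, which is only an estimate.

```python
# Hypothetical points read off a cumulative relative frequency graph:
# (data value, cumulative fraction of the data at or below that value).
points = [(0, 0.0), (10, 0.10), (20, 0.25), (30, 0.50), (40, 0.75), (50, 1.0)]


def value_at(cum_fraction, pts):
    """Estimate the data value at a cumulative fraction by linear interpolation."""
    for (x0, p0), (x1, p1) in zip(pts, pts[1:]):
        if p0 <= cum_fraction <= p1:
            if p1 == p0:
                return float(x0)
            return x0 + (x1 - x0) * ((cum_fraction - p0) / (p1 - p0))
    raise ValueError("cumulative fraction must be between 0 and 1")


q1 = value_at(0.25, points)      # 25th percentile
median = value_at(0.50, points)  # 50th percentile
q3 = value_at(0.75, points)      # 75th percentile
iqr = q3 - q1

print(median, iqr)  # 30.0 20.0
```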


Z-scores

The z-score of a data point is the number of standard deviations it lies away from the mean: z = (x - mean) / standard deviation. If the point is below the mean, the z-score is negative, and if it is above the mean, the z-score is positive.

Because z-scores are unit-free, we can use them to compare values from distributions with different means and standard deviations.
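
As a small illustration, here is a sketch that compares two exam results. The scores, means and standard deviations are hypothetical numbers picked to keep the arithmetic easy.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Hypothetical: 85 on test A (mean 75, sd 5) versus 90 on test B (mean 80, sd 10).
z_a = z_score(85, 75, 5)
z_b = z_score(90, 80, 10)

print(z_a, z_b)  # 2.0 1.0
```

Although 90 is the higher raw score, the 85 is further above its own mean once spread is taken into account.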

Effect of Linear Transformation on data 

When a constant is added to every data point (the data is shifted), the mean and median both shift by that constant.
Measures of spread such as the standard deviation and IQR don't change.


When every data point is multiplied by a constant (the data is scaled), the mean and median are scaled by that constant, and the standard deviation and IQR are scaled by its absolute value.
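
These effects are easy to verify numerically. The sketch below uses Python's standard `statistics` module on a small hypothetical data set. The `summary` helper is made up for this example, and the quartiles come from `statistics.quantiles` with its default method, which is one of several common conventions.

```python
import statistics


def summary(values):
    """Mean, median, population standard deviation and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.pstdev(values),
        "iqr": q3 - q1,
    }


data = [2, 4, 4, 5, 7, 9, 11]      # hypothetical data
shifted = [x + 10 for x in data]   # add a constant
scaled = [x * 3 for x in data]     # multiply by a constant

for name, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(name, summary(values))
```

The shifted row should show the mean and median moved up by 10 with the same spread, and the scaled row should show all four statistics multiplied by 3.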

Density Curves

A density curve is an idealized representation of a distribution in which the total area under the curve is 1.
We can estimate the mean, median and skew from a density curve.
For a symmetric density curve, the mean and median are both exactly at the center.
For a non-symmetric density curve, the median is the point that splits the area under the curve into two equal halves, while the mean is the balancing point, where the curve would balance on a fulcrum.

For a left-skewed distribution, the mean is typically to the left of the median, and the opposite holds for a right-skewed distribution.

When a density curve is made of straight line segments, we can calculate the area under part of it by breaking the region into rectangles and triangles.
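
Here is a sketch of that rectangle-plus-triangle calculation. It assumes a hypothetical density curve that is a single straight line from (0, 0) to (4, 0.5), so the total area is 1. The helper only works for intervals inside [0, 4] on this particular curve.

```python
def density(x):
    """Hypothetical density: a straight line from (0, 0) to (4, 0.5), zero elsewhere."""
    return 0.125 * x if 0 <= x <= 4 else 0.0


def area_between(a, b):
    """Area under the line between a and b (0 <= a <= b <= 4),
    computed as a rectangle plus a triangle stacked on top of it."""
    low, high = sorted((density(a), density(b)))
    width = b - a
    rectangle = width * low
    triangle = 0.5 * width * (high - low)
    return rectangle + triangle


print(area_between(0, 4))  # 1.0  (the whole curve)
print(area_between(0, 2))  # 0.25 (25% of the data is below 2)
print(area_between(2, 4))  # 0.75
```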

Normal Distribution

The normal distribution is a function that models the distribution of many random variables as a symmetrical, bell-shaped curve.
In a normal distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations. This is called the empirical rule for normal distributions, and it can be used to estimate the proportion of data above or below a certain value.

For other values, we first find the z-score and then use a z-table to look up the percentage of data below that z-score in a normal distribution. From that we can find the proportion of values lower than, higher than, or between two values.
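
Instead of a printed table, the same lookup can be sketched with `statistics.NormalDist` from Python's standard library (Python 3.8+), whose `cdf` method returns the proportion of data below a value. The mean of 170 and standard deviation of 10 are hypothetical numbers used only for illustration.

```python
from statistics import NormalDist

# Hypothetical: heights that are normally distributed with mean 170 and sd 10.
heights = NormalDist(mu=170, sigma=10)
standard = NormalDist()  # standard normal: mean 0, sd 1 (what a z-table describes)


def z_score(x, dist):
    """Number of standard deviations x lies from the mean of dist."""
    return (x - dist.mean) / dist.stdev


z = z_score(185, heights)                   # 1.5
below_185 = standard.cdf(z)                 # proportion below 185, about 0.933
above_185 = 1 - below_185                   # proportion above 185, about 0.067
within_1_sd = heights.cdf(180) - heights.cdf(160)  # about 0.68, the empirical rule

print(round(below_185, 4), round(above_185, 4), round(within_1_sd, 4))
```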

We can also do a reverse lookup: find the z-score that corresponds to a given percentile in the table, then convert it back to a data value to estimate the point below (or above) which that percentage of the data lies.
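
The reverse lookup can be sketched the same way with `NormalDist.inv_cdf`, which returns the value below which a given proportion of the data lies. The test scores with mean 500 and standard deviation 100 are hypothetical.

```python
from statistics import NormalDist

# z-score below which 90% of a standard normal distribution lies (about 1.28).
z_90 = NormalDist().inv_cdf(0.90)

# Hypothetical test scores: mean 500, sd 100.
mean, std_dev = 500, 100

# Convert the z-score back to a data value: x = mean + z * sd.
cutoff_90 = mean + z_90 * std_dev

print(round(z_90, 2), round(cutoff_90))  # 1.28 628
```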



