Monday, July 30, 2018

Statistics & Probability - Unit 2 - Describing Quantitative data

Today's post is about displaying and describing quantitative data.

There are various ways to display quantitative data.

Some of them are:

- frequency tables : A frequency table lists each value of the variable next to the number of times that value occurs

e.g.
The following frequency table shows the number of hours of sleep that each of the staff members at Tia's Toy Store got on Thanksgiving night.
| Number of hours of sleep | Number of employees |
|---|---|
| 3 | 1 |
| 4 | 0 |
| 5 | 4 |
| 6 | 2 |
| 7 | 1 |
| 8 | 1 |
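
As a minimal sketch, the same frequency table can be rebuilt from raw data with Python's `collections.Counter`. The list `hours_of_sleep` is not in the original exercise; it is expanded from the table (one entry per employee).

```python
from collections import Counter

# One entry per employee, expanded from the frequency table
# (an assumption: the raw data is not given in the exercise).
hours_of_sleep = [3, 5, 5, 5, 5, 6, 6, 7, 8]

counts = Counter(hours_of_sleep)

# Include values with zero frequency (such as 4 hours) so the table has no gaps.
frequency_table = {
    hours: counts.get(hours, 0)
    for hours in range(min(hours_of_sleep), max(hours_of_sleep) + 1)
}

for hours, employees in frequency_table.items():
    print(f"{hours} hours of sleep: {employees} employee(s)")
```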

- dot plots : A dot plot shows the frequency of each value of a variable as a stack of dots above that value
e.g.
The following dot plot shows the daily high temperature in Kats, Colorado in April. Each dot represents a different day.

[Dot plot: horizontal axis is Temperature ($^\circ\text{C}$), marked from 0 to 20 in steps of 2.]

- histogram
A histogram counts the number of values that fall in each range (bin) and then graphs those counts as a bar chart
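
The counting step behind a histogram can be sketched as follows. The helper `histogram_counts`, the bin width and the temperature values are all made up for illustration; they do not come from the course.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each bin [lo, lo + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        lo = start + index * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    # Bins that contain no values are simply left out in this sketch.
    return dict(sorted(counts.items()))

# Hypothetical daily high temperatures in degrees Celsius.
temperatures = [3, 7, 8, 11, 12, 12, 14, 15, 18, 19]
width = 5
bins = histogram_counts(temperatures, bin_width=width)

# Print a text version of the bar chart, one row per bin.
for lo, count in bins.items():
    print(f"{lo:>2} to {lo + width:<2} | {'#' * count}")
```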

- stem and leaf plots : A stem and leaf plot uses a table to display data. The stem on the left side displays the first digit or digits of each value. The leaf on the right displays the last digit.
For example, 543 and 548 share the stem 54, so they can be displayed together on a stem and leaf plot as 54 | 3,8.
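
A small sketch of how the stems and leaves are split, assuming non-negative whole numbers. The function name `stem_and_leaf` and the extra values 551 and 537 are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# 543 and 548 are from the example; 551 and 537 are added for illustration.
plot = stem_and_leaf([543, 548, 551, 537])
for stem in sorted(plot):
    print(f"{stem} | {','.join(str(leaf) for leaf in plot[stem])}")
```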

- Box and whisker plot
A box and whisker plot divides the data into four quarters using the quartiles. The center box covers the 2nd and 3rd quarters of the data (the middle 50%), from the first quartile to the third quartile. The left and right lines (the whiskers) cover the 1st and 4th quarters, reaching out to the minimum and maximum values. The line inside the box sits exactly at the median of the distribution and separates the 2nd and 3rd quarters.
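
The five values a box and whisker plot is drawn from (minimum, first quartile, median, third quartile, maximum) can be computed directly. This is only a sketch: quartile conventions differ between textbooks and libraries, and the version below assumes the "median of each half" method, leaving the middle value out of both halves when the count is odd. It needs at least two values. The data is the hours of sleep from the frequency table example, expanded to one entry per employee.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above the median position
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

summary = five_number_summary([3, 5, 5, 5, 5, 6, 6, 7, 8])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The box would run from `q1` to `q3`, and the whiskers from `min` to `q1` and from `q3` to `max`.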

Next we will talk about shapes of distributions

Looking at the distribution we can determine the following

- whether the distribution is symmetrical (same shape on both sides of the center)
- whether the distribution is skewed to the left or to the right. It is skewed to the left if most of the data is on the right with a long tail stretching to the left, and vice versa.
- what the spread, or range (the difference between the max and min values), is
- roughly where the median lies.
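
These checks can also be roughed out numerically. The sketch below computes the range and the median, and uses a common rule of thumb (mean above the median hints at a right skew, mean below hints at a left skew). That rule is a heuristic added here, not something from the course, and it can mislead on unusual data. The function name `describe_shape` is hypothetical.

```python
from statistics import mean, median

def describe_shape(values):
    """Summarize spread, center and a rough skew hint for a list of numbers."""
    avg, mid = mean(values), median(values)
    if avg > mid:
        hint = "possibly skewed right"
    elif avg < mid:
        hint = "possibly skewed left"
    else:
        hint = "roughly symmetric"
    return {"range": max(values) - min(values), "median": mid, "hint": hint}

# Hours of sleep from the frequency table example, one entry per employee.
shape = describe_shape([3, 5, 5, 5, 5, 6, 6, 7, 8])
print(shape)
```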

Monday, July 23, 2018

Statistics & Probability - Unit 1 - Analyzing Categorical Data

I want to work on the emerging Artificial Intelligence field. I am excited about the possibilities of automating the grunt work in our daily lives and that is what drives me to learn this.

One of the important subjects to learn and master is statistics and probability. For that, I'm learning Statistics and Probability from Khan Academy. It is an excellent course and you should go do it. This is just a summary of concepts for quick reference.

Analyzing categorical data:

Individuals, variables, categorical and quantitative variables.

Consider the following dataset.

Alek is taking an inventory of styles of compression bandages for work. Here is the data he has collected.
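| Style ID | Width (inches) | Total length (yards) | Color |
|---|---|---|---|
| 001 | 1 | 20 | tan |
| 002 | 1 | 20 | brown |
| 003 | 1 | 10 | red |
| 004 | 1 | 15 | blue |
| 005 | 2 | 35 | tan |
| 006 | 2 | 20 | brown |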

In this dataset, Alek is taking inventory of styles, so the styles are the individuals. For each style, the variables are width, total length and color. Of these, color is a categorical variable because its values are categories (tan, brown, red, blue). Width and total length are quantitative, or numerical, variables because their values are measured numbers.
So in the dataset, styles are the individuals. There are three variables, of which one is categorical and the other two are numerical.

Bar graphs can be used to display a numeric value for each category, with the height of each bar showing that value.

They can also be used to display the frequency of each category.

[Pictograph: number of sheep for Diane, Girish and Willy; each symbol $= 5 \text{ sheep}$.]

This is a pictograph showing the number of sheep for various people. The same data can be shown in a bar graph, with the height of each bar showing the number of sheep.

Two way tables and Venn diagrams

If we have data on two categorical variables, it can be shown in a two-way frequency table or a Venn diagram.

| Preference | Male | Female |
|---|---|---|
| Prefers dogs | 36 | 22 |
| Prefers cats | 8 | 26 |
| No preference | 2 | 6 |

This can be shown in a Venn diagram also.

Marginal and conditional distributions

| Weather condition | On time | Delayed | Total |
|---|---|---|---|
| Sunny | 167 | 3 | 170 |
| Cloudy | 115 | 5 | 120 |
| Rainy | 40 | 15 | 55 |
| Snowy | 8 | 12 | 20 |
| Total | 330 | 35 | 365 |

The last row and the last column are called marginal distributions because they are written in the margins of the table. The last column (Total) is the marginal distribution of weather conditions: how many train runs happened under each condition.

The last row (Total) is the marginal distribution of trains being on time or delayed.

The On time column is a conditional distribution: the distribution of weather conditions given that the train was on time.
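
As a quick sketch, the marginal and conditional distributions can be computed from the counts in the train table. The variable names are my own; the numbers come straight from the table.

```python
# Counts from the train table: weather condition -> (on time, delayed)
counts = {
    "Sunny": (167, 3),
    "Cloudy": (115, 5),
    "Rainy": (40, 15),
    "Snowy": (8, 12),
}

on_time_total = sum(on for on, _ in counts.values())
delayed_total = sum(late for _, late in counts.values())
grand_total = on_time_total + delayed_total

# Marginal distribution of weather: the Total column as proportions.
marginal_weather = {w: (on + late) / grand_total for w, (on, late) in counts.items()}

# Marginal distribution of status: the Total row as proportions.
marginal_status = {
    "On time": on_time_total / grand_total,
    "Delayed": delayed_total / grand_total,
}

# Conditional distribution of weather, given that the train was on time.
weather_given_on_time = {w: on / on_time_total for w, (on, _) in counts.items()}

for weather, share in weather_given_on_time.items():
    print(f"{weather}: {share:.1%} of on-time trains")
```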