Monday, July 23, 2018

Statistics & Probability - Unit 1 - Analyzing Categorical Data

I want to work in the emerging field of Artificial Intelligence. I am excited about the possibilities of automating the grunt work in our daily lives, and that is what drives me to learn this.

One of the important subjects to learn and master is statistics and probability, so I'm working through the Statistics and Probability course from Khan Academy. It is an excellent course and you should go do it. This post is just a summary of the concepts for quick reference.

Analyzing categorical data:

Individuals, variables, categorical and quantitative variables.

Consider the following dataset.

Alek is taking an inventory of styles of compression bandages for work. Here is the data he has collected.
Style ID    Width (inches)    Total length (yards)    Color

In this dataset, Alek is taking inventory of styles, so the styles are the individuals. The style ID only identifies each individual. For each style, the variables are width, total length, and color. Color is a categorical variable because its values are category labels rather than numbers. Width and total length are quantitative (numerical) variables because their values are measured numbers.
So the dataset has styles as its individuals and three variables, of which one is categorical and the other two are quantitative.
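The distinction can be sketched in a few lines of Python. This is only an illustration: the rows, the field names (`style_id`, `width_in`, `length_yd`, `color`), and the helper `variable_kind` are all made up here, since the original table values are not reproduced in this post. The rule used (numbers are quantitative, everything else is categorical) is a simplification that happens to work for this dataset.

```python
# Hypothetical inventory rows: the values are invented for illustration only.
inventory = [
    {"style_id": "A", "width_in": 2.0, "length_yd": 10.5, "color": "beige"},
    {"style_id": "B", "width_in": 3.0, "length_yd": 8.0, "color": "white"},
    {"style_id": "C", "width_in": 4.0, "length_yd": 12.0, "color": "beige"},
]


def variable_kind(values):
    """Label a variable quantitative if every value is a number, else categorical."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_number else "categorical"


# Each row is one individual (a style); style_id only identifies it.
variables = [name for name in inventory[0] if name != "style_id"]
kinds = {name: variable_kind([row[name] for row in inventory]) for name in variables}

print(len(inventory), "individuals")
for name, kind in kinds.items():
    print(name, "->", kind)
```

Note that a number is not always quantitative: a numeric style ID or a zip code is still a label, so the check above is a rough guide rather than a rule.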

Reading bar graphs, pictographs

Bar graphs can display a numeric value for each category, with the height of each bar representing that value.

They can also be used to display the frequency of each category.
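As a rough sketch of the frequency case, the snippet below counts how often each category occurs and prints a text bar per category. The list of colors is invented for the example.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (colors are made up).
colors = ["beige", "white", "beige", "blue", "beige", "white"]

# Counter maps each category to the number of times it appears.
frequencies = Counter(colors)

# Draw one '#' per observation, so bar length equals the category's frequency.
for category, count in frequencies.most_common():
    print(f"{category:<6} {'#' * count} ({count})")
```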

[Pictograph: number of sheep for Diane, Girish, and Willy; each symbol = 5 sheep]

The pictograph shows the number of sheep for each person. The same data can be shown in a bar graph, with the height of each bar giving the number of sheep.
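Reading a pictograph comes down to multiplying the number of symbols by the key. Here is a small sketch using the key from the pictograph (one symbol = 5 sheep); the symbol counts per person are hypothetical, because the picture itself is not reproduced here.

```python
SHEEP_PER_SYMBOL = 5  # key from the pictograph: one symbol = 5 sheep

# Hypothetical symbol counts per person; the real counts are not shown in this post.
symbols = {"Diane": 3, "Girish": 1, "Willy": 4}

# Value for each person = number of symbols * value of one symbol.
sheep = {person: n * SHEEP_PER_SYMBOL for person, n in symbols.items()}

# The same data as a text bar graph: bar length equals the number of sheep.
for person, total in sheep.items():
    print(f"{person:<7} {'#' * total} {total}")
```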

Two-way tables and Venn diagrams

When the data involve two categorical variables, the counts can be shown in a two-way frequency table or a Venn diagram.

Prefers dogs      36    22
Prefers cats       8    26
No preference      2     6

This can be shown in a Venn diagram also.
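A two-way table can be held as a nested dictionary, which makes row and column totals easy to compute. The sketch below assumes the rows above split into two cells each (36 and 22, 8 and 26, 2 and 6). The column names `group_1` and `group_2` are placeholders, since the column headers are not shown in this post.

```python
# Counts from the table above, assuming each row splits into two cells.
# Column names are hypothetical placeholders for the missing headers.
table = {
    "Prefers dogs": {"group_1": 36, "group_2": 22},
    "Prefers cats": {"group_1": 8, "group_2": 26},
    "No preference": {"group_1": 2, "group_2": 6},
}
columns = ["group_1", "group_2"]

# Totals across each row, down each column, and over the whole table.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
column_totals = {col: sum(table[row][col] for row in table) for col in columns}
grand_total = sum(row_totals.values())

for row, cells in table.items():
    print(f"{row:<14}", *(f"{cells[c]:>4}" for c in columns), f"{row_totals[row]:>5}")
print(f"{'Total':<14}", *(f"{column_totals[c]:>4}" for c in columns), f"{grand_total:>5}")
```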

Marginal and conditional distributions

Weather condition    On time    Delayed    Total

The last row and the last column are called marginal distributions because they are written in the margins of the table. The last column (Total) is the marginal distribution of weather conditions: the number of trains that ran under each condition.

The last row is the marginal distribution of punctuality: the total number of trains that were on time or delayed.

The On time column on its own is a conditional distribution: the distribution of weather conditions among the on-time trains.
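To make the difference concrete, here is a minimal sketch that computes both kinds of distribution as proportions. The weather labels and all counts are hypothetical, since the table's values are not reproduced in this post. A marginal distribution divides a row or column total by the grand total. A conditional distribution divides the cells of one column by that column's total.

```python
# Hypothetical counts of trains by weather condition and punctuality.
counts = {
    "Sunny": {"On time": 40, "Delayed": 10},
    "Rainy": {"On time": 20, "Delayed": 20},
    "Snowy": {"On time": 4, "Delayed": 6},
}
statuses = ["On time", "Delayed"]

row_totals = {w: sum(c.values()) for w, c in counts.items()}
col_totals = {s: sum(counts[w][s] for w in counts) for s in statuses}
grand_total = sum(row_totals.values())

# Marginal distributions: totals in the margins divided by the grand total.
marginal_weather = {w: t / grand_total for w, t in row_totals.items()}
marginal_status = {s: t / grand_total for s, t in col_totals.items()}

# Conditional distribution of weather given that the train was on time:
# each "On time" cell divided by the "On time" column total.
weather_given_on_time = {
    w: counts[w]["On time"] / col_totals["On time"] for w in counts
}

print("Marginal (weather):", marginal_weather)
print("Marginal (status):", marginal_status)
print("Weather | on time:", weather_given_on_time)
```

Each of the three distributions sums to 1, which is a quick sanity check when working these out by hand.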
