Distributions in Statistics
The normal distribution is a symmetric, bell-shaped distribution. It is the most important distribution in statistics, for four reasons:
Many of the dependent variables we work with are expected to be normally distributed in the population.
If a variable is (approximately) normally distributed, we are able to make claims about the values of that variable (it is often a prerequisite to do analyses).
When an infinite number of samples is drawn from a population, the distribution of the sample means tends towards a normal distribution.
Most statistical programs assume that the observations are normally distributed.
(The normal distribution works with so-called z-scores. To discuss the normal distribution, we therefore first explain what z-scores are and how they are used.)
Standard scores or z-scores
Results from tests or surveys can take thousands of possible values, in many different units, and a raw result on its own can seem meaningless. For example, knowing that someone’s weight is 90 kilos might be good information, but if you want to compare it to the “average” person’s weight, looking at a vast table of data can be overwhelming (especially if some weights are recorded in pounds). A z-score tells you where that person’s weight lies relative to the population’s mean weight.
Simply put, a z-score is the number of standard deviations a data point lies from the mean. More technically, it is a measure of how many standard deviations below or above the population mean a raw score is. Z-scores are a way to compare results from a test to a “normal” population.
The basic z-score formula, using the population mean μ and standard deviation σ, is:
\[z = \frac{(x - \mu)}{\sigma}\]
For example, let’s say you have a test score of 220. The test has a mean (μ) of 160 and a standard deviation (σ) of 25. Assuming a normal distribution, your z-score would be:
\[z=\frac{(x - \mu)}{\sigma}=\frac{220-160}{25}=2.4\]
The z-score tells you how many standard deviations from the mean your score lies. In this example, your score is 2.4 standard deviations above the mean.
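As a quick illustration, the calculation above can be reproduced in a few lines of Python; the score, mean and standard deviation are simply the example values from this section:

```python
# z-score: how many standard deviations a raw score lies from the mean
def z_score(x, mu, sigma):
    return (x - mu) / sigma

print(z_score(220, 160, 25))  # 2.4, the same value as in the example
```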
The z-score describes the exact position of an X-value in two ways:
Via the sign: the mean μ always lies in the middle of the distribution, and at the mean the z-score is zero. Z-scores to the right of the mean have a positive sign; z-scores to the left of the mean have a minus sign.
Via the value: the value of the z-score gives the distance between the X-value and the mean in numbers of standard deviations (a z-score of 1.00 means that the X-value lies exactly 1 standard deviation from the mean).
The z-score and a normal distribution
The standard normal distribution has a mean of 0 and a standard deviation of 1; the distribution is thus N(0,1). The normal distribution is symmetric: the highest frequency is in the middle, and the frequencies decrease towards the left and the right of the distribution. Z-scores for normal distributions are expressed in standard deviations. A z-score of +2 means that the score lies two standard deviations above the mean. For a normal distribution, the following rules apply:
± 68% of the observations fall within 1 standard deviation of the mean
± 95% of the observations fall within 2 standard deviations of the mean
± 99.7% of the observations fall within 3 standard deviations of the mean
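These percentages can be verified with the cumulative standard normal distribution, for example with scipy; a minimal sketch, assuming scipy is installed:

```python
from scipy.stats import norm

# proportion of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    prop = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {prop:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```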
Central limit theorem
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. The approximation improves as the sample size n grows; a common rule of thumb is that n should be at least 30.
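A small simulation makes the theorem concrete. The sketch below draws many samples from a clearly non-normal (exponential) population and shows that the sample means cluster around the population mean with a spread close to σ/√n; the sample size and number of samples are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 30, 10_000                 # sample size and number of samples (illustrative)

# an exponential(1) population is skewed, not normal; its mean and SD are both 1.0
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print(sample_means.mean())        # close to the population mean (1.0)
print(sample_means.std(ddof=1))   # close to sigma / sqrt(n) ≈ 0.183
```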
Imagine: a distribution of intelligence has a μ of 100 and a σ of 15. What is the chance that, by means of random sampling, an individual is selected with an IQ below 130?
- To answer this question, the IQ-score (X-value) first has to be converted to a z-score. Next, the corresponding proportion has to be looked up; that proportion is the probability we are after.
- Here, the z-score is +2. This score is computed as follows: (130-100)/15 = 2. According to table A, the corresponding proportion is 0.9772.
- Thus: P(X<130) = 0.9772, meaning that there is a 97.72% chance of randomly selecting someone with an IQ below 130.
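The same probability can be obtained without a table by using the cumulative normal distribution, for example with scipy; a minimal sketch, assuming scipy is available:

```python
from scipy.stats import norm

mu, sigma = 100, 15
p = norm.cdf(130, loc=mu, scale=sigma)   # P(X < 130)
print(round(p, 4))                        # 0.9772
```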
What do you do when you want to know the proportion between two values? For example, the mean driving speed is 58 kilometers per hour and the standard deviation is 10. What proportion of the cars that pass by drive between 55 and 65 kilometers per hour?
- You are actually looking for p(55<X<65) here.
- First, calculate the z-score for both values. For X=55, the z-score is -0.30, because (55-58)/10 = -0.30. For X=65, the z-score is 0.70. Find the corresponding proportion for both values in the table for normal distributions (table A). The proportion of scores below 65 (z = 0.70) = .7580. The proportion of scores below 55 (z = -0.30) = .3821.
- Because the question refers to the proportion between these two values, the final answer is: 0.7580 – 0.3821 = 0.3759 = 37.59%.
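The same approach works in code: the probability between the two values is the difference of the two cumulative proportions. A minimal sketch, again assuming scipy is available:

```python
from scipy.stats import norm

mu, sigma = 58, 10
p = norm.cdf(65, mu, sigma) - norm.cdf(55, mu, sigma)   # P(55 < X < 65)
print(round(p, 4))                                        # ~0.3759
```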
When a variable is measured on a scale with exactly two categories, the resulting data are called binomial. Binomial data can also come from a variable that naturally has only two categories: people are either male or female, and tossing a coin yields either heads or tails. In addition, a researcher may simplify the data by subdividing it into two categories. For example, a psychologist may use scores on a personality test to classify aggression as high or low. Often, the researcher knows the probability of each category. When tossing a fair coin, for example, the probability of heads and of tails is each 50%. For a researcher, it is important to know how often an event occurs over multiple trials. For example, what is the probability that someone tosses heads 15 times, when tossing the coin 20 times in total?
To answer such probability questions, you first have to examine the binomial distribution. The formula for the binomial distribution is:
\[p(X)=\frac{N!}{X!\,(N-X)!}\,p^{X}q^{\,N-X}\]
- p(X) = the probability of X successes
- X = the number of successes
- N = the number of trials
- p = the probability of success on one trial
- q = (1 - p); the probability of failure
The probability of tossing heads 15 times when tossing the coin 20 times would then be:
- p(X) = the probability of X successes = ?
- X = the number of successes = 15
- N = the number of trials = 20
- p = the probability of success on one trial = 0.5
- q = (1 - p); the probability of failure = 1 - 0.5 = 0.5
\[p(15)=\frac{20!}{15!\,(20-15)!}(0.5)^{15}(0.5)^{20-15}\approx 0.0148\]
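The binomial formula is tedious to evaluate by hand; the same answer can be obtained directly in Python. A minimal sketch, assuming scipy is installed:

```python
from math import comb
from scipy.stats import binom

# probability of 15 heads in 20 fair coin tosses
print(comb(20, 15) * 0.5**15 * 0.5**5)   # ~0.0148, via the binomial formula
print(binom.pmf(15, n=20, p=0.5))        # ~0.0148, via scipy's binomial distribution
```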
Mean and variance
When p = q = 0.50, for example when tossing a coin, the binomial distribution will be symmetric. The formulas for mean, variance and standard deviation are:
\[Mean=N\cdot p\]
\[Variance=N\cdot p\cdot q\]
\[Standard\: deviation=\sqrt{N\cdot p\cdot q}\]
- N = the number of trials
- p = the chance of success on 1 trial
- q = (1 - p) = the chance of failure
For the binomial distribution, the distribution becomes more normal when p and q are close to 0.50. In addition, the distribution becomes more symmetric and more normal as the number of trials increases. The rule of thumb is that when N·p and N·q both exceed 5, the distribution is close to normal; in that case, estimates obtained by treating the distribution as normal are reasonably good. For the coin example above, N·p = N·q = 20 · 0.5 = 10, so the normal approximation is acceptable.
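For the coin example (N = 20, p = q = 0.5), the mean, variance and standard deviation, together with the rule-of-thumb check, look as follows in a short sketch (the numbers are the example values from this section):

```python
import math

N, p = 20, 0.5
q = 1 - p

mean = N * p                  # 10.0
variance = N * p * q          # 5.0
sd = math.sqrt(N * p * q)     # ~2.236

# rule of thumb: the normal approximation is reasonable when N*p and N*q both exceed 5
print(mean, variance, sd, N * p > 5 and N * q > 5)   # 10.0 5.0 2.236... True
```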
When we are facing categorical data, the data consist of frequencies of observations that are divided over two or more categories. For such data, we can use the Chi-square test.
The Chi-square distribution differs from the other distributions discussed so far, because it has only one parameter; the rest of the function consists of constants. Where the normal distribution has two parameters (μ and σ, as described above), the Chi-square distribution only has the parameter k, which refers to the degrees of freedom (df) of X².
The Chi-square distribution uses the observed frequencies and the expected frequencies. The observed frequencies are the actual frequencies in the data. The expected frequencies are the frequencies that you would expect, if the null hypothesis is true. The formula for the Chi-square is:
\[\chi^2=\sum\frac{(O-E)^2}{E}\]
- O = observed frequencies
- E = expected frequencies
- Thus, you compute (O - E)²/E for each category and sum these values.
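As a sketch, the statistic can be computed directly from observed and expected frequencies, or with scipy.stats.chisquare; the frequencies below are made-up illustration values:

```python
from scipy.stats import chisquare

observed = [25, 15, 10]          # hypothetical observed frequencies
expected = [20, 20, 10]          # hypothetical expected frequencies under H0

# sum of (O - E)^2 / E over the categories
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(stat)                                   # 2.5

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)        # same statistic, plus the p-value
```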
Table of the Chi-square distribution
Now that we have a value for X², we have to compare this value with the X² distribution to determine the probability of obtaining a value of X² at least this extreme, given the null hypothesis. To do so, you can use the standard table of the X² distribution (table F). The table uses the degrees of freedom. For a one-dimensional table, df = (k - 1): the number of categories minus one. If the obtained X² value is larger than the value in the table, the null hypothesis can be rejected. The problem, however, is that the Chi-square distribution is continuous, while the possible values of the Chi-square statistic are discrete (especially for small sample sizes). Fitting a discrete distribution to a continuous distribution results in a poor fit.
For more information, see: http://math.hws.edu/javamath/ryan/ChiSquare.html
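Instead of table F, the critical value and the p-value can also be obtained from the X² distribution itself, for example with scipy; the degrees of freedom and the statistic below are the hypothetical values from the sketch above:

```python
from scipy.stats import chi2

df = 2                            # k - 1 categories (hypothetical example)
critical = chi2.ppf(0.95, df)     # value that X^2 must exceed at alpha = .05
p_value = chi2.sf(2.5, df)        # P(X^2 >= 2.5) under the null hypothesis
print(critical, p_value)          # ~5.99, ~0.287
```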
Two classification variables
In the previous example, we discussed one dimension (or classification variable). Often, however, multiple classification variables are present and one wants to examine whether they are independent. When the variables are not independent, they are to a certain extent contingent on, or dependent upon, each other. In a contingency table, we can cross the distribution of one variable with that of the other.
In a contingency table, you note (in brackets) the frequencies you would expect if the variables were independent. The expected frequencies are obtained by multiplying the row total by the column total (the marginal totals) and dividing this by the total sample size. (The chance that an observation falls in row i is the total of that row divided by the total number of observations N; the same applies to the columns. If the variables are independent, the expected proportion for a cell is the product of these two chances, and the expected frequency is that proportion multiplied by N.)
\[E_{ij}=\frac{R_i\cdot C_j}{N}\]
- Eij = the expected frequency for a cell with row i and column j
- Ri = row totals of row i
- Cj = column totals of column j
- N = number of observations
The value of X² can again be calculated with the same formula. The degrees of freedom can be deduced from the contingency table by:
\[df=(R-1)\cdot(C-1)\]
where R and C are the number of rows and columns in the table.
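A sketch of the whole procedure for a contingency table, using scipy.stats.chi2_contingency (the 2x3 table below is hypothetical; correction=False disables the continuity correction so that the plain Pearson X² described here is returned):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 20, 30],     # hypothetical observed frequencies
                  [20, 25, 15]])

chi2_stat, p_value, df, expected = chi2_contingency(table, correction=False)
print(expected)             # R_i * C_j / N for every cell
print(df)                   # (R - 1) * (C - 1) = 2
print(chi2_stat, p_value)   # the X^2 statistic and its p-value
```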
Prerequisite of the Pearson Chi-square
One of the main prerequisites for using the Chi-square test is a reasonable size of the expected frequencies. Small expected frequencies cause problems, because they allow only a limited number of possible contingency tables and hence a limited number of possible values of the Chi-square statistic. The continuous X² distribution cannot describe such a discrete distribution well.
In general, the rule is that all expected frequencies should be at least five. For 2x2 tables in which the smallest expected frequency is at least one, an adjusted X² (the 'N - 1' Chi-square) can be computed with the following formula:
\[X^2_{adj}=\frac{X^2\cdot(N-1)}{N}\]
Fisher’s Exact Test is advised when expected frequencies are smaller than one, because this test is not based on the X² distribution.
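For a 2x2 table, the adjusted statistic and Fisher’s Exact Test might be sketched as follows; the cell counts are hypothetical, and note that fisher_exact in scipy only accepts 2x2 tables:

```python
from scipy.stats import chi2_contingency, fisher_exact

table = [[3, 7],      # hypothetical 2x2 table with small frequencies
         [9, 1]]
N = sum(sum(row) for row in table)

chi2_stat, p, df, expected = chi2_contingency(table, correction=False)
chi2_adj = chi2_stat * (N - 1) / N          # the 'N - 1' adjusted chi-square
print(chi2_adj)

odds_ratio, p_exact = fisher_exact(table)   # exact test, not based on X^2
print(p_exact)
```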
Measuring agreement
With categorical data, it is also important to measure to what extent observers agree in their judgements. Imagine that we want to assess the problems of 30 adolescents, subdivided into (1) no problems, (2) problems at school, and (3) problems at home. We ask two observers to rate each adolescent, so that we can compare their judgements. By means of a contingency table, we examine how often each observer assigned each score. Imagine that they agree 20 out of the 30 times (the diagonal cells); then there is an agreement of 66%. This is the percentage of agreement.
The problem with calculating such a percentage is that we have to take into account the possibility that the observers agree purely by chance. To correct for this, Cohen developed the statistic kappa (κ). The formula for kappa is:
\[\kappa=\frac{\sum f_O-\sum f_E}{N-\sum f_E}\]
in which fO is the observed frequency on the diagonal, fE is the expected frequency on the diagonal, and N is the total number of judgements. Assume the kappa is κ = 0.33. This implies that, after correcting for chance, the agreement between the two observers is 33%. This is much lower than the previously computed value of 66%.
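A sketch of how kappa could be computed from a contingency table of two observers’ judgements; the 3x3 table below is hypothetical and merely illustrates the formula:

```python
import numpy as np

ratings = np.array([[8, 1, 1],     # hypothetical agreement table: rows = observer 1,
                    [2, 6, 2],     # columns = observer 2, diagonal cells = agreement
                    [1, 2, 7]])

N = ratings.sum()
observed_diag = np.trace(ratings)                        # sum of f_O on the diagonal
row_tot, col_tot = ratings.sum(axis=1), ratings.sum(axis=0)
expected_diag = (row_tot * col_tot / N).sum()            # sum of f_E on the diagonal

kappa = (observed_diag - expected_diag) / (N - expected_diag)
print(kappa)   # agreement corrected for chance
```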