Understanding distributions in statistics

Normal distribution

The normal distribution is a symmetric, bell-shaped distribution. It is the most important distribution in statistics for four reasons:

  1. We expect many of the dependent variables we work with to be (approximately) normally distributed in the population.

  2. If a variable is (approximately) normally distributed, we are able to make claims about the values of that variable (approximate normality is often a prerequisite for an analysis).

  3. When a very large number of samples is drawn from a population, the distribution of the sample means tends towards a normal distribution.

  4. Many statistical procedures assume that the observations are normally distributed.

(The normal distribution uses so-called z-scores. To discuss the normal distribution, we therefore first have to explain what z-scores are and how they can be used.)

Standard scores or z-scores

Results from tests or surveys can take thousands of possible values and come in many different units. On their own, those results can often seem meaningless. For example, knowing that someone’s weight is 90 kilos might be good information, but if you want to compare it to the “average” person’s weight, looking at a vast table of data can be overwhelming (especially if some weights are recorded in pounds). A z-score can tell you where that person’s weight lies relative to the population’s mean weight.

Simply put, a z-score is the number of standard deviations a data point lies from the mean. More technically, it is a measure of how many standard deviations below or above the population mean a raw score is. Z-scores are a way to compare results from a test to a “normal” population.

The basic z-score formula, using the population mean μ and the population standard deviation σ, is:

\[z = \frac{(x - \mu)}{\sigma}\]

For example, let’s say you have a test score of 220. The test has a mean (μ) of 160 and a standard deviation (σ) of 25. Assuming a normal distribution, your z-score would be:

\[z=\frac{(x - \mu)}{\sigma}=\frac{220-160}{25}=2.4\]

The z-score tells you how many standard deviations from the mean your score is. In this example, your score is 2.4 standard deviations above the mean.

The z-score describes the exact position of an X-value in two ways:

Via the sign: the mean μ always lies in the middle of the curve, where the z-score is zero. To the right of the mean, z-scores have a positive sign. To the left of the mean, z-scores have a minus sign.

Via the value: the value of the z-score describes the distance between the X-value and the mean in number of standard deviations (a z-score of 1.00 means that the X-value is 1 standard deviation away from the mean).
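The formula and its two readings (sign and value) can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` is an assumption, and the numbers are those of the test-score example.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Test-score example: x = 220, mean 160, standard deviation 25.
z = z_score(220, mu=160, sigma=25)

# The sign gives the side of the mean, the absolute value gives the distance.
direction = "above" if z > 0 else "below" if z < 0 else "at"
print(f"z = {z:.2f}: {abs(z):.2f} standard deviations {direction} the mean")
```

The sign is read separately from the size here, mirroring the two ways in which a z-score locates an X-value.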

The z-score and a normal distribution

The standard normal distribution has a mean of 0 and a standard deviation of 1, and is therefore written as N(0,1). The normal distribution is symmetric; the highest frequency is in the middle, and the frequencies decrease towards the left and the right of the distribution. Z-scores for normal distributions are expressed in standard deviations: a z-score of +2 means that the score is two standard deviations above the mean. For a normal distribution, the following rules apply:

  • About 68% of the observations fall within 1 standard deviation of the mean

  • About 95% of the observations fall within 2 standard deviations of the mean

  • About 99.7% of the observations fall within 3 standard deviations of the mean
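These three percentages can be checked numerically. The sketch below relies on the standard identity that, for a normal distribution, the proportion within k standard deviations of the mean equals erf(k/√2); the function name `proportion_within` is an assumption made for this illustration.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {proportion_within(k):.4f}")
```

The printed proportions round to the 68%, 95% and 99.7% of the rule.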

 

Central limit theorem

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the sample means will be approximately normally distributed, whatever the shape of the population distribution. The approximation improves as the sample size n grows; a common rule of thumb is that n should be at least 30.
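A small simulation can make the theorem tangible. The sketch below is an illustration under stated assumptions: the population is taken to be a strongly skewed exponential distribution with mean 1 and standard deviation 1, and the sample count and seed are arbitrary choices. It also uses the well-known result that the standard deviation of the sample means is σ/√n.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 30            # size of each sample (the rule-of-thumb minimum)
n_samples = 5000  # number of samples drawn (arbitrary choice)

# Assumed population: exponential with rate 1 (mean 1, SD 1, clearly not normal).
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

print("mean of the sample means:", round(statistics.fmean(sample_means), 3))
print("SD of the sample means:  ", round(statistics.stdev(sample_means), 3))
print("sigma / sqrt(n):         ", round(1 / n ** 0.5, 3))
```

Although the population itself is skewed, the sample means cluster around the population mean with a spread close to σ/√n. A histogram of `sample_means` would show the familiar bell shape.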

Chances, proportions and scores

Imagine: a distribution of intelligence has a μ of 100 and a σ of 15. What is the chance that, by means of random sampling, an individual is selected with an IQ below 130?

  • To answer this question, the IQ-scores (X-values) first have to be converted to z-scores. Next, the corresponding proportion has to be looked up. This proportion equals the chance we are looking for.
  • Here, the z-score is +2. This score is computed as follows: (130-100)/15 = 2. According to table A, the corresponding proportion is 0.9772.
  • Thus: P(X<130) = 0.9772, meaning that there is a 97.72% chance to randomly select someone with an IQ below 130.
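Instead of looking the proportion up in a table, it can be computed from the z-score. The sketch below uses the standard relation between the normal cumulative distribution and the error function; the function name `normal_cdf` is an assumption made for this illustration.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # convert the X-value to a z-score first
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# IQ example: mean 100, standard deviation 15, threshold 130 (z = 2).
p = normal_cdf(130, mu=100, sigma=15)
print(f"P(X < 130) = {p:.4f}")
```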

What should you do when you want to know the proportion between two values? For example, the mean driving speed is 58 kilometers per hour and the standard deviation is 10. What proportion of the cars that pass by will drive between 55 and 65 kilometers per hour?

  • You are actually looking for p(55<X<65) here.
  • First, calculate the z-score for both values. For X=55, the z-score is -0.30, because (55-58)/10 = -0.30. For X=65, the z-score is 0.70. Find the corresponding proportion for both values in the table for normal distributions (table A). The proportion of scores below 65 (z = 0.70) = .7580. The proportion of scores below 55 (z = -0.30) = .3821.
  • Because the question refers to the proportion between these two values, the final answer is: 0.7580 – 0.3821 = 0.3759 = 37.59%.
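The same subtraction of two cumulative proportions can be done with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is a sketch of the driving-speed example, assuming the speeds are normally distributed as in the text.

```python
from statistics import NormalDist

# Driving speeds: mean 58 km/h, standard deviation 10 km/h.
speed = NormalDist(mu=58, sigma=10)

# Proportion below 65 minus proportion below 55.
p_between = speed.cdf(65) - speed.cdf(55)
print(f"P(55 < X < 65) = {p_between:.4f}")
```

Because the computation does not round the z-scores or the table entries, small differences in the last decimal compared with a table-based answer are possible in general.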

The binomial distribution

When a variable is measured on a scale with exactly two categories, the resulting data are called binomial. Some variables naturally have only two categories. For example, people are either male or female, and tossing a coin can only result in heads or tails. In addition, a researcher may want to simplify the data by subdividing it into two categories. For example, a psychologist may use scores on a personality test to classify aggression as high or low. Often, the researcher knows the chance of each category. For tossing a coin, for example, the chance of both heads and tails is 50%. For a researcher, it is important to know how often an event occurs over multiple trials. For example, what is the chance that someone tosses heads 15 times when tossing the coin 20 times in total?

To answer questions about chances in the binomial distribution, you first have to examine the binomial distribution. The formula for the binomial distribution is:

\[p(X)=\frac{N!}{X!\,(N-X)!}\,p^X q^{(N-X)}\]

  • p(X) = the chance of X successes
  • X = the number of successes
  • N = the number of trials
  • p = the chance of success on 1 trial
  • q = (1-p); the chance of failure

The chance that someone tosses heads 15 times when tossing the coin 20 times would be:

  • p(X) = the chance of X successes = ?
  • X = the number of successes = 15
  • N = the number of trials = 20
  • p = the chance of success on 1 trial = 0.5
  • q = (1-p); the chance of failure = 1 - 0.5 = 0.5

\[p(15)=\frac{20!}{15!\,(20-15)!}\cdot 0.5^{15}\cdot 0.5^{(20-15)}=15504\cdot 0.5^{20}\approx 0.0148\]
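The coin example can be checked by evaluating the binomial probability directly: the binomial coefficient N!/(X!(N-X)!) multiplied by p to the power X and q to the power N-X. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the function name `binomial_pmf` is an assumption made for this illustration.

```python
import math


def binomial_pmf(x, n, p):
    """Chance of exactly x successes in n independent trials with success chance p."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)


# 15 heads in 20 tosses of a fair coin.
prob = binomial_pmf(15, n=20, p=0.5)
print(f"number of ways: {math.comb(20, 15)}")
print(f"P(X = 15) = {prob:.4f}")
```

So the chance of exactly 15 heads in 20 tosses is about 1.5%.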

Mean and variance

When p = q = 0.50, for example when tossing a coin, the binomial distribution will be symmetric. The formulas for mean, variance and standard deviation are:

\[Mean=N\cdot p\]

\[Variance=N\cdot p\cdot q\]

\[Standard\: deviation=\sqrt{N\cdot p\cdot q}\]

  • N = the number of trials
  • p = the chance of success on 1 trial
  • q = (1 - p) = the chance of failure
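As a brief illustration, the three formulas can be applied to the coin example (N = 20, p = 0.5). The function name `binomial_summary` is an assumption made for this sketch.

```python
import math


def binomial_summary(n, p):
    """Return the mean, variance and standard deviation of a binomial distribution."""
    q = 1 - p
    mean = n * p
    variance = n * p * q
    return mean, variance, math.sqrt(variance)


mean, variance, sd = binomial_summary(n=20, p=0.5)
print(f"mean = {mean}, variance = {variance}, standard deviation = {sd:.3f}")
```

For 20 tosses of a fair coin, this gives an expected 10 heads with a standard deviation of a little over 2.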

The binomial distribution becomes more normal as p and q get closer to 0.50. In addition, the distribution becomes more symmetric and more normal as the number of trials increases. The rule of thumb is that the distribution is close to normal when N*p and N*q are both at least 5. In that case, estimates are reasonably good when we treat the distribution as normal.
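How good the normal approximation is can be checked by comparing an exact binomial probability with its normal counterpart. The sketch below assumes the common form of the rule of thumb (both N*p and N*q at least 5) and adds a continuity correction of 0.5, which the text does not mention; the function names and the cut-off of 12 heads are chosen purely for illustration.

```python
import math
from statistics import NormalDist


def exact_cdf(x, n, p):
    """Exact binomial chance of at most x successes in n trials."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))


def normal_approx_cdf(x, n, p):
    """Normal approximation of the same chance, using mean N*p and SD sqrt(N*p*q)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    # Assumption: add 0.5 (continuity correction) because counts are discrete.
    return NormalDist(mean, sd).cdf(x + 0.5)


n, p = 20, 0.5
rule_ok = n * p >= 5 and n * (1 - p) >= 5

exact = exact_cdf(12, n, p)
approx = normal_approx_cdf(12, n, p)
print(f"rule of thumb satisfied: {rule_ok}")
print(f"exact P(X <= 12) = {exact:.4f}, normal approximation = {approx:.4f}")
```

With N = 20 and p = 0.5 the two values are very close, which is what the rule of thumb leads us to expect.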

Categorical data and Chi-square

When we are dealing with categorical data, the data consist of frequencies of observations that are subdivided into two or more categories. For these data, we can use the Chi-square test.

The Chi-square distribution

The Chi-square distribution differs from many other distributions in that it has only one parameter; everything else in its formula is a constant. Where the normal distribution has two parameters (μ and σ, as described above), the Chi-square distribution has only k, which refers to the degrees of freedom (df) of X².

The Chi-square test compares the observed frequencies with the expected frequencies. The observed frequencies are the actual frequencies in the data. The expected frequencies are the frequencies that you would expect if the null hypothesis were true. The formula for the Chi-square is:

\[X^2=\sum\frac{(O-E)^2}{E}\]

  • O = observed frequencies
  • E = expected frequencies
  • Thus, for each category you compute (O-E)²/E and sum the results.
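The summation can be sketched as follows. The observed and expected frequencies are hypothetical numbers invented for this illustration (60 observations over three categories, with equal frequencies expected under the null hypothesis), and the function name `chi_square` is an assumption.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 observations, three categories, equal expected frequencies.
observed = [30, 20, 10]
expected = [20, 20, 20]

print(chi_square(observed, expected))  # 100/20 + 0/20 + 100/20 = 10.0
```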

Table of the Chi-square distribution

Now that we have a value for X², we have to compare this value with the X² distribution to determine the chance of obtaining a value of X² at least this extreme if the null hypothesis is true. To do so, you can use the standard table of the X² distribution (table F). The table is organized by degrees of freedom. For a one-dimensional table, df = k-1: the number of categories minus one. If the obtained X² value is larger than the value in the table, the null hypothesis can be rejected. One problem, however, is that the Chi-square distribution is continuous, while the possible values of X² are discrete (especially for small sample sizes). Approximating a discrete distribution with a continuous one gives an imperfect fit.
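The table lookup can be sketched as a small decision function. The critical values below are the standard upper-tail values of the Chi-square distribution at a significance level of 0.05 for 1 to 5 degrees of freedom; the choice of 0.05, the names `CRITICAL_05` and `reject_null`, and the example value of 10.0 are assumptions made for this illustration.

```python
# Standard critical values of the Chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (only df = 1..5 are included in this sketch).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def reject_null(chi_square_value, n_categories):
    """Compare an obtained X² with the critical value for df = k - 1."""
    df = n_categories - 1
    return chi_square_value > CRITICAL_05[df]


# Hypothetical result: X² = 10.0 obtained with three categories (df = 2).
print(reject_null(10.0, n_categories=3))
```

Since 10.0 exceeds the critical value of 5.991 for df = 2, the null hypothesis would be rejected at the 0.05 level in this hypothetical case.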

For more information, see: http://math.hws.edu/javamath/ryan/ChiSquare.html

Two classification variables

In the previous examples, we discussed one dimension (or classification variable). Often, however, multiple classification variables are present and one wants to examine whether these are independent. When the variables are not independent, they are to a certain extent contingent or dependent upon each other. In a contingency table, we set the categories of one variable against the categories of the other.

In a contingency table, you note (between brackets) the frequencies you would expect if the variables were independent. The expected frequency of a cell is obtained by multiplying its row total by its column total (the marginal totals) and dividing the result by the total sample size. (The chance that an observation belongs to row 1 is the total of that row divided by the total number of observations N. The same applies to the columns. If the variables are independent, the chance that an observation falls in a given cell is the product of these two chances, and multiplying that product by N gives the expected frequency.)

\[E_{ij}=\frac{R_i\cdot C_j}{N}\]

  • Eij = the expected frequency for the cell in row i and column j
  • Ri = the row total of row i
  • Cj = the column total of column j
  • N = number of observations

The value of X² can again be calculated with the same formula. The degrees of freedom can be deduced from the contingency table by:

\[df=(R-1)\cdot(C-1)\]

where R and C are the number of rows and the number of columns in the table.
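The steps for a contingency table (expected frequencies from the marginal totals, then X², then the degrees of freedom) can be sketched as follows. The 2x2 table is hypothetical data invented for this illustration, and the function names are assumptions.

```python
def expected_frequencies(table):
    """Expected cell frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return X² and df = (R - 1) * (C - 1) for a contingency table."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical observed frequencies (rows and columns are two categorical variables).
table = [[10, 20],
         [30, 40]]

expected = expected_frequencies(table)
chi2, df = chi_square_independence(table)
print("expected:", expected)
print(f"X² = {chi2:.4f}, df = {df}")
```

Here the row totals are 30 and 70, the column totals 40 and 60, and N = 100, so the expected frequency of the first cell is 30 * 40 / 100 = 12.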

Prerequisite of the Pearson Chi-square

One of the main prerequisites for using the Chi-square test is that the expected frequencies are reasonably large. Small expected frequencies may cause problems, because they allow only a limited number of different contingency tables and hence only a limited number of possible values of X². The continuous X² distribution cannot describe this discrete distribution well.

In general, the rule is that all expected frequencies should be at least five. For smaller frequencies, it is advised to use Fisher’s Exact Test, because this test is not based on the X² distribution. For 2x2 tables with expected frequencies of at least 1, an adjusted X² can be found with the following formula:

\[X^2_{adj}=\frac{X^2 \cdot N}{N-1}\]

Fisher’s Exact Test is used when expected frequencies are smaller than one.

Measuring agreement

With categorical data, it is important to measure to what extent observers agree in their judgements. Imagine that we want to assess the problems of 30 adolescents, using the categories (1) no problems, (2) problems at school and (3) problems at home. We ask two observers to classify each adolescent, so that we can compare their judgements. By means of a contingency table, we examine how often each observer assigned each category. Imagine that they agree 20 out of the 30 times (the diagonal cells); the agreement is then 20/30, or about 67%. This is the percentage of agreement.

The problem with a percentage of agreement is that it does not take into account that the observers may agree by chance. To correct for this, Cohen developed the statistic kappa (K). The formula for kappa is:

\[K=\frac{\sum f_o-\sum f_e}{N-\sum f_e}\]

in which fo is the observed frequency on the diagonal, fe is the expected frequency on the diagonal and N is the total number of observations. Assume that kappa is K = 0.33. This implies that, after correction for chance, the agreement between the two observers is 33%. This is much lower than the percentage of agreement computed before.
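Cohen's kappa is usually computed as the sum of the observed diagonal frequencies minus the sum of the expected diagonal frequencies, divided by N minus the sum of the expected diagonal frequencies. The sketch below applies this to a hypothetical 3x3 table of ratings in which the two observers agree on 20 of 30 adolescents; the individual cell counts are invented for illustration, so the resulting kappa differs from the 0.33 assumed in the text.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table (rows: observer A, columns: observer B)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: the diagonal cells.
    f_o = sum(table[i][i] for i in range(len(table)))
    # Agreement expected by chance: row total * column total / N on the diagonal.
    f_e = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (f_o - f_e) / (n - f_e)


# Hypothetical ratings: no problems, problems at school, problems at home.
ratings = [[10, 2, 1],
           [2, 6, 2],
           [1, 2, 4]]

agreement = sum(ratings[i][i] for i in range(3)) / 30
kappa = cohen_kappa(ratings)
print(f"percentage of agreement = {agreement:.1%}, kappa = {kappa:.3f}")
```

In this invented table, about 10.6 of the 20 agreements would be expected by chance alone, which is why kappa is clearly lower than the raw percentage of agreement.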

Statistics: suggestions, summaries and tips for encountering Statistics

Knowledge and assistance for discovering, identifying, recognizing, observing and defining statistics.

Stats for students: Simple steps for passing your statistics courses

How to triumph over the theory of statistics (without understanding everything)?

Stats of students

  • In the first years that you study statistics, it is often a matter of taking knowledge for granted and simply trying to pass the courses. Don't worry if you don't understand everything right away: in later years it will fall into place, and you will see the importance of the theory you had to learn before.
  • The book you need to study may be difficult to understand at first. Be patient: later in your studies, the effort you put in now will pay off.
  • Be a Gestalt Scientist! In other words, recognize that the whole of statistics is greater than the sum of its parts. It is very easy to get hung up on nit-picking details and fail to see the forest for the trees.
  • Tip: Precise use of language is important in research. Try to reproduce the theory verbatim (i.e. learn it by heart) where possible. That way you do not have to understand it yet, you show that you have been working on it, you cannot go wrong by using the wrong word, and you practice for later reporting of research.
  • Tip: Keep study material, handouts, sheets, and other publications from your teacher for future reference.

How to score points with formulas of statistics (without learning them all)?

  • The direct relationship between data and results consists of mathematical formulas. These follow their own logic, are written in their own language, and can therefore be complex to comprehend.
  • If you don't understand the math behind statistics, you don't understand statistics. This does not have to be a problem, because statistics is an applied science from which you can also get excellent results without understanding. None of your teachers will understand all the statistical formulas.
  • Please note: you will probably have to know and understand a number of formulas, so that you can demonstrate that you know the principle of how statistics work. Which formulas you need to know differs from subject to subject and lecturer to lecturer, but in general these are relatively simple formulas that occur frequently, and your lecturer will likely tell you (often several times) that you should know this formula.
  • Tip: if you want to recognize statistical symbols, you can use: Recognizing commonly used statistical symbols
  • Tip: have fun with LaTeX! LaTeX code gives us a simple way to write out mathematical formulas and make them look professional. Play with LaTeX. That way you can include the formulas you use in your own papers, and you learn how a formula is built up, which greatly helps you understand and remember it. See also (in Dutch): How to create formulas like a pro on JoHo WorldSupporter?
  • Tip: Are you interested in a career in sciences or programming? Then take your formulas seriously and go through them again after your course.

How to practice your statistics (with minimal effort)?

How to select your data?

  • Your teacher will regularly use a dataset for lessons during the first years of your studies. It is instructive (and can be a lot of fun) to set up your own research for once, with real data that is also used by other researchers.
  • Tip: scientific articles often indicate which datasets have been used for the research. There is a good chance that those datasets are valid. Sometimes there are also studies that determine which datasets are more valid for the topic you want to study than others. Make use of datasets other researchers point out.
  • Tip: Do you want an interesting research result? You can use the same method and question, but use an alternative dataset, and/or alternative variables, and/or alternative location, and/or alternative time span. This allows you to validate or falsify the results of earlier research.
  • Tip: for datasets you can look at Discovering datasets for statistical research

How to operationalize clearly and smartly?

  • For the operationalization, it is usually sufficient to indicate the following three things:
    • What is the concept you want to study?
    • Which variable does that concept represent?
    • Which indicators do you select for those variables?
  • It is smart to argue that a variable is valid, or why you choose that indicator.
  • For example, if you want to know whether someone is currently a father or mother (concept), you can search the variables for how many children the respondent has (variable) and then select on values greater than 0, or not equal to 0 (indicators). Where possible, use the terms 'concept', 'variable', 'indicator' and 'valid' in your communication. For example, as follows: “The variable [variable name] is a valid measure of the concept [concept name] (if applicable: source). The value [description of the value] is an indicator of [what you want to measure].” (For example: The variable "Number of children" is a valid measure of the concept of parenthood. A value greater than 0 is an indicator of whether someone is currently a father or mother.)

How to run analyses and draw your conclusions?

  • The choice of your analyses depends, among other things, on what your research goal is, which methods are often used in the existing literature, and practical issues and limitations.
  • The more you learn, the more independently you can choose research methods that suit your research goal. In the beginning, follow the lecturer – at the end of your studies you will have a toolbox with which you can vary in your research yourself.
  • Try to link up as much as possible with research methods that are used in the existing literature, because otherwise you could be comparing apples with oranges. Deviating can sometimes lead to interesting results, but discuss this with your teacher first.
  • For as long as you need, keep a step-by-step plan at hand on how you can best run your analysis and achieve results. For every analysis you run, there is a step-by-step explanation of how to perform it; if you do not find it in your study literature, it can often be found quickly on the internet.
  • Tip: Practice a lot with statistics, so that you can show results quickly. You cannot learn statistics by just reading about it.
  • Tip: The measurement level of the variables you use (ratio, interval, ordinal, nominal) largely determines the research method you can use. Show your audience that you recognize this.
  • Tip: conclusions from statistical analyses are never certain, only likely at most. There is usually a standard formulation for each research method with which you can express the conclusions from that analysis and at the same time indicate that they are not certain. Use that standard wording when communicating about results from your analysis.
  • Tip: see explanation for various analyses: Introduction to statistics