What role do probability distributions play in statistical inference? – Chapter 4

4.1 What are the basic rules of probability?

Randomization is important for collecting data: the possible observations are known, but it is not yet known which one will occur. What happens depends on probability. The probability of an outcome is the proportion of times that outcome occurs in a long sequence of similar observations. The length of the sequence matters: the longer it is, the closer the observed proportion comes to the true probability, just as the sample proportion comes closer to the population proportion as the sample grows. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

A probability is written as P(A), where P stands for probability and A is an outcome. If only two outcomes are possible and they exclude each other, then the probability that the other outcome B happens is P(B) = 1 - P(A).

Imagine research about people's favorite colors, for instance whether this is mostly red or blue. Again the assumption is that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

Next, imagine research that encompasses multiple questions. The research seeks to investigate how many people are married and have kids. Then you can multiply the probability that someone is married (A) by the probability that someone has kids given that they are married (B given A). The formula for this is: P(A and B) = P(A) * P(B given A). Because the probability of B depends on A, P(B given A) is called a conditional probability.

Now, imagine researching multiple possibilities that are not connected. The probability that a random person likes to wear sweaters (A) and that another random person also likes to wear sweaters (B) is P(A and B) = P(A) * P(B). These are independent probabilities.
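The rules above come down to simple arithmetic. The sketch below works through each of them with made-up probabilities; every number and variable name is an assumption chosen for illustration, not data from the text.

```python
# Hypothetical probabilities, chosen only to illustrate the rules.
p_red = 0.30    # P(A): favorite color is red
p_blue = 0.25   # P(B): favorite color is blue

# Complement rule: P(not A) = 1 - P(A)
p_not_red = 1 - p_red

# Addition rule for outcomes that exclude each other:
# P(A or B) = P(A) + P(B)
p_red_or_blue = p_red + p_blue

# Multiplication rule with a conditional probability:
# P(A and B) = P(A) * P(B given A)
p_married = 0.50              # P(A)
p_kids_given_married = 0.60   # P(B given A)
p_married_and_kids = p_married * p_kids_given_married

# Multiplication rule for independent outcomes:
# P(A and B) = P(A) * P(B)
p_sweater = 0.40              # assumed equal for both random persons
p_both_like_sweaters = p_sweater * p_sweater

print(round(p_not_red, 2), round(p_red_or_blue, 2),
      round(p_married_and_kids, 2), round(p_both_like_sweaters, 2))
```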

4.2 What is the difference in probability distributions for discrete and continuous variables?

A random variable is a variable whose outcome differs from observation to observation; mostly it is just referred to as a variable. While a discrete variable has a set of separate possible values, a continuous variable can assume any value in an interval. Because a probability distribution shows the probabilities for the values a variable can take, it takes a different form for discrete and continuous variables.

For a discrete variable a probability distribution gives the probability of each possible value. Every probability is a number between 0 and 1. The sum of all probabilities is 1. The probabilities are written P(y), where P is the probability that y has a certain value. In a formula: 0 ≤ P(y) ≤ 1, and ∑ P(y) = 1, where the sum runs over all possible values of y.

Because a continuous variable has infinitely many possible values, a probability distribution can't show them all. Instead, a probability distribution for a continuous variable assigns probabilities to intervals of possible values. The probability that a value falls within a certain interval is between 0 and 1. When an interval in a graph contains 20% of the data, the probability that a value falls within that interval is 0.20.

Figure: probability distribution with intervals

Just like a population distribution, a probability distribution has parameters that describe it. The mean describes the center and the standard deviation the variability. The formula for the mean of the probability distribution of a discrete variable is: µ = ∑ y P(y). This parameter is called the expected value of y, written E(y).
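As an illustration of µ = ∑ y P(y), the sketch below computes the expected value of a small, made-up discrete distribution and checks that the probabilities are valid. The distribution is hypothetical. The variance formula σ² = ∑ (y - µ)² P(y) is the standard definition and is not given in the text above.

```python
import math

# Hypothetical distribution: value y mapped to its probability P(y).
distribution = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}

# Each probability lies between 0 and 1, and together they sum to 1.
probabilities_valid = (
    all(0 <= p <= 1 for p in distribution.values())
    and math.isclose(sum(distribution.values()), 1.0)
)

# Expected value: mu = sum of y * P(y)
mu = sum(y * p for y, p in distribution.items())

# Standard definition (not from the text): sigma^2 = sum of (y - mu)^2 * P(y)
variance = sum((y - mu) ** 2 * p for y, p in distribution.items())
sigma = math.sqrt(variance)

print(probabilities_valid, round(mu, 2), round(sigma, 3))
```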

4.3 How does the normal distribution work exactly?

The normal distribution is useful because many variables have approximately this distribution and because it helps to make statistical predictions. The normal distribution is symmetrical, shaped like a bell, and is described by a mean (µ) and a standard deviation (σ). The empirical rule applies to the normal distribution: about 68% of the observations fall within 1 standard deviation of the mean, about 95% within 2 standard deviations and about 99.7% within 3 standard deviations. The normal distribution looks like this:

Figure: normal probability distribution
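The shares in the empirical rule can be computed rather than memorized. This is a minimal sketch using `statistics.NormalDist` from Python's standard library; the helper name `prob_within` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def prob_within(z):
    """Probability of falling within z standard deviations of the mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1, 2, 3):
    print(f"within {z} standard deviation(s): {prob_within(z):.4f}")
```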

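The number of standard deviations that a value lies from the mean is indicated as z. Software such as R, SPSS, Stata and SAS can find probabilities for a normal distribution. These probabilities are usually cumulative: the software reports the probability of falling below a given value. Because the curve is symmetrical, the probability of falling more than z standard deviations above the mean equals the probability of falling more than z standard deviations below it. The formula for z is: z = (y - µ) / σ.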

The z-score is the number of standard deviations that a value y lies from the mean. A positive z-score means that y falls above the mean, a negative z-score means that it falls below.

Conversely, when a probability P is known, the corresponding value y can be found. Software helps to find the z-score that belongs to that probability. The formula for y is then: y = µ + zσ.
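Both directions, from y to a probability and from a probability back to y, can be sketched with `statistics.NormalDist`. The mean of 100 and standard deviation of 15 are assumed values for illustration only.

```python
from statistics import NormalDist

mu, sigma = 100, 15            # hypothetical population parameters
dist = NormalDist(mu, sigma)

# From y to z to a probability.
y = 130
z = (y - mu) / sigma           # number of standard deviations from the mean
p_below = dist.cdf(y)          # cumulative probability of falling below y
p_above = 1 - p_below          # probability of falling above y

# From a probability back to y: the value with 95% of the distribution below it.
z_95 = NormalDist().inv_cdf(0.95)   # z-score from the standard normal distribution
y_95 = mu + z_95 * sigma            # y = mu + z * sigma

print(z, round(p_below, 4), round(p_above, 4), round(y_95, 2))
```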

A special kind of normal distribution is the standard normal distribution, which consists of z-scores. A variable y can be converted to z by subtracting the mean and then dividing by the standard deviation. This creates a distribution with µ = 0 and σ = 1.
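That standardizing always produces a mean of 0 and a standard deviation of 1 can be verified on any set of values. The sketch below uses five made-up scores and treats them as the whole population, which is why it uses `pstdev` (the population standard deviation).

```python
from statistics import mean, pstdev

scores = [62, 70, 74, 81, 88]   # hypothetical values, treated as a population

mu = mean(scores)
sigma = pstdev(scores)

# z = (y - mu) / sigma for every value
z_scores = [(y - mu) / sigma for y in scores]

z_mean = mean(z_scores)
z_sd = pstdev(z_scores)
print(round(z_mean, 10), round(z_sd, 10))
```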

A bivariate normal distribution is used for bivariate probabilities. In case of two variables (y and x), there are two means (µy and µx) and two standard deviations (σy and σx). The covariance is the way that y and x vary together:

Covariance (x, y) = E[(x – µx)(y – µy)]
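As a small worked example of the covariance formula, the sketch below treats five made-up (x, y) pairs as an entire population in which each pair is equally likely, so the expected value E[...] becomes a plain average.

```python
# Hypothetical population of (x, y) pairs, each equally likely.
pairs = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
n = len(pairs)

mu_x = sum(x for x, _ in pairs) / n
mu_y = sum(y for _, y in pairs) / n

# Covariance(x, y) = E[(x - mu_x)(y - mu_y)]
covariance = sum((x - mu_x) * (y - mu_y) for x, y in pairs) / n

print(mu_x, mu_y, covariance)
```

A positive result means that x and y tend to lie above (or below) their means together.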

4.4 What is the difference between sample distributions and sampling distributions?

A simulation can show whether the outcome of a study, such as a poll, is a good representation of the population. Software can generate random numbers to mimic repeated sampling.

When the characteristics of the population are unknown, samples are used. Statistics from samples give information about the unknown parameters of the population. A sampling distribution shows the probabilities for the possible values of a sample statistic (this is not the same as a sample distribution, which shows how the observed data are distributed). Every statistic has a sampling distribution: the sample median, the sample mean, and so on. This kind of distribution shows the probability of each outcome that the statistic may take.

A sampling distribution serves to estimate how close a statistic lies to its parameter. A sampling distribution for a statistic based on n observations is the relative frequency distribution of that statistic that results from repeatedly drawing samples of size n. A sampling distribution can be formed using repeated samples, but generally its form is known already. The sampling distribution makes it possible to find probabilities for the values of a statistic of a sample with n observations.
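The idea of repeated samples of size n can be mimicked with random numbers. The sketch below simulates many polls of 100 respondents from a population in which the true proportion is assumed to be 0.5, and records the sample proportion of each poll. The population proportion, sample size, number of repeats and seed are all arbitrary choices for illustration.

```python
import random

random.seed(42)       # fixed seed so the run is reproducible

TRUE_P = 0.5          # hypothetical population proportion
N = 100               # observations per simulated poll
REPEATS = 2000        # number of simulated polls

def poll_successes(n, p):
    """Number of 'yes' answers in one simulated poll of n respondents."""
    return sum(1 for _ in range(n) if random.random() < p)

# One sample proportion per simulated poll: together these approximate
# the sampling distribution of the sample proportion.
sample_proportions = [poll_successes(N, TRUE_P) / N for _ in range(REPEATS)]

mean_of_proportions = sum(sample_proportions) / REPEATS
share_close = sum(
    1 for p in sample_proportions if abs(p - TRUE_P) <= 0.10
) / REPEATS

print(f"mean of the sample proportions: {mean_of_proportions:.3f}")
print(f"share of polls within 0.10 of the true proportion: {share_close:.3f}")
```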

4.5 How do you create the sampling distribution for a sample mean?

When the sample mean is known, its proximity to the population mean may still be a mystery: it is unknown whether ȳ = µ. However, the sampling distribution gives indications, for instance a high probability that ȳ is within ten units of µ. Over a large number of samples, the mean of the sampling distribution of ȳ equals the population mean µ.

The variability of the sampling distribution of ȳ is described by the standard deviation of ȳ, called the standard error of ȳ. This is written as σȳ. The formula for finding the standard error is:

σȳ = σ / √n

The standard error indicates how much the sample mean varies from sample to sample. This says something about how precisely a sample mean estimates the population mean.

For a random sample of size n, the standard error of ȳ depends on the standard deviation of the population (σ). When n gets bigger, the standard error becomes smaller. This means that a bigger sample represents the population better. The difference between the sample mean and the population mean is called the sampling error.

The standard error and the sampling error are two different things. The sampling error is the difference between the mean of one sample and the population mean. The standard error measures how much sample means vary from sample to sample.
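The shrinking of the standard error with larger n can be checked by simulation. The sketch below assumes the standard formula σȳ = σ / √n and compares it with the standard deviation of many simulated sample means. The population (normal with mean 50 and standard deviation 10), the sample sizes and the seed are made up for illustration.

```python
import math
import random
import statistics

random.seed(0)            # fixed seed so the run is reproducible

MU, SIGMA = 50, 10        # hypothetical population mean and standard deviation
REPEATS = 2000            # number of simulated samples per sample size

def simulated_standard_error(n):
    """Standard deviation of the sample means of REPEATS samples of size n."""
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(REPEATS)
    ]
    return statistics.pstdev(means)

for n in (25, 100):
    theoretical = SIGMA / math.sqrt(n)      # sigma / sqrt(n)
    simulated = simulated_standard_error(n)
    print(f"n={n}: formula {theoretical:.2f}, simulation {simulated:.2f}")
```

Quadrupling the sample size from 25 to 100 halves the standard error, because n enters the formula through its square root.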

No matter how the population distribution is shaped, the sampling distribution of ȳ is approximately a normal distribution when the sample size n is large enough. This is called the Central Limit Theorem. Even if the population distribution has very discrete values, the sampling distribution is approximately normal. However, when the population distribution is very skewed, the sample needs to be big for the sampling distribution to have the normal shape. For small samples the Central Limit Theorem can't necessarily be used.

Just like the standard error, the Central Limit Theorem is useful for finding information about the sampling distribution and the sample mean ȳ. Because the sampling distribution is approximately normal, the Empirical Rule can be applied.
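The effect of sample size on the shape of the sampling distribution can be sketched with a strongly skewed population. The example below draws from an exponential distribution (an assumed population, chosen only because it is clearly skewed) and measures the skewness of the simulated sample means for a small and a larger n. A skewness near 0 indicates a symmetrical, more normal-looking shape. The helper names and the seed are made up for this illustration.

```python
import random
import statistics

random.seed(7)       # fixed seed so the run is reproducible
REPEATS = 3000       # number of simulated samples per sample size

def sample_mean(n):
    """Mean of one sample of size n from a skewed (exponential) population."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

def skewness(values):
    """Average cubed z-score: about 0 for a symmetrical distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

skew_by_n = {
    n: skewness([sample_mean(n) for _ in range(REPEATS)])
    for n in (2, 50)
}

for n, skew in skew_by_n.items():
    print(f"n={n}: skewness of the sample means = {skew:.2f}")
```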

4.6 What is the connection between the population, the sample data and the sampling distribution?

To understand sampling, distinguishing between three distributions is important:

  1. The population distribution describes the entirety from which the sample is drawn. The parameters µ and σ denote the population mean and the population standard deviation.

  2. The sample data distribution portrays the variability of the observations made in the sample. The sample mean ȳ and the sample standard deviation s describe the curve.

  3. The sampling distribution shows the probabilities that a statistic from the sample, such as the sample mean, has certain values. It tells how much samples can differ.

The Central Limit Theorem says that, for a sufficiently large sample, the sampling distribution of the sample mean is shaped approximately like a normal distribution. Information can be deduced just from this shape. The possibility to retrieve information from the shape is the reason that the normal distribution is very important to statistics.

This content is used in:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Follow the author: Annemarie JoHo