What role do probability distributions play in statistical inference? – Chapter 4

4.1 What are the basic rules of probability?
4.2 What is the difference in probability distributions for discrete and continuous variables?
4.3 How does the normal distribution work exactly?
4.4 What is the difference between sample distributions and sampling distributions?
4.5 How do you create the sampling distribution for a sample mean?
4.6 What is the connection between the population, the sample data and the sampling distribution?

4.1 What are the basic rules of probability?

Randomization is important for collecting data: the possible observations are known, but it is not yet known which one will occur. What happens depends on probability. The probability of an outcome is the proportion of times that the outcome occurs in a long sequence of similar observations. The length of the sequence matters, because the longer the sequence, the closer the observed proportion comes to the probability, just as the sample proportion comes closer to the population proportion when the sample grows. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

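A probability is written as P(A), where P stands for probability and A is an outcome. If only two outcomes, A and B, are possible and they exclude each other, then the probability that B happens is P(B) = 1 - P(A). More generally, the probability that A does not happen is P(not A) = 1 - P(A).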

Imagine research about people's favorite colors, asking whether this is mostly red or blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

Next, imagine research that encompasses multiple questions. The research seeks to investigate how many people are married and have kids. Then you multiply the probability that someone is married (A) by the probability that someone has kids given that they are married (B given A). The formula for this is: P(A and B) = P(A) * P(B given A). Because the probability of B depends on whether A holds, P(B given A) is called a conditional probability.

Now, imagine researching multiple possibilities that are not connected. The probability that a random person likes to wear sweaters (A) and that another random person also likes to wear sweaters (B) is P(A and B) = P(A) * P(B). These are independent probabilities.
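The four rules above can be sketched in a few lines of Python. All numbers are hypothetical and chosen only for illustration; they do not come from the book.

```python
# Hypothetical probabilities, invented for this illustration.
p_red = 0.30    # P(A): favorite color is red
p_blue = 0.25   # P(B): favorite color is blue

# Complement rule: P(not A) = 1 - P(A)
p_not_red = 1 - p_red

# Addition rule for outcomes that exclude each other: P(A or B) = P(A) + P(B)
p_red_or_blue = p_red + p_blue

# Multiplication rule: P(A and B) = P(A) * P(B given A)
p_married = 0.50              # P(A), hypothetical
p_kids_given_married = 0.60   # P(B given A), hypothetical
p_married_and_kids = p_married * p_kids_given_married

# Independent outcomes: P(A and B) = P(A) * P(B)
p_sweater = 0.40              # hypothetical, same for both random persons
p_both_like_sweaters = p_sweater * p_sweater

print(round(p_not_red, 2), round(p_red_or_blue, 2))
print(round(p_married_and_kids, 2), round(p_both_like_sweaters, 2))
```

Note that the addition rule in this form only holds when the outcomes cannot occur together, and the last rule only when the two outcomes are independent.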

4.2 What is the difference in probability distributions for discrete and continuous variables?

A random variable is a variable whose outcome differs from observation to observation, depending on chance; mostly it is just referred to as a variable. While a discrete variable has a set of separate possible values, a continuous variable can assume any value within an interval. Because a probability distribution shows the probabilities for the values a variable can take, it takes a different form for discrete and continuous variables.

For a discrete variable a probability distribution gives the chances for each possible value. Every probability is a number between 0 and 1. The sum of all probabilities is 1. The probabilities are written P(y), where P is the probability that y has a certain value. In a formula: 0 ≤ P(y) ≤ 1, and ∑_{all y} P(y) = 1.

Because a continuous variable has infinitely many possible values, a probability distribution can't list them all. Instead, a probability distribution for a continuous variable assigns probabilities to intervals of values. The probability that a value falls within a certain interval is between 0 and 1. When an interval in a graph contains 20% of the data, then the probability that a value falls within that interval is 0.20.

Figure: probability distribution with intervals

Just like a population distribution, a probability distribution has parameters that describe it. The mean describes the center and the standard deviation describes the variability. The formula for calculating the mean of the probability distribution of a discrete variable is: µ = ∑ y P(y). This parameter is called the expected value of y, written as E(y).
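As a minimal sketch, the expected value of a discrete variable can be computed directly from its probability distribution. The distribution below (number of children per household) is hypothetical. The formula used for the standard deviation, the square root of ∑ (y - µ)² P(y), is the standard one for a discrete distribution and is not stated in the text above.

```python
import math

# Hypothetical probability distribution: value y -> P(y)
distribution = {0: 0.20, 1: 0.30, 2: 0.35, 3: 0.15}

# Every probability lies between 0 and 1, and together they sum to 1.
all_valid = all(0 <= p <= 1 for p in distribution.values())
total = sum(distribution.values())

# Expected value: mu = sum of y * P(y)
mu = sum(y * p for y, p in distribution.items())

# Standard deviation: square root of sum of (y - mu)^2 * P(y)
sigma = math.sqrt(sum((y - mu) ** 2 * p for y, p in distribution.items()))

print(all_valid, round(total, 10), round(mu, 2), round(sigma, 3))
```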

4.3 How does the normal distribution work exactly?

The normal distribution is useful because many variables have a similar distribution and the normal distribution can help to make statistical predictions. The normal distribution is symmetrical, shaped like a bell and it has a mean (µ) and a standard deviation (σ). The empirical rule is applicable to the normal distribution: about 68% of the observations fall within 1 standard deviation of the mean, about 95% within 2 standard deviations and about 99.7% (nearly all) within 3 standard deviations. The normal distribution looks like this:

Figure: normal probability distribution

The number of standard deviations that a value lies from the mean is indicated as z. Software such as R, SPSS, Stata and SAS can find probabilities for a normal distribution. These are usually cumulative probabilities: the probability that a value falls below a certain point. Because the curve is symmetrical, the probability of falling more than z standard deviations above the mean is the same as the probability of falling more than z standard deviations below it. The formula for z is: z = (y - µ) / σ.

The z-score is the number of standard deviations that a value y lies from the mean. A positive z-score means that y falls above the mean, a negative z-score means that it falls below.

Conversely, when the probability P is known, y can be found. Software helps to find the z-score that belongs to a given probability. The formula for y is then: y = µ + zσ.

A special kind of normal distribution is the standard normal distribution, which consists of z-scores. A variable y can be converted to z by subtracting the mean and then dividing it by the standard deviation. Then a distribution is created where µ = 0 and σ = 1.
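The source mentions R, SPSS, Stata and SAS for these calculations; as an alternative sketch, Python's standard library offers `statistics.NormalDist`, which provides both the cumulative probability (`cdf`) and its inverse (`inv_cdf`). The mean of 100 and standard deviation of 15 are hypothetical values.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0   # hypothetical population mean and standard deviation
y = 130.0                 # hypothetical observed value

# z-score: number of standard deviations that y lies from the mean
z = (y - mu) / sigma

# Standard normal distribution (mu = 0, sigma = 1)
std_normal = NormalDist()

# Cumulative probability of falling below y, and the tail above it
p_below = std_normal.cdf(z)
p_above = 1 - p_below

# Conversely: find the y that has 95% of the distribution below it
z_95 = std_normal.inv_cdf(0.95)
y_95 = mu + z_95 * sigma

# Empirical rule: share within 1, 2 and 3 standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print(z, round(p_below, 4), round(y_95, 2))
print({k: round(v, 4) for k, v in within.items()})
```

The `within` values come out at roughly 0.68, 0.95 and 0.997, which is the empirical rule.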

A bivariate normal distribution describes the joint probabilities of two variables. In case of two variables (y and x), there are two means (µ_y and µ_x) and two standard deviations (σ_y and σ_x). The covariance measures how y and x vary together:

Covariance (x, y) = E[(x – µ_x)(y – µ_y)]
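To make the expectation in the covariance formula concrete, here is a small sketch that applies it to a discrete joint distribution. This is a simplification: the bivariate normal distribution itself is continuous, and the joint probabilities below are hypothetical.

```python
# Hypothetical joint probability distribution: (x, y) -> P(x, y)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Means of x and y under the joint distribution
mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())

# Covariance(x, y) = E[(x - mu_x)(y - mu_y)]
covariance = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

print(round(mu_x, 2), round(mu_y, 2), round(covariance, 2))
```

A positive covariance, as here, means that x and y tend to be above or below their means together.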

4.4 What is the difference between sample distributions and sampling distributions?

A simulation can show how well the outcome of a study such as a poll is likely to represent the population. For this purpose software can generate random numbers that mimic the drawing of many random samples.

When the characteristics of the population are unknown, samples are used. Statistics from samples give information about the unknown parameters of the population. A sampling distribution shows the probabilities for the possible values of a sample statistic (this is not the same as a sample data distribution, which shows the observed values in one sample). For every statistic there is a sampling distribution, such as for the sample median, the sample mean etc. This kind of distribution shows the probabilities that certain outcomes of that statistic may happen.

A sampling distribution serves to estimate how close a statistic lies to its parameter. The sampling distribution of a statistic based on n observations is the relative frequency distribution of that statistic that results from repeatedly drawing samples of size n. A sampling distribution can be formed using repeated samples, but generally its form is known already. The sampling distribution makes it possible to find probabilities for the values of a statistic from a sample with n observations.
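The idea of forming a sampling distribution by repeated sampling can be sketched with a simulated poll. The population proportion of 0.5, the sample size of 100 and the number of repetitions are all assumptions made for this example, and `sample_proportion` is a hypothetical helper.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the simulation is reproducible

p_population = 0.5   # hypothetical population proportion
n = 100              # observations per sample
num_samples = 5000   # number of repeated samples

def sample_proportion(n, p):
    """Draw one random sample of size n and return its sample proportion."""
    return sum(random.random() < p for _ in range(n)) / n

# The statistic (sample proportion) for each repeated sample
proportions = [sample_proportion(n, p_population) for _ in range(num_samples)]

# Relative frequency distribution of the statistic: the simulated
# sampling distribution
counts = Counter(round(p, 2) for p in proportions)
rel_freq = {value: c / num_samples for value, c in sorted(counts.items())}

# Share of samples whose proportion lies within 0.10 of the parameter
share_close = sum(abs(p - p_population) <= 0.10 for p in proportions) / num_samples

print(round(share_close, 3))
```

Each entry in `rel_freq` estimates the probability that a sample of 100 produces that proportion, which is what the sampling distribution describes.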

4.5 How do you create the sampling distribution for a sample mean?

When the sample mean is known, its proximity to the population mean may still be a mystery. It's still unknown how close ȳ is to µ. However, the sampling distribution gives indications, for instance a high probability that ȳ is within ten units of µ. When very many samples are drawn, the mean of the sampling distribution of ȳ equals the mean of the population.

The variability of the sampling distribution of ȳ is described by the standard deviation of ȳ, called the standard error of ȳ. This is written as σ_ȳ. The formula for finding the standard error is:

σ_ȳ = σ / √n

where σ is the standard deviation of the population and n is the sample size.

The standard error indicates how much the sample mean varies from sample to sample. This says something about how precisely a sample mean estimates the population mean.

For a random sample of size n, the standard error of ȳ depends on the standard deviation of the population (σ). When n gets bigger, the standard error becomes smaller. This means that a bigger sample represents the population better. The difference between the sample mean and the population mean is called the sampling error.

The standard error and the sampling error are two different things. The sampling error is how far the mean of one particular sample lies from the population mean. The standard error measures how much sample means vary from sample to sample.
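The following sketch computes the standard error with the usual formula σ_ȳ = σ / √n for several sample sizes and compares it with a simulation. The population values (µ = 50, σ = 10) and the normal shape of the population are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the simulation is reproducible

mu, sigma = 50.0, 10.0   # hypothetical population mean and standard deviation

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# A bigger n gives a smaller standard error: 2.0, 1.0 and 0.5
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))

# Simulation: draw many samples of size 25 and look at how the means vary
n = 25
means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(4000)
]
simulated_se = statistics.stdev(means)

# Sampling error of one particular sample (here the first one)
sampling_error = means[0] - mu

print(round(simulated_se, 2), round(sampling_error, 2))
```

The simulated standard deviation of the sample means should come out close to the theoretical value of 2.0, while the sampling error differs for each individual sample.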

No matter how the population distribution is shaped, the sampling distribution of ȳ is approximately a normal distribution when the random sample is large enough. This is called the Central Limit Theorem. Even if the population distribution has only a few discrete values, the sampling distribution of ȳ is approximately normal. However, when the population distribution is very skewed, the sample needs to be big for the sampling distribution to have the normal shape. For small samples the Central Limit Theorem can't necessarily be used.

Just like the standard error, the Central Limit Theorem is useful for finding information about the sampling distribution and the sample mean ȳ. Because the sampling distribution is approximately normal, the Empirical Rule can be applied.
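A small simulation can illustrate the Central Limit Theorem. The sketch below assumes a strongly right-skewed population (an exponential distribution, chosen here only as an example) and compares the skewness of the sample means for a small and a larger sample size. The helper names `sample_means` and `skewness` are hypothetical.

```python
import random
import statistics

random.seed(3)  # fixed seed so the simulation is reproducible

def sample_means(n, num_samples=3000):
    """Means of repeated samples of size n from a skewed (exponential) population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Average cubed z-score; close to 0 for a symmetrical distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

skew_small = skewness(sample_means(2))     # small samples: still clearly skewed
skew_large = skewness(sample_means(100))   # larger samples: nearly symmetrical

print(round(skew_small, 2), round(skew_large, 2))
```

With samples of size 2 the sampling distribution keeps much of the skew of the population, whereas with samples of size 100 it is already close to the symmetrical bell shape.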

4.6 What is the connection between the population, the sample data and the sampling distribution?

To understand sampling, distinguishing between three distributions is important:

The population distribution describes the entirety from which the sample is drawn. The parameters µ and σ denote the population mean and the population standard deviation.
The sample data distribution portrays the variability of the observations made in the sample. The sample mean ȳ and the sample standard deviation s describe this distribution.
The sampling distribution shows the probabilities that a statistic from the sample, such as the sample mean, has certain values. It tells how much samples can differ.

The Central Limit Theorem says that, for a sufficiently large random sample, the sampling distribution of the sample mean is shaped approximately like a normal distribution. Information can be deduced just from this shape. The possibility to retrieve information from the shape is the reason that the normal distribution is very important to statistics.

This content is related to:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Author: Annemarie JoHo
