Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Randomization is important for collecting data: the possible observations are known in advance, but which possibility will occur is unknown and depends on probability. The probability of an outcome is the proportion of times that the outcome occurs in a long sequence of similar observations. The length of the sequence matters: the longer the sequence, the more accurate the probability, and the closer the sample proportion gets to the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities; most of statistics, however, is about probabilities in the long-run frequency sense described here.
A probability is written as P(A), where P denotes probability and A an outcome. If only two outcomes A and B are possible and they exclude each other, then the probability that B happens is 1 - P(A).
Imagine research into people's favorite colors, say whether it is mostly red or blue. Again the assumption is that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is then P(A or B) = P(A) + P(B).
Next, imagine research that involves multiple questions, for instance how many married people have kids. The probability that someone is married (A) and has kids (B) is found by multiplying: P(A and B) = P(A) × P(B | A), where P(B | A) is the probability of B given that A holds. Because B depends on A, P(B | A) is called a conditional probability.
Now imagine researching multiple possibilities that are not connected. The probability that one random person likes to wear sweaters (A) and that another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
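The probability rules above can be sketched in a few lines of code. All the numbers below are invented purely for illustration, not taken from the text.

```python
# Complement rule: if A and B are the only two outcomes and they
# exclude each other, P(B) = 1 - P(A).
p_a = 0.7
p_b = 1 - p_a                       # 0.3

# Addition rule for mutually exclusive outcomes: P(A or B) = P(A) + P(B)
p_red, p_blue = 0.4, 0.35
p_red_or_blue = p_red + p_blue      # 0.75

# Multiplication rule with dependence: P(A and B) = P(A) * P(B | A)
p_married = 0.5
p_kids_given_married = 0.6
p_married_and_kids = p_married * p_kids_given_married   # 0.3

# Independent events: P(A and B) = P(A) * P(B)
p_sweater = 0.2
p_both_like_sweaters = p_sweater * p_sweater            # 0.04

print(p_b, p_red_or_blue, p_married_and_kids, p_both_like_sweaters)
```

Note that the addition rule requires mutually exclusive outcomes and the plain multiplication rule requires independence; mixing up those conditions is a common source of errors.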
A random variable is a variable whose outcome differs from observation to observation; usually it is simply called a variable. A discrete variable has a fixed set of possible values, while a continuous variable can take any value in an interval. A probability distribution shows the probability of each value a variable can take, and this works differently for discrete and continuous variables.
For a discrete variable a probability distribution gives the probability of each possible value. Every probability is a number between 0 and 1, and the sum of all probabilities is 1. The probabilities are written P(y), where P(y) is the probability that the variable takes the value y. In formulas: 0 ≤ P(y) ≤ 1, and ∑all y P(y) = 1.
Because a continuous variable has infinitely many possible values, a probability distribution cannot list them all. Instead, a probability distribution for a continuous variable assigns probabilities to intervals of values. The probability that a value falls within a given interval is between 0 and 1. When an interval in a graph contains 20% of the data, the probability that a value falls within that interval is 0.20.
Just like a population distribution, a probability distribution has parameters that describe the data. The mean describes the center and the standard deviation the variability. The formula for the mean of the probability distribution of a discrete variable is: µ = ∑ y P(y). This parameter is called the expected value of y, written E(y).
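A minimal sketch of these two rules for a hypothetical discrete variable y (the values and probabilities below are invented for illustration):

```python
# P(y) for each possible value y of a made-up discrete variable.
dist = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Each probability must lie between 0 and 1 ...
assert all(0 <= p <= 1 for p in dist.values())
# ... and together the probabilities must sum to 1.
assert abs(sum(dist.values()) - 1) < 1e-9

# Expected value: mu = sum over all y of y * P(y)
mu = sum(y * p for y, p in dist.items())
print(mu)   # 1.7
```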
The normal distribution is useful because many variables are distributed approximately this way, and it helps with making statistical predictions. The normal distribution is symmetrical, shaped like a bell, and described by a mean (µ) and a standard deviation (σ). The empirical rule applies to the normal distribution: about 68% of the values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
The number of standard deviations from the mean is indicated by z. Software such as R, SPSS, Stata and SAS can find probabilities for a normal distribution. Because the curve is symmetrical, the tail probability beyond z on the right equals the tail probability beyond -z on the left. The formula for z is: z = (y - µ) / σ.
The z-score is the number of standard deviations that a variable y is distanced from the mean. A positive z-score means that y falls above the mean, a negative score means that it falls below.
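The z-score and the associated normal probabilities can be computed without statistical software, because the standard normal cumulative probability can be written in terms of the error function. The mean and standard deviation below (µ = 100, σ = 15, IQ-like scores) are assumed example values.

```python
import math

def z_score(y, mu, sigma):
    """Number of standard deviations that y lies from the mean."""
    return (y - mu) / sigma

def normal_cdf(z):
    # Cumulative probability P(Z <= z) for the standard normal,
    # via the error function (standard library only).
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Assumed example: a score of 130 when mu = 100 and sigma = 15.
z = z_score(130, 100, 15)           # 2.0 -> falls above the mean

# Empirical rule check: about 68% of values fall within 1 sd of the mean.
within_one_sd = normal_cdf(1) - normal_cdf(-1)
print(z, round(within_one_sd, 4))   # 2.0 0.6827
```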
Conversely, when a probability P is known, the corresponding y can be found. Software helps find the z-score that matches a given probability in the distribution. The formula for y is then y = µ + zσ.
A special kind of normal distribution is the standard normal distribution, which consists of z-scores. A variable y can be converted to z by subtracting the mean and dividing the result by the standard deviation. This yields a distribution with µ = 0 and σ = 1.
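Standardizing a small (invented) data set shows that the resulting z-scores indeed have mean 0 and standard deviation 1, and that y = µ + zσ recovers the original values:

```python
from statistics import mean, pstdev

# Illustrative raw scores (assumed data).
data = [4, 8, 6, 5, 7]
mu = mean(data)         # 6.0
sigma = pstdev(data)    # population standard deviation

# Convert each raw score to a z-score: z = (y - mu) / sigma.
z_scores = [(y - mu) / sigma for y in data]

# The standardized values have mean 0 and standard deviation 1.
print(round(mean(z_scores), 10), round(pstdev(z_scores), 10))

# Going back the other way: y = mu + z * sigma recovers the raw score.
y_back = mu + z_scores[0] * sigma   # 4.0
```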
A bivariate normal distribution is used for bivariate probabilities. In case of two variables (y and x), there are two means (µy and µx) and two standard deviations (σy and σx). The covariance is the way that y and x vary together:
Covariance (x, y) = E[(x – µx)(y – µy)]
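As a sketch, the covariance formula above can be applied to paired data by replacing the expectation with an average over (invented) observations:

```python
from statistics import mean

# Illustrative paired data (assumed values).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mu_x, mu_y = mean(x), mean(y)

# Cov(x, y) = E[(x - mu_x)(y - mu_y)], estimated here by averaging
# the products of the deviations over the observed pairs.
cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / len(x)
print(cov)   # positive: x and y tend to move in the same direction
```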
A simulation can tell whether an outcome of a test such as a poll is a good representation of the population. Software can generate random numbers.
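A simulation of repeated polls can be sketched with random numbers. The population proportion (60%), poll size and number of polls below are all assumed illustrative values.

```python
import random

random.seed(42)                     # fixed seed for reproducibility
TRUE_P, N, POLLS = 0.60, 100, 1000  # assumed population proportion, poll size

# Draw many polls of n = 100 and record each sample proportion.
proportions = [
    sum(random.random() < TRUE_P for _ in range(N)) / N
    for _ in range(POLLS)
]

# The sample proportions scatter around the population proportion.
avg = sum(proportions) / POLLS
print(round(avg, 3))                # close to 0.60
```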
When the characteristics of the population are unknown, samples are used. Sample statistics give information about the corresponding population parameters. A sampling distribution shows the probabilities for the possible values of a sample statistic (this is not the same as the sample data distribution, which shows the observed data). Every statistic has its own sampling distribution, such as the sample median, the sample mean, etc. This kind of distribution shows how likely the various outcomes of that statistic are.
A sampling distribution serves to estimate how close a statistic lies to the parameter it estimates. The sampling distribution of a statistic based on n observations is the relative frequency distribution of that statistic over repeated samples of size n. It could be formed by actually drawing repeated samples, but generally its form is already known. The sampling distribution makes it possible to find probabilities for the values of a statistic in a sample of n observations.
When the sample mean is known, its closeness to the population mean is still unknown: whether ȳ = µ cannot be checked directly. However, the sampling distribution gives indications, for instance a high probability that ȳ falls within a certain distance of µ. Averaged over many samples, the mean of the sampling distribution of ȳ equals the population mean.
The variability of the sampling distribution of ȳ is described by the standard deviation of ȳ, called the standard error of ȳ. This is written as σȳ. The formula for the standard error is:

σȳ = σ / √n
The standard error indicates how much the sample mean varies from sample to sample, which says something about how informative a single sample is.
For a random sample of size n, the standard error of ȳ depends on the standard deviation of the population (σ). When n gets bigger, the standard error becomes smaller, meaning that a bigger sample represents the population better. The difference between the sample mean and the population mean is called the sampling error.
The standard error and the sampling error are two different things. The sampling error indicates that the sample and the population are different in terms of the mean. The standard error measures how much samples differ from each other in terms of the mean.
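The shrinking of the standard error with sample size follows directly from σȳ = σ / √n. The population standard deviation and sample sizes below are assumed example values.

```python
import math

sigma = 15                  # assumed population standard deviation

# Standard error of the sample mean: sigma_ybar = sigma / sqrt(n).
for n in (25, 100, 400):
    se = sigma / math.sqrt(n)
    print(n, se)            # 3.0, 1.5, 0.75: the standard error shrinks

# Quadrupling n halves the standard error.
assert sigma / math.sqrt(100) == (sigma / math.sqrt(25)) / 2
```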
For random samples with a sufficiently large n, the sampling distribution of ȳ is approximately a normal distribution, no matter how the population distribution is shaped. This is called the Central Limit Theorem. Even if the population distribution is highly discrete, the sampling distribution is approximately normal. However, when the population distribution is very skewed, the sample needs to be large for the sampling distribution to take the normal shape. For small samples the Central Limit Theorem cannot necessarily be relied on.
Just like the standard error, the Central Limit Theorem is useful for finding information about the sampling distribution and the sample mean ȳ. Because it has a normal distribution, the Empirical Rule can be applied.
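The Central Limit Theorem can be illustrated by simulation: draw repeated samples from a heavily skewed population (here an exponential distribution with mean 1) and inspect the sample means. The sample size and number of samples are assumed illustrative values.

```python
import random
from statistics import mean, pstdev

random.seed(1)              # fixed seed for reproducibility
N, SAMPLES = 50, 2000       # assumed sample size and number of samples

# Sample means from a skewed population (exponential, mean 1, sd 1).
sample_means = [
    mean(random.expovariate(1.0) for _ in range(N))
    for _ in range(SAMPLES)
]

# The sample means centre on the population mean (1.0), with spread
# close to the theoretical standard error sigma / sqrt(n) = 1 / sqrt(50).
print(round(mean(sample_means), 2), round(pstdev(sample_means), 2))

# Empirical rule check: roughly 95% of sample means fall within
# 2 standard errors of the population mean.
se = 1 / N ** 0.5
share = sum(abs(m - 1.0) < 2 * se for m in sample_means) / SAMPLES
print(round(share, 2))
```

Even though a single exponential observation is far from normally distributed, the distribution of the sample means is already close to the bell shape at n = 50.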
To understand sampling, distinguishing between three distributions is important:
The population distribution describes the entirety from which the sample is drawn. The parameters µ and σ denote the population mean and the population standard deviation.
The sample data distribution portrays the variability of the observations made in the sample. The sample mean ȳ and the sample standard deviation s describe the curve.
The sampling distribution shows the probabilities that a statistic from the sample, such as the sample mean, has certain values. It tells how much samples can differ.
The Central Limit Theorem says that the sampling distribution is shaped approximately like a normal distribution. Information can be deduced just from this shape, which is why the normal distribution is so important to statistics.