Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).
Two kinds of parameter estimates exist:
A point estimate is a single number that is the best guess for the parameter.
An interval estimate is an interval of numbers around a point estimate, which is believed to contain the population parameter.
There is a difference between the estimator (the method used to produce estimates) and the point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.
A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.
An estimator is unbiased when its sampling distribution is centered around the parameter. This holds for the sample mean: the sampling distribution of ȳ (sample mean) is centered around µ (population mean), so on average ȳ equals µ. ȳ is therefore regarded as a good estimator of µ.
A biased estimator tends to underestimate or overestimate the parameter. For example, the sample range usually underestimates the population range, because the extremes in a sample can't be more extreme than those of the population, only less. The spread observed in the sample therefore tends to underestimate the spread in the population.
An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Imagine a normal distribution: the standard error of the sample median is about 25% bigger than the standard error of the sample mean, so the sample mean tends to be closer to the population mean than the sample median is. The sample mean is then the more efficient estimator.
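The efficiency comparison above can be checked with a small simulation. This is only a sketch: the sample size, the number of repetitions, the seed and the function name are all assumptions chosen for illustration, and the standard errors are approximated by the standard deviation of the simulated estimates.

```python
import random
import statistics

def simulated_standard_errors(n=101, repetitions=5000, seed=1):
    """Approximate the standard errors of the sample mean and sample median
    by drawing many samples from a standard normal population."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means, medians = [], []
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # The spread of the estimates across samples approximates the standard error.
    return statistics.stdev(means), statistics.stdev(medians)

se_mean, se_median = simulated_standard_errors()
print("se of mean:  ", round(se_mean, 4))
print("se of median:", round(se_median, 4))
print("ratio:       ", round(se_median / se_mean, 2))
```

For a normal population the ratio should come out at roughly 1.25, in line with the 25% figure mentioned above.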
A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).
Usually the sample mean serves as an estimator for the population mean, the sample standard deviation as an estimator for the population standard deviation, etc. An estimate is indicated by a hat on a symbol; for instance, µ̂ (mu-hat) means an estimate of the population mean µ.
A confidence interval is an interval estimate for a parameter: a range of values within which the parameter is believed to fall. To find this interval, look at the sampling distribution of the point estimate, which is approximately normal. With probability 0.95, the point estimate falls within about two (more precisely 1.96) standard errors of the parameter. To calculate the interval, multiply the standard error by the z-score. Add this outcome to the point estimate and subtract it from the point estimate, so you get two numbers that together form the confidence interval. With 95% confidence, the population parameter lies between these two numbers. The z-score multiplied by the standard error is called the margin of error.
So a confidence interval is: point estimate ± margin of error. The confidence level is the probability that this method produces an interval that contains the parameter. This is a number close to 1, like 0.95 or 0.99.
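What a confidence level means in practice can be illustrated with a simulation. The sketch below is not from the book: it assumes a hypothetical population proportion of 0.5, samples of size 400, and the usual standard error of a sample proportion, sqrt(p(1 - p)/n). It counts how often the interval point estimate ± 1.96 standard errors contains the true value.

```python
import math
import random

def coverage(true_pi=0.5, n=400, repetitions=2000, z=1.96, seed=1):
    """Fraction of simulated confidence intervals that contain true_pi."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(repetitions):
        # Draw one sample of n yes/no observations.
        successes = sum(rng.random() < true_pi for _ in range(n))
        p_hat = successes / n                     # point estimate
        se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
        margin = z * se                           # margin of error
        if p_hat - margin <= true_pi <= p_hat + margin:
            hits += 1
    return hits / repetitions

print(coverage())
```

The printed fraction should be close to 0.95: in the long run about 95% of the intervals built this way contain the parameter, while any single interval either does or does not.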
Nominal and ordinal variables create categorical data (for instance 'agree' and 'not agree'). For this kind of data, means are useless. Instead, proportions or percentages are used. A proportion is between 0 and 1, a percentage between 0 and 100.
The unknown population proportion is written π. The sample proportion is the point estimate of the population proportion, meaning the sample is used to estimate the population proportion. The sample proportion is indicated by the symbol π̂ (pi-hat).
The sample proportion is a kind of sample mean (the mean of observations coded 1 when they fall in the category and 0 otherwise), so by the Central Limit Theorem its sampling distribution is approximately normal for large samples. Because the distribution is normal, about 95% of the sample proportions fall within two standard errors of the population proportion. This is the basis for the confidence interval. Calculating a confidence interval requires the standard error, but because the exact standard error depends on the unknown population proportion, it is estimated from the sample. This estimate is indicated as se. The formula for the estimated standard error is:
se = √(π̂(1 – π̂)/n)
The standard error needs to be multiplied by the z-score. The z-score is chosen so that the probability of falling within z standard errors of the mean of a normal distribution equals the confidence level. For confidence levels of 95% and 99%, z equals 1.96 and 2.58. A 95% confidence interval for the proportion π is:
π̂ ± 1.96(se)
The general formula for a confidence interval is:
π̂ ± z(se)
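As a minimal sketch of this calculation, the function below computes a confidence interval for a proportion from a count and a sample size. The function name and the sample size of 1000 are assumptions for illustration; only the sample proportion 0.73 echoes the earlier example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a population proportion: p_hat ± z * se."""
    p_hat = successes / n                      # sample proportion (point estimate)
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    margin = z * se                            # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical data: 730 of 1000 respondents fall in the category.
low, high = proportion_ci(730, 1000)
print(round(low, 2), round(high, 2))
```

Passing z=2.58 instead of the default gives the wider 99% interval.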
The endpoints of a confidence interval are usually rounded to two decimal places.
A bigger sample generates a smaller standard error and a more precise (narrower) confidence interval. Specifically, the sample size needs to be multiplied by four to halve the margin of error, because the standard error is proportional to 1/√n.
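The effect of sample size on precision can be verified numerically. In this sketch the sample proportion of 0.5 and the sample sizes 400 and 1600 are hypothetical values picked for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error of a confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

small_sample = margin_of_error(0.5, 400)    # n = 400
large_sample = margin_of_error(0.5, 1600)   # four times as many observations
print(round(small_sample, 4), round(large_sample, 4))
print(round(small_sample / large_sample, 2))
```

The ratio of the two margins is 2: four times the observations buys an interval that is half as wide.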
The error probability is the probability that the method yields a confidence interval that does not contain the parameter. It is indicated as α (the Greek letter alpha) and is calculated as 1 – confidence level. If the confidence level is 0.98, then the error probability is 0.02.
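The link between the error probability and the z-score can be sketched with Python's standard library. The function name is an assumption; the calculation uses the fact that the error probability α is split equally over the two tails of the normal distribution.

```python
from statistics import NormalDist

def z_for_confidence(confidence_level):
    """z-score with probability alpha/2 in each tail of the standard normal."""
    alpha = 1 - confidence_level               # error probability
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.95, 0.98, 0.99):
    print(level, round(z_for_confidence(level), 2))
```

For confidence levels 0.95 and 0.99 this reproduces the z-scores 1.96 and 2.58 used in the text.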
When the sample is too small, the confidence interval formula is unreliable, because the sampling distribution of the sample proportion may not be close to normal. As a rule, at least 15 observations should fall within the category and at least 15 outside it.
Finding the confidence interval for a mean goes roughly the same way as finding it for a proportion. For a mean, too, the confidence interval is point estimate ± margin of error. In this case the margin of error consists of a t-score (instead of a z-score) multiplied by the standard error. The t-score comes from the t-distribution, a bell-shaped distribution that is used instead of the normal distribution because the standard error is estimated from the sample. It can be used for all sample sizes, even tiny ones. The standard error is found by dividing the sample standard deviation s by the square root of the sample size n. In this case the point estimate is the sample mean ȳ.
The formula for a 95% confidence interval for a population mean µ using the t-distribution is:
ȳ ± t0.025(se), where se = s/√n and df = n – 1
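A minimal sketch of this formula in code follows. The data values and the function name are hypothetical. The t-score is passed in by hand, since Python's standard library has no t-distribution; 2.365 is the table value of t0.025 for df = 7.

```python
import math
import statistics

def mean_ci(data, t_score):
    """Confidence interval for a population mean: y_bar ± t * se."""
    n = len(data)
    y_bar = statistics.mean(data)     # point estimate
    s = statistics.stdev(data)        # sample standard deviation (divides by n - 1)
    se = s / math.sqrt(n)             # estimated standard error
    margin = t_score * se             # margin of error
    return y_bar - margin, y_bar + margin

# Hypothetical sample of n = 8 observations, so df = 7.
data = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_ci(data, t_score=2.365)
print(round(low, 2), round(high, 2))
```

With a different sample size, the t-score has to be looked up again for the new df = n – 1.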
With t-scores the confidence interval is a little wider than with z-scores. The t-distribution looks like the standard normal distribution, but it is lower in the middle and its tails are a bit thicker. It is symmetric around its mean of 0.
The standard deviation of the t-distribution depends on the degrees of freedom (df) and is a bit bigger than 1. The formula for the degrees of freedom is: df = n – 1.
The bigger the degrees of freedom, the more the t-distribution looks like the standard normal distribution: it becomes less spread out and more peaked. For df > 30 they are practically identical.
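How the spread of the t-distribution shrinks toward 1 can be shown with a short calculation. The sketch relies on a standard result that is not stated in the summary: for df > 2, the standard deviation of the t-distribution is √(df/(df – 2)). The function name is an assumption.

```python
import math

def t_standard_deviation(df):
    """Standard deviation of the t-distribution, defined for df > 2."""
    if df <= 2:
        raise ValueError("the standard deviation is only finite for df > 2")
    return math.sqrt(df / (df - 2))

for df in (3, 5, 10, 30, 100):
    print(df, round(t_standard_deviation(df), 3))
```

The values are always above 1 and move toward 1 as df grows, which matches the t-distribution approaching the standard normal distribution.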
The t-scores can be found on the internet or in books about statistics. For instance, a 95% confidence interval uses the t-score t0.025, the t-score with a probability of 0.025 in the right tail.
Robust means that a statistical method still performs well even when a certain assumption is violated. Even when the population distribution isn't normal, the t-distribution can give a valid confidence interval for a mean. However, for extreme outliers or very skewed distributions, this method doesn't work properly.
The standard normal distribution is the t-distribution with infinite degrees of freedom.
The t-distribution was discovered by Gosset while doing research for a brewery. He published his articles under the pseudonym Student, which is why the t-distribution is sometimes called Student's t.
For determining sample size, the desired margin of error and the desired confidence level need to be decided upon. The desired margin of error is indicated as M.
The formula for finding the right sample size to estimate a population proportion is:
n = π(1 – π)(z/M)²
The z-score corresponds to the chosen confidence level, like 1.96 for 95%. The confidence level is the probability that the margin of error isn't bigger than M. The population proportion π can be guessed, or it can safely be set to 0.50, because π(1 – π) is largest at 0.50 and therefore gives the largest required sample size.
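As a sketch of this calculation, the function below applies the standard formula n = π(1 – π)(z/M)² and rounds up to a whole number of observations. The function name, the margin of error of 0.04 and the alternative guess of 0.30 are hypothetical choices for illustration.

```python
import math

def sample_size_for_proportion(margin, z=1.96, pi_guess=0.5):
    """Sample size needed to estimate a proportion within the given margin."""
    n = pi_guess * (1 - pi_guess) * (z / margin) ** 2
    return math.ceil(n)  # round up: a fraction of an observation is not possible

print(sample_size_for_proportion(0.04))                 # safe choice pi = 0.50
print(sample_size_for_proportion(0.04, pi_guess=0.30))  # with an educated guess
```

The safe choice of 0.50 never gives a smaller sample size than any other guess, which is why it can be used when nothing is known about π.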
The formula for finding the right sample size to estimate a population mean is:
n = σ²(z/M)²
Here too the z-score belongs to the chosen confidence level, like z = 1.96 for 0.95. The standard deviation of the population σ needs to be guessed.
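The same calculation for a mean can be sketched as follows, using the standard formula n = σ²(z/M)². The function name and the guessed values (σ = 3, M = 1) are hypothetical.

```python
import math

def sample_size_for_mean(sigma_guess, margin, z=1.96):
    """Sample size needed to estimate a mean within the given margin."""
    n = sigma_guess ** 2 * (z / margin) ** 2
    return math.ceil(n)  # round up to a whole number of observations

print(sample_size_for_mean(sigma_guess=3, margin=1))
print(sample_size_for_mean(sigma_guess=6, margin=1))  # more variability
```

Doubling the guessed standard deviation roughly quadruples the required sample size.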
The desired sample size depends on the margin of error and on the confidence level, but also on variability. Data with high variability requires a bigger sample size.
Other factors also influence the choice of sample size. The more complex the analysis and the more variables are relevant, the bigger the sample needs to be. Time and money play a role as well. If a small sample is unavoidable, two artificial observations are added to each category, so that the confidence interval formula for a proportion remains usable.
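The small-sample adjustment can be sketched in code. This assumes a variable with two categories, so that two artificial observations per category means two extra successes and two extra failures. The function name, the example data and the clipping of the interval to the range 0 to 1 are assumptions, not taken from the book.

```python
import math

def adjusted_proportion_ci(successes, n, z=1.96):
    """Confidence interval for a proportion after adding 2 artificial
    observations to each of the two categories."""
    adj_successes = successes + 2    # two artificial observations in the category
    adj_n = n + 4                    # two in the category, two outside it
    p_adj = adj_successes / adj_n
    se = math.sqrt(p_adj * (1 - p_adj) / adj_n)
    # Clip the interval so it stays within the possible range of a proportion.
    low = max(0.0, p_adj - z * se)
    high = min(1.0, p_adj + z * se)
    return low, high

# Hypothetical small sample: 0 of 10 observations fall in the category.
low, high = adjusted_proportion_ci(0, 10)
print(round(low, 3), round(high, 3))
```

Without the adjustment the sample proportion would be 0 and the standard error would be 0 as well, so the ordinary formula would give the uninformative interval from 0 to 0.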
Apart from means and proportions, other statistics can also describe data. To make point estimates for these other statistics as well, R.A. Fisher developed a method called maximum likelihood. This method chooses as the estimate the parameter value for which the likelihood is maximal. The likelihood is the probability of the observed sample outcome, viewed as a function of the parameter value, and it shows how plausible each parameter value is. It can be drawn as a curve, so it is immediately clear where the likelihood reaches its highest point.
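The idea of maximum likelihood can be illustrated for a proportion. The sketch below assumes a binomial model and hypothetical data (6 successes in 10 observations) and simply tries a grid of candidate values for the parameter. The function names and the grid search are illustrative choices, not the method used by statistical software.

```python
import math

def binomial_likelihood(pi, successes, n):
    """Probability of the observed number of successes if the parameter is pi."""
    return math.comb(n, successes) * pi ** successes * (1 - pi) ** (n - successes)

def ml_estimate(successes, n, steps=100):
    """Candidate value of pi on a grid from 0 to 1 with the highest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda pi: binomial_likelihood(pi, successes, n))

# Hypothetical data: 6 successes in 10 observations.
print(ml_estimate(6, 10))
```

The likelihood peaks at the sample proportion 6/10, so for a proportion the maximum likelihood estimate coincides with the familiar point estimate.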
This method has three advantages, especially for big samples: 1) efficiency: other estimators don't have smaller standard errors and don't tend to fall closer to the parameter, 2) little or no bias, and 3) a sampling distribution that is usually approximately normal.
Fisher showed that for a normal distribution the mean is a more efficient estimator than the median. Only in exceptional cases is the median better, like for very skewed data.
When even the shape of a population distribution is unknown, the bootstrap method can help. Software then treats the sample as if it were the population distribution and draws a new 'sample' of the same size from it, with replacement. This process is repeated many times. In this way, the bootstrap method can find the standard error and the confidence interval.
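A minimal sketch of the bootstrap for the median is given below. The data, the number of repetitions, the seed and the function name are hypothetical, and the confidence interval is the simple percentile variant (the middle 95% of the bootstrap medians), which is only one of several ways to build a bootstrap interval.

```python
import random
import statistics

def bootstrap_median(sample, repetitions=2000, seed=1):
    """Bootstrap standard error and 95% percentile interval for the median."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(sample)
    # Resample n observations with replacement and record each median.
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(repetitions)
    )
    se = statistics.stdev(medians)               # spread of the bootstrap medians
    low = medians[int(0.025 * repetitions)]      # 2.5th percentile
    high = medians[int(0.975 * repetitions) - 1] # 97.5th percentile
    return se, (low, high)

# Hypothetical skewed sample.
data = [1, 2, 2, 3, 3, 3, 4, 5, 8, 12, 20]
se, (low, high) = bootstrap_median(data)
print(round(se, 2), low, high)
```

The same function works for other statistics by replacing statistics.median with another summary function.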