Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).
Two kinds of parameter estimates exist:
A point estimate is a single number that is the best prediction of the parameter value.
An interval estimate is an interval around the point estimate that is believed to contain the population parameter.
There is a difference between the estimator (the method by which estimates are made) and the point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.
A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.
An estimator is unbiased when its sampling distribution is centered around the parameter. The sample mean is an example: the mean of its sampling distribution equals the population mean, so ȳ (sample mean) equals µ (population mean) on average. ȳ is therefore regarded as a good estimator of µ.
A biased estimator systematically misses the parameter. An example is the sample range: the minimum and maximum of a sample can't be more extreme than those of the population, only less. The sample range is therefore smaller on average and tends to underestimate the population range.
An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Imagine a normal population distribution: the standard error of the sample median is about 25% bigger than the standard error of the sample mean, so the sample mean tends to fall closer to the population mean than the sample median does. The sample mean is therefore the more efficient estimator.
A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).
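To make the efficiency comparison concrete, here is a minimal simulation sketch (assuming Python with NumPy; the sample size, seed, and repetition count are arbitrary choices, not from the book). It draws many samples from a normal population and compares the spread of the sample mean and the sample median across samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 10_000  # arbitrary sample size and number of repetitions

# Draw many samples from a standard normal population
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

# The standard error of each estimator is the spread of its sampling distribution
se_mean = samples.mean(axis=1).std(ddof=1)
se_median = np.median(samples, axis=1).std(ddof=1)

print(f"se(mean)   = {se_mean:.4f}")    # about 0.10 (= 1/sqrt(100))
print(f"se(median) = {se_median:.4f}")  # about 0.125, roughly 25% larger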
Usually the sample mean serves as an estimator for the population mean, the sample standard deviation as an estimator for the population standard deviation, and so on. This is indicated by a hat on a symbol: for instance, µ̂ (mu-hat) denotes an estimate of the population mean µ.
A confidence interval is an interval estimate for a parameter: an interval of plausible values for the parameter. To find this interval, look at the sampling distribution, which is approximately normal. For a 95% confidence interval, the point estimate falls within about two standard errors of the parameter. To calculate the interval, multiply the standard error by the appropriate z-score, then add the result to and subtract it from the point estimate; the two resulting numbers form the confidence interval. With 95% confidence, the population parameter lies between these two numbers. The z-score multiplied by the standard error is called the margin of error.
So a confidence interval is: point estimate ± margin of error. The confidence level is the probability that the method produces an interval that really contains the parameter. This is a number close to 1, like 0.95 or 0.99.
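As a quick worked example (the numbers are illustrative, not from the book): with a point estimate of 0.73 and a standard error of 0.02, the 95% margin of error is 1.96 × 0.02 ≈ 0.04, giving the confidence interval 0.73 ± 0.04, i.e. (0.69, 0.77).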
Nominal and ordinal variables create categorical data (for instance 'agree' and 'disagree'). For this kind of data, means are meaningless; instead, proportions or percentages are used. A proportion falls between 0 and 1, a percentage between 0 and 100.
The unknown population proportion is written π. The sample proportion is the point estimate of the population proportion, meaning the sample is used to estimate the population proportion. The sample proportion is indicated by the symbol π̂ (pi-hat).
A sample proportion is a statistic computed from the sample, so by the Central Limit Theorem its sampling distribution is approximately normal for large samples. Because the distribution is approximately normal, 95% of sample proportions fall within two standard errors of the mean; this is the basis of the confidence interval. Calculating a confidence interval requires the standard error, but because the population proportion is unknown, the standard error is estimated from the sample. This estimate is indicated as se. The formula for the estimated standard error is:
se = √[π̂(1 − π̂)/n]
The standard error is multiplied by the z-score. For a normal distribution, the probability of falling within z standard errors of the mean equals the confidence level. For confidence levels of 95% and 99%, z equals 1.96 and 2.58. A 95% confidence interval for the proportion π is:
π̂ ± 1.96(se)
The general formula for a confidence interval is:
π̂ ± z(se)
Confidence intervals are usually reported rounded to two decimal places.
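A minimal sketch of this calculation in Python (the counts, 730 successes out of 1000, are made-up illustrative numbers, not data from the book):

```python
import math
from scipy.stats import norm

successes, n = 730, 1000  # made-up illustrative counts
p_hat = successes / n     # sample proportion (point estimate)

se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
z = norm.ppf(0.975)                      # 1.96 for a 95% confidence level

# Rule of thumb: at least 15 observations in each category (730 and 270 qualify)
margin = z * se
print(f"95% CI: ({p_hat - margin:.2f}, {p_hat + margin:.2f})")  # (0.70, 0.76)
```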
A bigger sample generates a smaller standard error and a more precise confidence interval. Specifically, the sample size must be quadrupled to cut the margin of error in half.
The error probability is the chance that the parameter falls outside the confidence interval. It is indicated as α (the Greek letter alpha) and is calculated as 1 − confidence level. If the confidence level is 0.98, the error probability is 0.02.
When the sample is too small, the confidence interval is unreliable because the actual error probability can be bigger than intended. As a rule, at least 15 observations should fall within the category and at least 15 outside it.
Finding the confidence interval for a mean works much like finding it for a proportion: again the confidence interval is point estimate ± margin of error. In this case the margin of error consists of a t-score (instead of a z-score) multiplied by the standard error. The t-score is taken from the t-distribution, a sampling distribution that applies for any sample size, even very small ones, when the population standard deviation is unknown. The standard error is found by dividing the sample standard deviation s by the square root of the sample size n. The point estimate is the sample mean ȳ.
The formula for a 95% confidence interval for a population mean µ using the t-distribution is:
ȳ ± t0.025(se), where se = s/√n and df = n − 1
With t-scores the confidence interval is a little wider than the corresponding z-interval. The t-distribution looks like a normal distribution, but it is less peaked in the middle and its tails are a bit thicker. It is symmetric around its mean of 0.
The standard deviation of the t-distribution depends on the degrees of freedom (df) and is always a bit bigger than 1. The formula for the degrees of freedom is df = n − 1.
The bigger the degrees of freedom, the more the t-distribution looks like a normal distribution: the peak rises and the tails get thinner. For df > 30 the two are practically identical.
T-scores can be found in software, on the internet, or in statistics books. For instance, a 95% confidence interval uses the t-score t0.025, the value with probability 0.025 in the right tail.
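A minimal sketch of the t-based interval in Python (the data values are invented for illustration):

```python
import numpy as np
from scipy import stats

data = np.array([4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1])  # invented data
n = len(data)

y_bar = data.mean()                 # point estimate: the sample mean
se = data.std(ddof=1) / np.sqrt(n)  # se = s / sqrt(n)
t = stats.t.ppf(0.975, df=n - 1)    # t-score for a 95% CI, df = n - 1

print(f"95% CI: ({y_bar - t * se:.2f}, {y_bar + t * se:.2f})")
```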
Robust means that a statistical method still works well even when a certain assumption is violated. Even when the population distribution isn't normal, the t-method can give a valid confidence interval for a mean. However, with extreme outliers or very skewed distributions the method doesn't work properly.
The standard normal distribution is the t-distribution with infinite degrees of freedom.
The t-distribution was discovered by Gosset while doing research for a brewery. He published his articles under the pseudonym 'Student'; the t-distribution is therefore sometimes called Student's t.
For determining sample size, the desired margin of error and the desired confidence level need to be decided upon. The desired margin of error is indicated as M.
The formula for finding the right sample size to estimate a population proportion is:
n = π(1 − π)(z/M)²
The z-score corresponds to the chosen confidence level, like 1.96 for 95%, and is determined by the requirement that the margin of error is no bigger than M. The population proportion π can be guessed, or can be set safely at 0.50 (the value that maximizes the required sample size).
The formula for finding the right sample size to estimate a population mean is:
n = σ²(z/M)²
Here too the z-score belongs to the chosen confidence level, like z = 1.96 for 0.95. The population standard deviation σ needs to be guessed.
The desired sample size depends on the margin of error and on the confidence level, but also on variability. Data with high variability requires a bigger sample size.
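A minimal sketch of both sample-size calculations (the desired margins of error and the guessed σ are arbitrary illustrative values):

```python
import math
from scipy.stats import norm

z = norm.ppf(0.975)  # 1.96 for a 95% confidence level

# Proportion: n = pi * (1 - pi) * (z / M)^2, with the safe guess pi = 0.50
M_prop, pi_guess = 0.04, 0.50
n_prop = pi_guess * (1 - pi_guess) * (z / M_prop) ** 2
print(math.ceil(n_prop))  # about 601

# Mean: n = sigma^2 * (z / M)^2, with a guessed population standard deviation
sigma, M_mean = 15.0, 2.0  # illustrative guesses
n_mean = sigma ** 2 * (z / M_mean) ** 2
print(math.ceil(n_mean))  # about 217
```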
Other factors also influence the choice of sample size. The more complex the analysis and the more variables involved, the bigger the sample needs to be. Time and money play a role as well. If a small sample is unavoidable, then two artificial observations are added to each category, so that the formulas for the confidence interval remain usable (see the sketch below).
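This small-sample adjustment simply shifts the counts before applying the usual formula. A sketch, with invented counts:

```python
import math
from scipy.stats import norm

successes, n = 3, 12                 # invented small-sample counts
p_tilde = (successes + 2) / (n + 4)  # add 2 observations to each category

se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
z = norm.ppf(0.975)
print(f"95% CI: ({p_tilde - z * se:.2f}, {p_tilde + z * se:.2f})")
```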
Apart from means and proportions, other statistics can also describe data. To make point estimates for other statistics as well, R.A. Fisher developed a method called maximum likelihood. This method chooses as estimate the parameter value for which the likelihood is maximal. The likelihood can be plotted as a curve, so it is immediately visible where the highest point lies. The likelihood of a parameter value is the probability of the observed sample outcome, given that parameter value: it shows how plausible the value is.
This method has three advantages, especially for big samples: 1) it is efficient, meaning other estimators don't have smaller standard errors or fall closer to the parameter; 2) it has little or no bias; and 3) its sampling distribution is usually approximately normal.
Fisher showed that, for normal populations, the mean is a more efficient estimator than the median. Only in exceptional cases, such as very skewed data, is the median better.
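To illustrate the idea, here is a minimal sketch evaluating the binomial likelihood over a grid of candidate proportions (the counts are invented). The peak of the likelihood curve lands at the sample proportion, which is the maximum likelihood estimate:

```python
import numpy as np
from scipy.stats import binom

successes, n = 7, 10                # invented counts
grid = np.linspace(0.01, 0.99, 99)  # candidate values for pi

likelihood = binom.pmf(successes, n, grid)  # P(data | pi) for each candidate
mle = grid[np.argmax(likelihood)]           # value where the curve peaks

print(f"maximum likelihood estimate: {mle:.2f}")  # 0.70 = sample proportion
```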
When even the shape of the population distribution is unknown, the bootstrap method can help. Software treats the sample as if it were the population, draws a new 'sample' from it with replacement, and repeats this process many times. From the resulting collection of resampled statistics, the bootstrap method derives the standard error and the confidence interval.
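A minimal bootstrap sketch in Python (the data are invented; 10,000 resamples is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.1, 3.5, 2.8, 4.0, 3.1, 5.2, 2.9, 3.7])  # invented sample

# Resample from the data with replacement, many times, recomputing the mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

se = boot_means.std(ddof=1)                      # bootstrap standard error
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # percentile 95% CI
print(f"se = {se:.3f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```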