Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 6 summary

SUMMARIZING POSSIBLE OUTCOMES AND THEIR PROBABILITIES
All possible outcomes and their probabilities are summarized in a probability distribution. Two important probability distributions are the normal distribution and the binomial distribution. A random variable is a numerical measurement of the outcome of a random phenomenon. The probability distribution of a discrete random variable assigns a probability to each possible value. Numerical summaries of the population are called parameters, and a population distribution is a type of probability distribution: the one that applies when a subject is selected at random from the population.

The formula for the mean of a probability distribution for a discrete random variable is:

μ = Σ x P(x)

It is also called a weighted average: because some outcomes are more likely to occur than others, an ordinary (unweighted) mean would not be appropriate here. The mean of the probability distribution of a random variable X is also called the expected value of X. The standard deviation of a probability distribution measures the variability from the mean. It describes how far values of the random variable fall, on average, from the expected value of the distribution. In practice, a continuous variable is measured in a discrete manner because of rounding. A probability distribution for a continuous random variable is used to approximate the probability distribution of the possible rounded values.
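
As an illustration (not from the book), here is a minimal Python sketch of these definitions; the distribution itself is made up, and the standard deviation uses the standard definition σ = √(Σ (x - μ)² P(x)), which the text describes only in words:

```python
import math

# Hypothetical probability distribution of a discrete random variable X.
dist = {0: 0.2, 1: 0.5, 2: 0.2, 3: 0.1}

# Mean (expected value): mu = sum of x * P(x), a weighted average.
mu = sum(x * p for x, p in dist.items())

# Standard deviation: sigma = sqrt(sum of (x - mu)^2 * P(x)).
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

print(f"mean = {mu:.2f}, standard deviation = {sigma:.2f}")
```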

PROBABILITIES FOR BELL-SHAPED DISTRIBUTIONS

The z-score for a value x of a random variable is the number of standard deviations that x falls from the mean. It is calculated as:

z = (x - μ) / σ

The standard normal distribution is the normal distribution with mean 0 and standard deviation 1. It is the distribution of z-scores of a normal random variable.
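
A small sketch of the z-score calculation (the mean, standard deviation, and x value are made up; NormalDist is in the Python standard library from version 3.8):

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical mean and standard deviation
x = 120                      # hypothetical observed value

# z-score: number of standard deviations that x falls from the mean.
z = (x - mu) / sigma

# Cumulative probability below x, from the standard normal distribution.
prob_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(X < {x}) = {prob_below:.3f}")
```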

PROBABILITIES WHEN EACH OBSERVATION HAS TWO POSSIBLE OUTCOMES
An observation is binary if it has one of two possible outcomes (e.g: accept or decline, yes or no). A random variable X that counts the number of observations of a particular type has a probability distribution called the binomial distribution. There are a few conditions for a binomial distribution:

  1. Two possible outcomes
    Each trial has two possible outcomes.
  2. Same probability of success
    Each trial has the same probability of success.
  3. Trials are independent
    The outcome of any one trial does not depend on the outcomes of the other trials.

The formula for the binomial probabilities for any n is:

P(x) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x),  for x = 0, 1, 2, …, n

where p denotes the probability of success on a single trial. When sampling without replacement, the binomial distribution remains a good approximation as long as the sample size is less than 10% of the population size. The mean and standard deviation of the binomial distribution are:

μ = np and σ = √(np(1 - p))
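
A minimal sketch of these binomial formulas (n, p, and x are made up), using only the Python standard library:

```python
from math import comb, sqrt

n, p = 10, 0.3    # hypothetical number of trials and probability of success

# Binomial probability P(x) = [n! / (x!(n - x)!)] p^x (1 - p)^(n - x).
x = 3
prob = comb(n, x) * p**x * (1 - p)**(n - x)
print(f"P(X = {x}) = {prob:.4f}")

# Mean and standard deviation of the binomial distribution.
mean = n * p
sd = sqrt(n * p * (1 - p))
print(f"mean = {mean:.1f}, standard deviation = {sd:.3f}")
```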

 


Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 1 summary

USING DATA TO ANSWER STATISTICAL QUESTIONS
The information we gather with experiments and surveys is collectively called data. Statistics is the art and science of learning from data. Statistical problem solving consists of four things:

  1. Formulate a statistical question
  2. Collect data
  3. Analyse data
  4. Interpret results

The three main components of statistics for answering a statistical question are:

  1. Design
    Stating the goal and/or statistical question of interest and planning how to obtain data that will address them. (e.g: how do you conduct an experiment to determine the effects of ‘X’)
  2. Description
    Summarizing and analysing the data that are obtained (e.g., summarizing people's TV habits in hours of TV watched per day)
  3. Inference
    Making decisions and predictions based on the data for answering the statistical question. (predicting the outcome of an election, based on the description of the data)

Probability is a framework for quantifying how likely various possible outcomes are.

SAMPLE VERSUS POPULATION
The entities that are measured in a study are called the subjects. This usually means people, but it can also be schools, countries or days. The population is the set of all the subjects of interest. In practice, we usually have data for only some of the subjects who belong to that population. These subjects are called a sample.

Descriptive statistics refers to methods for summarizing the collected data. The summaries usually consist of graphs and numbers such as averages and percentages. Inferential statistics are used when data are available from a sample only, but we want to make a decision or prediction about the entire population. Inferential statistics refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population. The true parameter values are almost always unknown, thus we use sample statistics to estimate the parameter values.

A sample is random when everyone in the population has the same chance of being included in the sample. Random sampling allows us to make powerful inferences about populations. The margin of error is a measure of the expected variability from one random sample to the next random sample.

The formula for the approximate margin of error is 1/√n (multiply by 100 to express it as a percentage). In this case, n is the number of subjects in the sample.
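
As a quick sketch of this approximation (the sample size is made up, not from the book):

```python
from math import sqrt

n = 1000                          # hypothetical number of subjects
margin_of_error = 1 / sqrt(n)     # approximate margin of error, as a proportion

print(f"approximate margin of error = {margin_of_error:.3f} "
      f"(about {margin_of_error * 100:.1f} percentage points)")
```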

Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 2 summary

DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical or quantitative.

  • Categorical variables are variables whose values belong to a distinct set of categories. A categorical variable can have numerical labels when the numbers do not represent different magnitudes (e.g., area codes); other examples are religion, favourite sport, and type of bank account.
  • Quantitative variables are variables that have numerical values and represent different magnitudes. (e.g: weight, height, hours spent watching TV every day)

Key features for describing a quantitative variable are the centre and the variability (spread) of the data (e.g., the average number of hours spent watching TV every day). The key feature for describing a categorical variable is the relative number of observations in the various categories (e.g., the percentage of days in a year that were sunny).

Quantitative variables can be discrete or continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g., the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, such as 0.16, 0.13, 2.32 (e.g., weight: 68.3 kg).

The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

Category      A       B       C
Frequency     17      23      9
Proportion    0.347   0.469   0.184
Percentage    34.7%   46.9%   18.4%
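
A short Python sketch (not from the book) showing how the proportions and percentages in the table follow from the frequencies:

```python
# Frequencies from the table above.
frequencies = {"A": 17, "B": 23, "C": 9}
total = sum(frequencies.values())        # 49 observations in total

for category, count in frequencies.items():
    proportion = count / total
    print(f"{category}: proportion = {proportion:.3f}, percentage = {proportion:.1%}")
```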

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 3 summary

THE ASSOCIATION BETWEEN TWO CATEGORICAL VARIABLES
When analysing data, the first step is to distinguish between the response variable and the explanatory variable. The response variable is the outcome variable on which comparisons are made. If the explanatory variable is categorical, it defines the groups to be compared with respect to values of the response variable. If the explanatory variable is quantitative, the comparison is made across its different numerical values. The explanatory variable should explain the response variable (e.g., survival status is the response variable and smoking status is the explanatory variable).

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

A contingency table is a display for two categorical variables. A conditional proportion is a proportion computed conditional on a particular category of the other variable; it can also be expressed as a percentage. A proportion computed from the row or column totals (e.g., the percentage of 'no' answers in the whole sample) is called a marginal proportion.

If there is a clear explanatory/response relationship, it dictates the direction in which we compute the conditional proportions. Comparing conditional proportions is useful for determining whether there is an association. A variable can also be independent of another variable.

THE ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES
We examine a scatterplot to study the association between two quantitative variables. There is a difference between a positive association and a negative association. With a positive association, y tends to go up as x goes up. With a negative association, y tends to go down as x goes up.

Correlation describes the strength of the linear association. The correlation (r) summarizes the direction of the association between two quantitative variables and the strength of its linear trend. It can take values between -1 and +1. A positive value of r indicates a positive association and a negative value of r indicates a negative association. The closer r is to -1 or +1, the closer the data points fall to a straight line and the stronger the linear association is. The closer r is to 0, the weaker the linear association is.

The properties of the correlation:

  • The correlation always falls between -1 and +1.
  • A positive correlation indicates a positive association and a negative correlation indicates a negative association.
  • The value of the correlation does not depend on the variables' units (e.g., euros or dollars).
  • Two variables have the same correlation no matter which is treated as the response variable and which is treated as the explanatory variable.

The correlation r can be calculated as follows:

r = (1 / (n - 1)) Σ [(x - x̄) / s_x] [(y - ȳ) / s_y]

where n is the number of points, x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations of x and y.
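
A minimal Python sketch of this calculation (the data points are made up, not from the book):

```python
from statistics import mean, stdev

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)            # sample standard deviations

# r = (1 / (n - 1)) * sum of the products of the z-scores of x and y.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

print(f"r = {r:.3f}")
```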

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 5 summary

HOW PROBABILITY QUANTIFIES RANDOMNESS
Probability is the way we quantify uncertainty. It measures the chances of the possible outcomes of random phenomena. A random phenomenon is an everyday occurrence for which the outcome is uncertain. With random phenomena, the proportion of times that something happens is highly variable in the short run but very predictable in the long run. The law of large numbers states that as the number of trials increases, the proportion of occurrences of any outcome approaches a fixed number. The probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.

Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial (e.g., if you already have three children who are boys, the chance that the next child is a girl is not higher; it is still ½).

In the subjective definition of probability, the probability is not based on objective data but rather on subjective information. The probability of an outcome is then defined as a personal probability. This approach is the basis of Bayesian statistics.

FINDING PROBABILITIES
The sample space is the set of all possible outcomes (e.g., for the sex of a baby, the sample space is {boy, girl}). An event is a subset of the sample space; it corresponds to a particular outcome or a group of possible outcomes (e.g., rolling an even number with a die corresponds to the outcomes {2, 4, 6}). When all outcomes are equally likely, the probability of an event A is:

P(A) = (number of outcomes in event A) / (number of outcomes in the sample space)

For example, to find the probability of rolling a 6 with a fair die, you calculate it like this:

  • Number of outcomes in event A: 1 (there is only one way to roll a 6)
  • Number of outcomes in the sample space: 6 (the possible rolls are 1 through 6)
  • P(A) = 1/6

The rest of the sample space for event A is called the complement of A. The complement of an event consists of all outcomes in the sample space that are not in the event.
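
A tiny Python sketch (not from the book) of the probability of an event and its complement, for the die example above:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}    # all possible outcomes of a fair die
event_a = {6}                        # event A: rolling a 6

# P(A) = (number of outcomes in A) / (number of outcomes in the sample space).
p_a = Fraction(len(event_a), len(sample_space))

# Complement: P(not A) = 1 - P(A).
p_not_a = 1 - p_a

print(f"P(A) = {p_a}, P(not A) = {p_not_a}")    # 1/6 and 5/6
```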

Events that do not share any outcomes are disjoint. The event that both A and B occur is called the intersection of A and B. The event that the outcome is in A or in B is called the union of A and B.

There are three general rules for calculating the probabilities:

  1. Complement rule
    P(not A) = 1 - P(A)
  2. Addition rule
    There are two parts to the addition rule. For the union of two events:
    P(A or B) = P(A) + P(B) - P(A and B)
    If the events are disjoint:
    P(A or B) = P(A) + P(B)
.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 7 summary

HOW SAMPLE PROPORTIONS VARY AROUND THE POPULATION PROPORTION
The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take. The population distribution is the distribution from which the sample is taken; the values of its parameters are fixed but usually unknown. The data distribution is the distribution of the sample data. The sampling distribution of a statistic, such as a sample proportion, describes the variability that occurs from sample to sample.

For a random sample of size n from a population with proportion p of outcomes in a particular category, the sampling distribution of the sample proportion in that category has:

mean = p and standard deviation = √(p(1 - p) / n)

For a large sample size n, the binomial distribution is approximately normal. The central limit theorem states that the sampling distribution of the sample mean often has approximately a normal distribution; this result applies no matter what the shape of the population distribution from which the samples are taken. The standard deviation of the sampling distribution of the sample mean is σ/√n, where σ is the population standard deviation.

The larger the sample, the closer the sample mean tends to fall to the population mean.
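
A small simulation sketch (population proportion, sample size, and number of repetitions are made up, not from the book) that illustrates these results for a sample proportion:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

p, n = 0.4, 100                      # hypothetical population proportion and sample size
sample_proportions = []
for _ in range(10_000):              # draw many random samples of size n
    successes = sum(random.random() < p for _ in range(n))
    sample_proportions.append(successes / n)

print(f"simulated mean = {mean(sample_proportions):.3f} (theory: {p})")
print(f"simulated sd   = {stdev(sample_proportions):.4f} "
      f"(theory: {sqrt(p * (1 - p) / n):.4f})")
```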

Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 8 summary

POINT AND INTERVAL ESTIMATES OF POPULATION PARAMETERS
A point estimate is a single number that is our best guess for the parameter (e.g., 25% of all Dutch people are taller than 1.80 m). An interval estimate is an interval of numbers within which the parameter value is believed to fall (e.g., between 20% and 30% of Dutch people are taller than 1.80 m). The margin of error is the distance from the point estimate to the lower and upper bounds of the interval estimate.

A good estimator of a parameter has two properties:

  1. Unbiased
    A good estimator has a sampling distribution that is centred at the parameter: over repeated random samples, its values average out to the parameter value.
  2. Small standard deviation
    A good estimator has a small standard deviation compared to other estimators. The sample mean is preferred over the sample median, even in a normal distribution, because the sample mean has a smaller standard deviation.

An interval estimate is designed to contain the parameter with some chosen probability, such as 0.95. Confidence intervals are interval estimates that contain the parameter with a certain degree of confidence. A confidence interval is an interval containing the most believable values for a parameter. The probability that this method produces an interval that contains the parameter is called the confidence level. The sampling distribution of a sample proportion gives the possible values of the sample proportion and their probabilities, and it is approximately normal if np and n(1 - p) are both at least 15. The margin of error measures how accurate the point estimate is likely to be in estimating a parameter.

CONSTRUCTING A CONFIDENCE INTERVAL TO ESTIMATE A POPULATION PROPORTION
The point estimate of the population proportion is the sample proportion. The standard error is the estimated standard deviation of a sampling distribution. The formula for the standard error of a sample proportion is se = √(p̂(1 - p̂) / n).

The greater the confidence level, the greater the interval. The margin of error decreases with bigger samples, because the standard error decreases with bigger samples. The larger the sample, the narrower the interval. If using a 95% confidence interval over time, then 95% of the intervals would give correct results, containing the population proportion.
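
A minimal sketch of a 95% confidence interval for a proportion (the counts are made up, not from the book; 1.96 is the z-value for 95% confidence):

```python
from math import sqrt

successes, n = 520, 1000       # hypothetical sample: 520 'yes' answers out of 1000
p_hat = successes / n          # point estimate of the population proportion

se = sqrt(p_hat * (1 - p_hat) / n)   # standard error
margin_of_error = 1.96 * se          # 95% confidence

print(f"95% CI: {p_hat - margin_of_error:.3f} to {p_hat + margin_of_error:.3f}")
```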

CONSTRUCTING A CONFIDENCE INTERVAL TO ESTIMATE A POPULATION MEAN
The standard error of the sample mean, used when estimating a population mean, has the following formula: se = s/√n, where s is the sample standard deviation.

The t-score is like a z-score, but a bit larger, and comes from a bell-shaped distribution that has slightly thicker tails than a normal distribution. The distribution that uses the t-score and the standard error, rather than the z-score and the standard deviation is called the t-distribution. The standard deviation of the t-distribution is a bit larger than 1, with the precise value depending on what is called the degrees of freedom. The t-score has

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 9 summary

STEPS FOR PERFORMING A SIGNIFICANCE TEST
A hypothesis is a statement about the population. A significance test is a method for using data to summarize the evidence about a hypothesis. The null hypothesis (H0) is a statement that the parameter takes a particular value (e.g: probability of getting a baby girl: p = 0.482). The alternative hypothesis (Ha) states that the parameter falls in some alternative range of values. A significance test has five steps:

  1. Assumptions
    Each significance test has certain assumptions or conditions under which it applies (e.g., the assumption that random sampling has been used).
  2. Hypotheses
    Each significance test has two hypotheses about a population parameter. The null hypothesis and the alternative hypothesis.
  3. Test statistic
    The parameter to which the hypotheses refer has a point estimate. A test statistic describes how far that point estimate falls from the parameter value given in the null hypothesis. This is usually measured in number of standard errors between the point estimate and the parameter.
  4. P-value
    A probability summary of the evidence against the null hypothesis is used to interpret a test statistic. The P-value is the probability that the test statistic equals the observed value or a value even more extreme. It is calculated by presuming that the null hypothesis is true.
  5. Conclusion
    The conclusion of the significance test reports the P-value and interprets what it says about the question that motivated the test.

SIGNIFICANCE TESTS ABOUT PROPORTIONS
The steps of a significance test are the same for proportions. The biggest assumption made here is that the sample size is large enough for the sampling distribution to be approximately normal. The hypotheses for a significance test about a proportion are:

H0: p = p0, with Ha: p > p0 or Ha: p < p0

This is called a one-sided alternative hypothesis, because it has values falling only on one side of the null hypothesis value. A two-sided alternative hypothesis has the form:

Ha: p ≠ p0

The test statistic of a significance test about a proportion is:

z = (p̂ - p0) / se0, where se0 = √(p0(1 - p0) / n)
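
A short sketch of this test statistic and its two-sided P-value (the null value and the counts are made up, not from the book):

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.50                      # hypothetical null hypothesis value
successes, n = 560, 1000       # hypothetical sample data
p_hat = successes / n

se0 = sqrt(p0 * (1 - p0) / n)  # standard error, presuming the null hypothesis is true
z = (p_hat - p0) / se0         # test statistic

# Two-sided P-value: probability of a test statistic at least this extreme.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```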

The P-value of a significance test about a proportion is the left- or right-tail probability of a test statistic value even more extreme than the one observed. Smaller P-values indicate stronger evidence against the null hypothesis, because the data would be more unusual if the null hypothesis were true. In a two-sided test, the P-value is the single-tail probability doubled. The significance level is a number such that we reject H0 if the P-value is less than or equal to that number. The most common significance level is 0.05. If the data provide evidence to reject H0 and accept Ha, the results are called statistically significant. If Ha is rejected, this does not mean that

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 10 summary

CATEGORICAL RESPONSE: COMPARING TWO PROPORTIONS:
Bivariate methods are the general category of statistical methods used when we have two variables. The outcome variable on which comparisons are made is called the response variable. The binary variable that specifies the groups is the explanatory variable. With independent samples, the observations in one sample are independent of the observations in the other sample. If two samples have the same subjects, they are dependent. If each subject in one sample is matched with a subject in the other sample, the data consist of matched pairs and are dependent as well.

The formula for the standard error for comparing two proportions (for a confidence interval) is:

se = √( p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2 )

A 95% confidence interval for the difference between two population proportions is:

(p̂1 - p̂2) ± 1.96(se)

For the significance test, the proportion p̂ is called a pooled estimate, since it pools the total number of successes and the total number of observations from the two samples. This uses the presumption p1 = p2. The test statistic uses the following formula:

z = (p̂1 - p̂2) / se0

The standard error se0 for the test statistic uses the following formula:

se0 = √( p̂(1 - p̂)(1/n1 + 1/n2) )
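
A sketch of these formulas for comparing two proportions (all counts are made up, not from the book):

```python
from math import sqrt

# Hypothetical successes and sample sizes in two independent groups.
x1, n1 = 60, 200
x2, n2 = 45, 200
p1, p2 = x1 / n1, x2 / n2

# Standard error and 95% confidence interval for p1 - p2.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
print(f"95% CI for p1 - p2: {diff - 1.96 * se:.3f} to {diff + 1.96 * se:.3f}")

# Pooled estimate and standard error for the significance test (presumes p1 = p2).
p_pooled = (x1 + x2) / (n1 + n2)
se0 = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se0
print(f"test statistic z = {z:.2f}")
```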

QUANTITATIVE RESPONSE: COMPARING TWO MEANS:
The standard error for comparing two means has the following formula:

se = √( s1²/n1 + s2²/n2 )

A 95% confidence interval for the difference between two population means is:

(x̄1 - x̄2) ± t.025(se), where t.025 is the t-score with a right-tail probability of 0.025

The confidence interval for the difference between two population means uses the t-distribution and not the z-distribution. Interpreting a confidence interval for the difference of means uses the following criteria:

  1. Check whether or not 0 falls in the interval
    If it does, it is plausible that the two population means are equal.
  2. Positive confidence interval suggests that mean 1 – mean 2 is positive
    If the confidence interval only contains positive numbers, this suggests that mean 1 – mean 2 is positive. This suggests that mean 1 is larger than mean 2.
  3. Negative confidence interval suggests that mean 1 – mean 2 is negative
    If the confidence interval only contains negative numbers, this suggests that mean 1 -  mean 2 is negative. This suggests that mean 1 is smaller than mean 2.
  4. Group order is arbitrary
    It is arbitrary whether one group is group one or the other.

The test statistic of a significance test comparing two population means uses the following formula:

t = (x̄1 - x̄2 - 0) / se

It subtracts zero because the null hypothesis states that the difference between the two population means is zero.
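
A sketch of the comparison of two means (the data are made up, not from the book; the P-value and exact degrees of freedom would come from the t-distribution via software or a table):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical independent samples from two groups.
group1 = [23, 25, 28, 30, 26, 27, 24]
group2 = [20, 22, 25, 21, 23, 24, 22]

m1, m2 = mean(group1), mean(group2)
s1, s2 = stdev(group1), stdev(group2)
n1, n2 = len(group1), len(group2)

# Standard error for comparing two means.
se = sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic: how many standard errors the observed difference falls from 0.
t = (m1 - m2 - 0) / se
print(f"difference = {m1 - m2:.2f}, se = {se:.2f}, t = {t:.2f}")
```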

OTHER WAYS OF COMPARING MEANS AND COMPARING PROPORTIONS
If it is reasonable to expect that the variability as

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 11 summary

INDEPENDENCE AND DEPENDENCE (ASSOCIATION)
Conditional percentages refer to a sample data distribution, conditional on a category of the other variable; together they form the conditional distribution. If the conditional distributions of the response variable are identical across the categories of the explanatory variable, the variables are independent. If the conditional distributions differ, the variables are dependent (associated). Independence and dependence refer to the population, so two variables can be independent in the population even though their sample conditional distributions differ slightly.

TESTING CATEGORICAL VARIABLES FOR INDEPENDENCE
The expected cell count is the mean of the distribution for the count in any particular cell, presuming the null hypothesis. The formula for the expected cell count is the following:

expected cell count = (row total × column total) / total sample size

The chi-squared statistic summarizes how far the observed cell counts in a contingency table fall from the expected cell counts for a null hypothesis. It is the test statistic for the test of independence. The formula for the chi-squared statistic is:

X² = Σ (observed count - expected count)² / expected count, summed over all cells in the table

The sampling distribution using the chi-squared statistic is called the chi-squared probability distribution. The chi-squared probability distribution has several properties:

  1. Always positive
  2. Shape depends on degrees of freedom
  3. Mean equals degrees of freedom
  4. As degrees of freedom increases the distribution becomes more bell shaped
  5. Large chi-square is evidence against independence

The degrees of freedom for a table with r rows and c columns can be calculated as follows: df = (r - 1) × (c - 1).

If a response variable is identified and the population conditional distributions are identical, they are said to be homogeneous; the chi-squared test is then referred to as a test of homogeneity. The degrees of freedom value in a chi-squared test indicates how many parameters are needed to determine all the comparisons for describing the contingency table. The chi-squared test can test for independence, but it cannot describe the strength or direction of the association, and it addresses statistical significance rather than practical significance. When testing particular proportion values for a categorical variable, the chi-squared statistic is referred to as a goodness-of-fit statistic. The statistic summarizes how well the hypothesized values predict what happens with the observed data.
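
A sketch (not from the book) of the chi-squared calculation for a hypothetical 2 x 2 contingency table, using only the formulas above:

```python
# Hypothetical observed counts in a 2 x 2 contingency table.
observed = [[30, 70],
            [45, 55]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected cell count = (row total * column total) / total sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
chi_squared = sum((o - e) ** 2 / e
                  for obs_row, exp_row in zip(observed, expected)
                  for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)    # (r - 1)(c - 1)
print(f"chi-squared = {chi_squared:.2f}, df = {df}")
```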

DETERMINING THE STRENGTH OF THE ASSOCIATION
A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables. The association can be measured by looking at the difference between two proportions:

p̂1 - p̂2

The ratio of two proportions is also a measure of association; it is called the relative risk. The relative risk uses the following formula:

relative risk = p̂1 / p̂2

The relative risk has several properties:

  1. The relative risk can equal any non-negative number
  2. When p1=p2, the variables
.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 12 summary

MODEL HOW TWO VARIABLES ARE RELATED
A regression line is a straight line that predicts the value of a response variable y from the value of an explanatory variable x. The correlation is a summary measure of the association. The regression line uses the following formula:

ŷ = a + bx, where a is the y-intercept and b is the slope

The data are plotted before a regression line is fitted, because the line can be strongly influenced by outliers. The regression equation is often called a prediction equation. The difference y - ŷ between an observed outcome y and its predicted value ŷ is the prediction error, called the residual. The average of the residuals is zero. The regression line has a smaller sum of squared residuals than any other line; it is therefore called the least squares line. The population regression equation has the following formula:

μ_y = α + βx

This formula is a model. A model is a simple approximation for how variables relate in a population. The probability distributions of y values at a fixed value of x is a conditional distribution (e.g: the means of annual income for people with 12 years of education).

DESCRIBE STRENGTH OF ASSOCIATION
Correlation does not differentiate between response and explanatory variables. The slope can be calculated from the correlation as follows:

b = r (s_y / s_x)

Using this slope, the y-intercept can be calculated:

a = ȳ - b x̄

The slope cannot be used to judge the strength of the association, because it depends on the units of measurement. The correlation is the standardized version of the slope. The formula for the correlation in terms of the slope is:

r = b (s_x / s_y)
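
A minimal Python sketch of the slope and y-intercept calculations (the data are made up, not from the book):

```python
from statistics import mean, stdev

# Hypothetical data: explanatory variable x and response variable y.
x = [2, 4, 6, 8, 10]
y = [50, 57, 61, 70, 72]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation (as in Chapter 3), then slope and y-intercept.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)
b = r * (s_y / s_x)          # slope
a = y_bar - b * x_bar        # y-intercept

print(f"r = {r:.3f}, regression line: y-hat = {a:.2f} + {b:.2f}x")
```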

A property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If a particular x value falls 2.0 standard deviations from the mean and the correlation is 0.80, then the predicted y is r times that many standard deviations from its mean, so the predicted y would be 0.80 × 2.0 = 1.6 standard deviations from its mean. The predicted y is relatively closer to its mean than x is to its mean. This is regression toward the mean: if the first observation is extreme, the second observation tends to be closer to the mean and less extreme.

The overall prediction error from predicting y with the regression equation is summarized by the residual sum of squares, Σ(y - ŷ)².

The measure r squared is interpreted as proportional reduction in error (e.g: if r squared = 0.40, the error using y-hat to predict y is 40% smaller than the error using y-bar to predict y). The formula for r squared

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 14 summary

ONE-WAY ANOVA: COMPARING SEVERAL MEANS
The inferential method for comparing means of several groups is called analysis of variance, also called ANOVA. Categorical explanatory variables in multiple regression and in ANOVA are referred to as factors, also known as independent variables. An ANOVA with only one independent variable is called a one-way ANOVA.

Evidence against the null hypothesis in an ANOVA test is stronger when the variability within each sample is smaller or when the variability between the groups is larger. The formula for the F (ANOVA) test statistic is:

F = (between-groups estimate of the variance) / (within-groups estimate of the variance)

When the null hypothesis is true, the F test statistic tends to be close to 1; the mean of the F-distribution is approximately 1. When the null hypothesis is false, F tends to be larger than 1, and it tends to grow as the sample sizes increase. The larger the F-statistic, the smaller the P-value. The F-distribution has two degrees of freedom values:

df1 = g - 1 and df2 = N - g, where g is the number of groups and N is the total sample size

The ANOVA test has five steps:

  1. Assumptions
    A quantitative response variable for more than two groups. Independent random samples. Normal population distribution with equal standard deviation.
  2. Hypotheses
    H0: μ1 = μ2 = … = μg. Ha: at least two of the population means are not equal.
  3. Test statistic
    The test statistic is the F-statistic described above.
  4. P-value
    This is the right-tail probability of the observed F-value.
  5. Conclusion
    The null hypothesis is normally rejected if the P-value is smaller than 0.05.

If the sample sizes are equal, the within-groups estimate of the variance is the mean of the g sample variances for the g groups:

within-groups estimate = (s1² + s2² + … + sg²) / g

If the sample sizes are all equal to n, the between-groups estimate of the variance uses the following formula:

between-groups estimate = n [ (ȳ1 - ȳ)² + (ȳ2 - ȳ)² + … + (ȳg - ȳ)² ] / (g - 1), where ȳ is the overall sample mean

The ANOVA F-test is robust to violations of the normality assumption if the sample sizes are large enough. If the sample sizes are not equal, the F-test still works quite well as long as the largest group standard deviation is no more than about twice the smallest group standard deviation. A disadvantage of the F-test is that it tells us whether the groups differ, but not which groups differ.
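
A sketch of the F computation for equal sample sizes, using the within-groups and between-groups estimates above (the data are made up, not from the book; the P-value would come from the F-distribution):

```python
from statistics import mean, variance

# Hypothetical samples of equal size n from g = 3 groups.
groups = [
    [6, 8, 7, 9, 10],
    [5, 4, 6, 5, 5],
    [9, 11, 10, 12, 8],
]
g = len(groups)
n = len(groups[0])                                    # equal sample sizes
overall_mean = mean(x for grp in groups for x in grp)

# Within-groups estimate: mean of the g sample variances.
within = mean(variance(grp) for grp in groups)

# Between-groups estimate: n * sum of (group mean - overall mean)^2 / (g - 1).
between = n * sum((mean(grp) - overall_mean) ** 2 for grp in groups) / (g - 1)

F = between / within
df1, df2 = g - 1, g * n - g
print(f"F = {F:.2f} with df1 = {df1} and df2 = {df2}")
```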

ESTIMATING DIFFERENCES IN GROUPS FOR A SINGLE FACTOR
The F-test only tells us whether the groups differ, not how much or which groups differ. Confidence intervals can do this. A confidence interval for comparing two group means uses the following formula:

(ȳi - ȳj) ± t.025 s √(1/ni + 1/nj), where s is based on the within-groups estimate of the variance

The degrees of freedom for the confidence interval are:

df = N - g

If the confidence interval does not contain 0, we can infer that the population means are different. Methods that control the probability that all confidence intervals will contain the true differences in means are called multiple comparison methods. Multiple comparison methods compare pairs of means with a confidence level that applies simultaneously to the

.....read more
Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 15 summary

COMPARE TWO GROUPS BY RANKING
Nonparametric statistical methods are inferential methods that do not assume a particular form of distribution (e.g: the assumption of a normal distribution) for the population distribution. The Wilcoxon test is the best known nonparametric method. Nonparametric methods are useful when the data are ranked and when the assumption of normality is inappropriate.

The Wilcoxon test sets up a sampling distribution for the difference between the sample mean ranks, using the probability of each possible value. This test has five steps:

  1. Assumptions
    Independent random samples from two groups.
  2. Hypotheses
    H0: identical population distributions for the two groups. Ha: the population distributions differ (two-sided), or one distribution tends to give higher values than the other (one-sided).
  3. Test statistic
    This is the difference between the sample mean ranks for the two groups.
  4. P-value
    This is a one-tail or two-tail probability, depending on the alternative hypothesis.
  5. Conclusion
    The null hypothesis is either rejected in favour of the alternative hypothesis or not.

The sum of the ranks can also be used instead of the mean of the ranks. For a large enough sample, the Wilcoxon test can also be carried out as a z-test, with the test statistic equal to the difference between the sample mean ranks divided by its standard error.

A Wilcoxon test can also be conducted by converting quantitative observations to ranks. The Wilcoxon test is not affected by outliers (e.g: an extreme outlier gets the lowest/highest rank, no matter if it’s a bit higher or lower than the number before that). The difference between the population medians can also be used if the distribution is highly skewed, but this requires the extra assumption that the population distribution of the two groups have the same shape. The point estimate of the difference between two medians equals the median of the differences between the two groups. A sample proportion can also be used, by checking what the proportion is of observations in group one that’s better than group two. If there is a proportion of 0.50, then there is no effect. The closer the proportion gets to 0 or 1, the greater the difference between the two groups.
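
A sketch of this comparison using SciPy, assuming the third-party SciPy library is available (the data are made up, not from the book). SciPy's Mann-Whitney U test is equivalent to the Wilcoxon rank-sum comparison described here:

```python
from scipy.stats import mannwhitneyu

# Hypothetical quantitative observations from two independent groups.
group1 = [12, 15, 14, 10, 18, 20]
group2 = [9, 11, 13, 8, 10, 12]

# Equivalent to the Wilcoxon rank-sum test for two independent groups.
statistic, p_value = mannwhitneyu(group1, group2, alternative="two-sided")
print(f"U = {statistic}, two-sided P-value = {p_value:.3f}")
```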

NONPARAMETRIC METHODS FOR SEVERAL GROUPS AND FOR MATCHED PAIRS
The test for comparing mean ranks of more than two groups is called the Kruskal-Wallis test. This test has five steps:

  1. Assumptions
    Independent random samples.
  2. Hypotheses
    H0: identical population distributions for the g groups. Ha: at least two of the population distributions differ.
  3. Test statistic
    The test statistic is based on the between-groups variability in the sample mean ranks. Under the null hypothesis, it has an approximate chi-squared distribution with g - 1 degrees of freedom.
  4. P-value
    The right-tail probability above observed test statistic value from chi-squared distribution.
  5. Conclusion
    The null hypothesis is either rejected in favour of the alternative hypothesis or not.
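
A sketch of the Kruskal-Wallis test using SciPy, assuming the third-party SciPy library is available (the data are made up, not from the book):

```python
from scipy.stats import kruskal

# Hypothetical quantitative observations from three independent groups.
group1 = [12, 15, 14, 10, 18, 20]
group2 = [9, 11, 13, 8, 10, 12]
group3 = [16, 19, 21, 17, 22, 18]

# Kruskal-Wallis test: compares mean ranks across the g groups.
h_statistic, p_value = kruskal(group1, group2, group3)
print(f"H = {h_statistic:.2f}, P-value = {p_value:.3f}")
```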

It is

.....read more