BulletPoint summary of Statistics: The art and science of learning from data by Agresti and Franklin - 4th edition

What is statistics? - BulletPoints 1

  • Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data.
  • Statistics has three main components for answering a statistical question:
    • Design: thinking of how to get the data necessary to answer the question.
    • Description: the obtained data needs to be summarized and analyzed.
    • Inference: making decisions and predictions based on the obtained data for answering the question. (Infer means to arrive at a decision or prediction by reasoning from known evidence).
  • Descriptive statistics refers to methods for summarizing the collected data (where the data constitutes either a sample or a population). The summaries usually consist of graphs and numbers such as averages. The main purpose of descriptive statistics is to reduce the data to simple summaries without distorting or losing much information.

How to explore data with graphs and numerical summaries? - BulletPoints 2

  • Any characteristic observed in a study is referred to as a variable, because the values it takes vary. In a data set, the variables are usually listed in the columns, while the rows refer to different observations. Observations are the data values that are observed; an observation can be a number or a category. Numerical values that represent different magnitudes of the variable are called quantitative. If a variable takes values in a set of distinct categories, the variable is called categorical. Sometimes numbers are used to label categorical variables; these remain categorical variables and are not quantitative, because the numbers do not represent different magnitudes of the variable.
  • For summarizing a categorical variable, the two main graphical displays are:
    • The pie chart: a circle divided into slices, one for each category. The size of a slice corresponds to the percentage of observations in the category.
    • The bar graph: shows a vertical bar for each category. The height of a bar corresponds to the percentage of observations in the category. The bars typically stand apart from each other rather than touching.
  • The mean is the best-known and most frequently used measure of the center of a distribution of a quantitative variable. The mean is found by averaging the observations and is interpreted as the balance point of the distribution.
  • A better numerical summary of variability uses all the data and describes a typical distance of the observations from the mean. It does this by summarizing deviations from the mean. A deviation is the difference between an observation and the sample mean (x – x̄). Each observation has a deviation, which is positive when the observation falls above the mean and negative when it falls below the mean.
  • The pth percentile is a value such that p percent of the observations fall below or at that value. The 50th percentile is typically referred to as the median. Three useful percentiles are the quartiles:
    • First quartile (Q1) has p = 25.
    • Second quartile (Q2) has p = 50, thus is the median.
    • Third quartile (Q3) has p = 75.
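
A minimal Python sketch of the summaries above — the mean, the deviations from it, and the quartiles — with made-up data and numpy as an assumed dependency:

    import numpy as np

    data = np.array([2, 4, 4, 5, 7, 9, 12])   # made-up sample

    mean = data.mean()                  # the balance point of the distribution
    deviations = data - mean            # positive above the mean, negative below
    print(mean, round(deviations.sum(), 10))  # deviations always sum to 0

    # Quartiles: the 25th, 50th (median), and 75th percentiles
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    print(q1, median, q3)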

What roles do contingency, correlation and regression play in association testing? - BulletPoints 3

  • The main reason to do a data analysis is to find out whether there is an association between the variables, meaning that particular values of one variable are more likely to occur with particular values of the other. This chapter presents methods for studying whether certain associations exist and how strong they are.
  • If you investigate an association between two variables, there are three possible combinations of variable types:
    • Both variables are categorical,
    • One variable is quantitative and one variable is categorical,
    • Both variables are quantitative.
  • A common pitfall of regression is extrapolation: using a regression line to predict y values for x values that fall outside the observed range of the data. Forecasts are an example, such as predictions about future weather based on time series data. When making predictions about future years, you must assume that the past trend will continue, which is risky because everything changes over time. Such predictions are therefore never fully reliable.
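
A small sketch of the danger, fitting a straight line with numpy on made-up data observed only for x between 0 and 10, then (unwisely) predicting far outside that range:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 50)                 # x observed only between 0 and 10
    y = 2 + 0.5 * x + rng.normal(0, 1, 50)     # roughly linear in this range

    b, a = np.polyfit(x, y, 1)                 # returns slope first, then intercept
    print(a + b * 8)      # interpolation: x = 8 lies inside the observed range
    print(a + b * 100)    # extrapolation: nothing in the data supports this value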

How do you gather data? - BulletPoints 4

  • Statistics is used to learn about a population. A population consists of all the subjects of interest. Because it is often too difficult and time-consuming to seek out everyone in the population (for example, all the people in the Netherlands who studied psychology), researchers collect data on a subset of the population, called the sample. Such studies have two primary variables of interest: the response variable (outcome variable) and the explanatory variable (independent variable).
  • Lurking variables in observational studies can affect the results. These are variables that are not observed in the study but influence the association between the response and explanatory variables, because they are associated with both. By contrast, an experiment reduces the potential for lurking variable effects, because researchers randomly assign subjects to groups. Thanks to this randomization, the groups have similar distributions on the lurking variables; when the groups are balanced on a lurking variable, there is no association between the lurking variable and the explanatory variable.
  • The first step in taking a sample survey is to define the population you want to target. The second step is to compile a list of subjects from which the sample will be drawn; this list is called the sampling frame. Ideally, the sampling frame lists the entire population of interest. Once you have a sampling frame, you need a method for selecting the sample, called the sampling design.
  • There is bias if the way in which the study was designed or the data was gathered made certain outcomes occur more or less often in the sample than they do in the whole population.
    • Sampling bias: bias may result from the sampling method; this occurs when the sample is not random. Undercoverage can also cause problems: having a sampling frame that lacks representation from part of the population. Responses by those who are in the sampling frame can be quite different from responses by those who are not in the frame.
    • Nonresponse bias: this occurs when some sampled subjects refuse to participate or cannot be reached, so the sample is no longer random. Even those who do participate may not respond to some questions, resulting in nonresponse bias due to missing data.
    • Response bias: an interviewer might ask questions in a way that leads subjects to respond a certain way. Always avoid questions that are confusing, long, or leading. This bias also occurs when subjects give an incorrect response or lie.
  • In an experiment the subjects are often referred to as experimental units. When you have a treatment group, you also need a comparison group that receives no treatment, called the control group. Often the control group gets a placebo pill, partly so that the treatments appear identical to the subjects, but also because people who take a placebo tend to respond better than those who receive nothing. This is called the placebo effect.
  • To use random sampling, you need a sampling frame of nearly all the subjects in the population. When such a frame is not available, it is easier to identify clusters of subjects, for example city blocks of the population of New York; randomly selecting whole clusters is called a cluster random sample. The disadvantages are that you usually need a larger sample size, and that selecting a small number of clusters might result in a sample that is more homogeneous than the population really is.
  • An experiment can investigate cause and effect better than an observational study, but it is possible to design an observational study that controls for identified lurking variables. Some studies are backward looking (retrospective), others forward looking (prospective). A retrospective type is the case-control study, in which subjects who have the response outcome of interest and subjects who have the other response outcome are compared on the explanatory variable.
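
A small sketch of simple random sampling and cluster random sampling with Python's random module, using a hypothetical sampling frame of subject IDs (the frame and cluster size are made up):

    import random

    # Hypothetical sampling frame: IDs for everyone in the target population
    frame = [f"subject_{i}" for i in range(1, 1001)]

    random.seed(42)                        # for reproducibility
    srs = random.sample(frame, 25)         # simple random sample: no repeats,
    print(srs[:5])                         # every subject equally likely

    # Cluster random sampling: split the frame into clusters (blocks of 50)
    clusters = [frame[i:i + 50] for i in range(0, len(frame), 50)]
    chosen = random.sample(clusters, 3)    # sample a few whole clusters
    cluster_sample = [s for block in chosen for s in block]
    print(len(cluster_sample))             # 150 subjects, but only 3 clusters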

What role does probability have in our daily lives? - BulletPoints 5

  • Researchers rely on randomness to make sure that there is no bias in the data. Randomness also applies to the outcomes of a response variable, and it helps make games fair: everyone has the same chances for the possible outcomes.
  • Some basic rules help you find probabilities in certain situations:
    • The first step is to list all the possible outcomes of the random phenomenon. The set of possible outcomes for a random phenomenon is called the sample space. If you roll a die once, it has six possible outcomes: {1,2,3,4,5,6}. If you flip a coin twice, there are four different outcomes: {HH,HT,TH,TT}, with H meaning heads and T meaning tails.
    • If you want to visualize a small number of outcomes, the best thing you can do is to make a tree diagram. 
    • You sometimes need to specify a certain group of outcomes in a sample space. A subset of a sample space is called an event. Events are usually denoted by letters from the beginning of the alphabet or by a string of letters that describes the event.
    • Each outcome in the sample space has a probability. To find such probabilities, you list the sample space and specify plausible assumptions about the outcomes. The probability of each outcome lies between 0 and 1, and the total of all individual probabilities equals 1.
  • Probability is a big part of everyday life: when you make decisions or plan important things, you take the probability of a certain situation or outcome into account. Events that may seem coincidental, such as running into an acquaintance on a summer holiday, are often not so unusual when put in the context of all possible random occurrences in life. Across a large number of people, times, and events, seemingly surprising things are actually quite sure to happen somewhere. Once you have data, you can find patterns quickly, because our minds are programmed to look for patterns; it is not surprising to find some pattern that at first seems quite unusual. A cluster of occurrences of a disease in a neighborhood might worry the residents, yet it can be quite normal for diseases to cluster somewhere in a nation just by chance.
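
A minimal sketch enumerating the sample space of two coin flips with Python's itertools and checking the basic probability rules above:

    from itertools import product

    # Sample space for flipping a coin twice: {HH, HT, TH, TT}
    sample_space = ["".join(flips) for flips in product("HT", repeat=2)]
    print(sample_space)                     # ['HH', 'HT', 'TH', 'TT']

    # With a fair coin, each outcome is equally likely
    probs = {outcome: 1 / len(sample_space) for outcome in sample_space}
    print(sum(probs.values()))              # individual probabilities sum to 1

    # An event is a subset of the sample space, e.g. A = "at least one head"
    A = [o for o in sample_space if "H" in o]
    print(len(A) / len(sample_space))       # P(A) = 0.75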

What are probability distributions? - BulletPoints 6

  • In statistics, possible outcomes and their probabilities are summarized in a probability distribution. Two important probability distributions are the normal and the binomial distribution. The normal distribution is known for its bell-shaped form and plays a key role in statistical inference.
  • When a random variable has separate possible values, such as the number of tails in three flips of a coin {0,1,2,3}, the variable is called discrete. The probability distribution assigns a probability to each of the possible values. Each probability falls between 0 and 1, and the sum of all the probabilities equals 1. P(x) stands for the probability of the value x; for example, P(2) is the probability that the random variable X takes the value 2. Random variables can also be continuous, which means that the possible values form an interval rather than a set of separate numbers.
  • The mean of a probability distribution is the value you would get, in the long run, if you repeatedly observed the random variable (say, repeatedly rolled a die) and averaged the resulting values. The mean μ = ΣxP(x) is a weighted average: values of x with a larger P(x) receive more weight. It would not make sense to take a simple average of the possible outcomes, because some outcomes are more likely than others and need to receive more weight. The mean μ of the probability distribution of a random variable X is also called the expected value of X; it reflects what you expect for the average in a long run of observations of X.
  • Given a probability distribution, it is useful to summarize both the center and the variability. The standard deviation σ measures the variability from the mean; larger values of σ correspond to greater variability of the variable. Roughly, σ describes how far the values of the random variable fall, on average, from the mean (expected value) of the distribution.
  • The z-score for an observation is the number of standard deviations that it falls from the mean; it can be used for any distribution of a quantitative variable, both normal and nonnormal. The normal distribution is the most important distribution in statistics, because many variables have approximately normal distributions and because it approximates many discrete distributions when there is a large number of possible outcomes. But the main reason is that many statistical methods use the normal distribution even when the data are not bell shaped.
  • The empirical rule states that for an approximately bell-shaped distribution, about 68% of the observations fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and nearly all within 3 standard deviations of the mean.
  • In many cases, each observation is binary: it takes one of two possible outcomes. In a sample, you summarize such a variable by counting the number or the proportion of cases with the outcome of interest. With n = 5 observations, the possible values for the count X are {0,1,2,3,4,5}. Under certain conditions, a random variable X that counts the number of observations with the outcome of interest has a probability distribution called the binomial distribution.
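
A minimal sketch of these ideas in Python (standard library only), computing the weighted-average mean μ = ΣxP(x) and the standard deviation of a discrete distribution, and the binomial probabilities that generate it:

    import math

    # Discrete distribution: X = number of tails in 3 fair coin flips
    dist = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

    mu = sum(x * p for x, p in dist.items())          # mean = Σ x P(x)
    sigma = math.sqrt(sum((x - mu)**2 * p for x, p in dist.items()))
    print(mu, sigma)                                  # 1.5 and about 0.866

    # The same probabilities from the binomial formula with n = 3, p = 0.5
    def binom_pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    print([binom_pmf(k, 3, 0.5) for k in range(4)])   # [0.125, 0.375, 0.375, 0.125]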

What are sampling distributions? - BulletPoints 7

  • For example, let X = vote outcome, with x = 1 for people who voted for Trump and x = 0 for all other responses. The outcome is binary, and we are only interested in whether someone voted for Trump. The possible values of the random variable X (0 and 1) and the proportions of the population taking those values form the population distribution; the proportions of times those values occur in one sample form the data distribution for that sample.
  • Population distribution: This is the distribution from which we take the sample. The parameters, such as the population proportion p for a categorical variable, are fixed but usually unknown. By taking a sample of the population distribution, you can learn and make predictions about the unknown population parameter(s). 
  • Data distribution: This is the distribution of the data obtained from the sample and is the one we actually see in practice. It is described by statistics such as the sample proportion and the sample mean. With random sampling, the larger the sample size n, the more closely the data distribution resembles the population distribution.
  • Sampling distribution: This is the distribution of a statistic such as a sample proportion or a sample mean. The statistic varies from sample to sample, so there is an entire distribution of possible values for it. With random sampling, the sampling distribution provides probabilities for all possible values of the statistic.
  • You can describe the sampling distribution of a sample proportion by focusing on the key features of shape, center, and variability. For a sample proportion, the mean and standard deviation of the sampling distribution depend on the sample size n and the population proportion p: mean = p and standard deviation = square root( p(1 - p) / n ).
  • Is it also typical for the sampling distribution of a sample mean to have a bell shape even if the population distribution does not? Yes: even if the population distribution is skewed rather than bell shaped, the sampling distribution of the sample mean can have a bell shape. The mean of the sampling distribution equals the population mean μ, while its standard deviation equals σ / square root(n), which shrinks as n grows.
  • The bell shape is a consequence of the central limit theorem (CLT). It states that for a sufficiently large sample size n, the sampling distribution of the sample mean has approximately a normal distribution, no matter what the shape of the population distribution from which the samples are taken.
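
A small simulation sketch of the CLT with numpy (the population, sample size, and number of simulated samples are all made up): draw many samples from a skewed population and inspect the distribution of their means:

    import numpy as np

    rng = np.random.default_rng(0)

    # Skewed population: exponential with mean 1 (its sigma is also 1)
    n = 30
    sample_means = rng.exponential(1.0, size=(10_000, n)).mean(axis=1)

    print(sample_means.mean())   # close to the population mean, 1
    print(sample_means.std())    # close to sigma / sqrt(n) = 1 / sqrt(30) ≈ 0.18

    # Bell shape: about 95% of sample means fall within 2 standard errors
    print(np.mean(np.abs(sample_means - 1) < 2 / np.sqrt(n)))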

Statistical inference: What are confidence intervals? - BulletPoints 8

  • Statistical inference methods help us predict how close a sample statistic falls to the population parameter. You can then make decisions and predictions about populations even with data on relatively few subjects from the population. Relevant concepts in statistical inference include the role of randomization, probability, the normal distribution, and the use of the sampling distribution. There are two types of statistical inference: estimation and testing hypotheses. This chapter discusses estimation. The most informative estimation method gives an interval of numbers, known as the confidence interval.
  • A point estimate is a single number that is our best guess for the parameter. An interval estimate is an interval of numbers that is believed to contain the actual value of the parameter. The adjective 'point' refers to using a single number, or point, as the parameter estimate instead of a range. An interval estimate is more useful because it tells us how close the estimate is likely to be to the parameter, taking the margin of error into account.
  • For an estimate to be useful, you have to know how close it is to the actual parameter value. Inference about a parameter should provide not only a point estimate but also an indication of its likely precision. An interval estimate does this by giving an interval of numbers around the point estimate, made up of the most believable values for the unknown parameter based on the observed data. An interval estimate is designed to contain the parameter with some chosen probability, such as 0.95. A confidence interval is an interval containing the most believable values for a parameter. It is formed by a method that combines a point estimate with a margin of error. The probability that this method produces an interval that contains the parameter is called the confidence level, a number chosen to be close to 1, most commonly 0.95.
  • The margin of error measures how accurate the point estimate is likely to be in estimating the parameter. It is a multiple of the standard deviation of the sampling distribution of the point estimate, such as 1.96 × (standard deviation) when the sampling distribution is a normal distribution.
  • What does a 95% confidence interval mean? The meaning is a long-run interpretation: how the method performs when used over many random samples. If you use 95% confidence intervals repeatedly, then in the long run about 95% of those intervals give correct results, containing the population proportion; 5% of the time the sample proportion p̂ falls farther than 1.96(se) from p.
  • The sample mean x̅ is the point estimate of the population mean μ. Like the standard deviation of the sample proportion, the standard deviation of the sample mean (σ / square root(n)) depends on a parameter whose value is unknown, in this case σ. You estimate σ by the sample standard deviation s to be able to compute a margin of error and a confidence interval: se = s / square root(n).
  • A basic assumption of the confidence interval using the t distribution is that the population distribution is normal. How problematic is it to use the t confidence interval when the population distribution is not normal? For large samples this is not problematic, because of the central limit theorem: the sampling distribution is bell shaped even when the population distribution is not. But what about small n? Fortunately, the confidence interval using the t distribution is a robust method with respect to the normality assumption: it performs adequately even when that assumption is modestly violated.
  • To decide how large n must be, you first choose the desired margin of error, that is, how close the sample proportion should be to the population proportion. You must also choose the confidence level you want for achieving that margin of error.
  • The bootstrap is a simulation method that resamples from the observed data, treating the data distribution as if it were the population distribution. You resample, with replacement, n observations from the data distribution; each of the original n data points has probability 1/n of being selected as each 'new' observation in the bootstrap sample. For this new sample of size n, you construct the point estimate of the parameter. You then resample another set of n observations and construct another value of the point estimate, and so on. The variability of the resampled point estimates provides information about the accuracy of the original point estimate. A sketch follows below.
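
A sketch of both ideas with made-up data (numpy and scipy assumed): a 95% t interval for a mean, followed by a percentile bootstrap interval:

    import numpy as np
    from scipy import stats

    data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # made-up sample
    n = len(data)

    # 95% t confidence interval: x̄ ± t(se), with se = s / sqrt(n)
    se = data.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(0.975, df=n - 1)
    print(data.mean() - t * se, data.mean() + t * se)

    # Percentile bootstrap: resample with replacement, collect the estimates
    rng = np.random.default_rng(0)
    boot_means = [rng.choice(data, size=n, replace=True).mean()
                  for _ in range(10_000)]
    print(np.percentile(boot_means, [2.5, 97.5]))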

Statistical inference: What do significance tests say about hypotheses? - BulletPoints 9

  • Significance testing is the second major method for conducting statistical inference about a population, next to the confidence interval (which estimates a parameter). A significance test uses probability to quantify how plausible a parameter value is while controlling the chance of an incorrect inference.
  • A significance test is a method for using data to summarize the evidence about a hypothesis. Before conducting a significance test, you must identify the variable measured and the population parameter of interest. For a categorical variable the parameter is typically a proportion, for a quantitative variable the parameter is typically a mean.
  • A significance test has five steps:
    • Step 1: Assumptions. Each significance test makes certain assumptions or has conditions under which it applies. A test assumes that the data production used randomization. Other assumptions can be about the sample size, the shape of the population distribution, etc.
    • Step 2: Hypotheses. Each significance test has two hypotheses about a population parameter: the null hypothesis and the alternative hypothesis. The null hypothesis is a statement that the parameter takes a particular value; the alternative hypothesis states that the parameter falls in some alternative range of values. The null hypothesis value usually represents no effect, while the alternative hypothesis represents an effect of some unspecified type. The symbol H0 denotes the null hypothesis, Ha the alternative hypothesis.
    • Step 3: Test statistic. The parameter to which the hypotheses refer has a point estimate. A test statistic describes how far that point estimate falls from the parameter value given in the null hypothesis.
    • Step 4: P-value. To interpret a test statistic, we use a probability summary of the evidence against the null hypothesis. Presuming H0 is true, we consider the values we would expect the test statistic to take. If the test statistic falls well out in the tail of its sampling distribution, it is far from what H0 predicts. Since any single value may be unlikely, we summarize how far out in the tail the test statistic falls by the tail probability, called the p-value. The smaller the p-value, the stronger the evidence against H0.
    • Step 5: Conclusion. The conclusion of a significance test reports the p-value and interprets what it says about the question that motivated the test. Sometimes you can even say something about the validity of the null hypothesis: based on the p-value, can you reject H0 in favor of Ha? We can reject H0 in favor of Ha only when the p-value is very small, such as 0.05 or less. (A worked sketch follows this chapter's bullets.)
  • A significance test analyzes the strength of the evidence against the null hypothesis. You presume that H0 is true, putting the burden of proof on Ha. To support the alternative hypothesis, the data must contradict the null hypothesis: if the p-value is very small, the data contradict H0 and support Ha.
  • A small p-value means that the sample data would be unusual if the null hypothesis were true. When the p-value is not small, the conclusion is reported as 'do not reject the null hypothesis', because the data do not contradict it. But 'do not reject H0' is not the same as 'accept H0'. An analogy is a courtroom trial: the null hypothesis is that the defendant is innocent, the alternative that the defendant is guilty. If the jury acquits the defendant, it does not accept the defendant's claim of innocence; it merely means that innocence is plausible, because guilt has not been established beyond reasonable doubt.
  • A significance test about a mean has the same five steps as a test about a proportion: assumptions, hypotheses, test statistic, p-value, and conclusion.
  • Significance testing always has some uncertainty due to sampling variability. The two potential errors a test can make are called type I and type II errors. A type I error occurs when the null hypothesis is true but is wrongly rejected. A type II error occurs when the null hypothesis is false but is wrongly not rejected.
  • Significance testing can be very useful, but many researchers think the method has been overemphasized in research. A significance test merely indicates whether the particular parameter value in the null hypothesis is plausible. When the p-value is small, the test indicates that the hypothesized value is not plausible, but it tells us very little about which potential parameter values are plausible. A confidence interval is therefore more informative, because it displays the entire set of plausible values.
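
The worked sketch promised above, with made-up counts (scipy assumed): a two-sided z test of H0: p = 0.5 for a proportion, following the five steps:

    import math
    from scipy import stats

    # Made-up data: 550 successes in n = 1000 randomized observations
    n, successes, p0 = 1000, 550, 0.5        # hypotheses: H0: p = 0.5, Ha: p ≠ 0.5
    p_hat = successes / n

    # Test statistic: z = (p̂ - p0) / se0, with se0 computed presuming H0 is true
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0

    # p-value: two-tail probability of a test statistic at least this extreme
    p_value = 2 * stats.norm.sf(abs(z))
    print(z, p_value)   # conclusion: a small p-value is evidence against H0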

How do you compare two groups? - BulletPoints 10

  • Most comparisons of groups use independent samples from the groups: the observations in one sample are independent of those in the other sample. Independent samples arise, for instance, when you divide people by gender or smoking status. When two samples involve the same subjects, they are dependent, for instance in a diet study where people are weighed at the start and at the end to see how much weight they lost. Measuring the same person twice always gives dependent samples. Dependent samples also result when the data are matched pairs: each subject in one sample is matched with a subject in the other sample, such as married couples.
  • The confidence interval tells us how close the estimate is likely to be to the population difference (p1 - p2). To obtain it, we take the estimated difference and add and subtract a margin of error based on the standard error. The confidence interval has the form (p̂1 - p̂2) ± z(se).
  • You can also compare p1 and p2 with a significance test of H0: p1 = p2, the population proportion taking the same value in each group. Under the presumption that p1 = p2, we estimate their common value by the proportion of the total sample in the category of interest. This proportion p̂ is called the pooled estimate, because it pools the total number of successes and the total number of observations from the two samples. The test statistic measures the number of standard errors that the sample estimate falls from the null hypothesis value of 0: z = (estimate - null hypothesis value) / standard error = ((p̂1 - p̂2) - 0) / se0. A sketch follows this chapter's bullets.
  • You can also compare two means with a significance test of the null hypothesis H0: μ1 = μ2. The assumptions are the same as for the confidence interval: random samples whose sample means are approximately normally distributed, which the central limit theorem guarantees for large samples (roughly n > 30 per group).
  • There is also a test for comparing the population standard deviations, to see whether it is reasonable to assume they are equal, with null hypothesis H0: σ1 = σ2. The test statistic is denoted by F, and the test is called an F-test. The F-test for comparing standard deviations of two populations performs very poorly when the populations are not close to normal, which is why statisticians do not recommend it for general use. When the data suggest that the standard deviations may differ substantially, it is better to use two-sample t inference methods.
  • With dependent samples, each observation in one sample has a matched observation in the other sample; these observations are called matched pairs. You use dependent samples because sources of potential bias are controlled, so you can make more accurate comparisons.
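
A sketch of the pooled two-proportion z test above, with made-up counts (scipy assumed):

    import math
    from scipy import stats

    # Made-up counts: group 1 has 60/200 successes, group 2 has 45/200
    x1, n1, x2, n2 = 60, 200, 45, 200
    p1_hat, p2_hat = x1 / n1, x2 / n2

    # Pooled estimate of the common proportion under H0: p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)
    se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    z = ((p1_hat - p2_hat) - 0) / se0
    p_value = 2 * stats.norm.sf(abs(z))     # two-sided p-value
    print(z, p_value)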

How do you analyze the association between categorical variables? - BulletPoints 11

  • When you investigate an association, it is important first to identify the response and the explanatory variable. It is, for instance, more natural to study the influence of income (high/low) on happiness than the other way around; then income is the explanatory variable and happiness the response variable. You can put such data in a contingency table. The percentages in a row are called conditional percentages, and together they form the conditional distribution of happiness at that income level. The corresponding proportions are called conditional probabilities.
  • Independence and dependence: two categorical variables are independent if the population conditional distributions for one of them are identical at each category of the other. The variables are dependent (or associated) if the conditional distributions are not identical. There is a parallel between independence of events and independence of variables: two events A and B are independent if P(A) equals the conditional probability of A given B, denoted P(A|B). When two variables are independent, any event about one variable is independent of any event about the other variable.
  • To find the p-value for the X2 test statistic, you use the sampling distribution of the X2 statistic. For large sample sizes (as a guideline, an expected count of at least five in each cell), this sampling distribution is well approximated by the chi-squared probability distribution.
  • Presuming that the null hypothesis of independence is true, the sampling distribution of the X2 test statistic gets closer to the chi-squared distribution as the sample size increases. Keep in mind that, as with any hypothesis test, failing to reject the null hypothesis does not mean that the variables are definitely independent. All that can be said is that independence is still plausible and cannot be ruled out.
  • A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables. It can take on a range of values from one extreme to another as data range from the weakest to the strongest association.
  • A third measure of association is the odds. In a contingency table, the odds in a row are defined as the ratio of the two conditional proportions in that row: p1 / (1 - p1) in the first row and p2 / (1 - p2) in the second. The odds ratio is then the ratio of these two odds: [p1 / (1 - p1)] / [p2 / (1 - p2)].
  • The chi-squared test and measures of association such as the difference and ratio of proportions and the odds ratio are fundamental methods for analyzing contingency tables. The p-value for X2 summarizes the strength of the evidence against the null hypothesis of independence. When the p-value is small, you can conclude that the population cell proportions differ somewhere in the contingency table. A sketch follows below.
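
A sketch for a hypothetical 2×2 contingency table (numpy and scipy assumed), running the chi-squared test of independence and computing the odds ratio:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = income (high/low), columns = happy (yes/no)
    table = np.array([[80, 20],
                      [55, 45]])

    chi2, p_value, df, expected = chi2_contingency(table)
    print(chi2, p_value, df)   # small p-value: evidence against independence
    print(expected)            # expected counts presuming independence

    # Odds ratio [p1/(1-p1)] / [p2/(1-p2)], i.e. the cross-product ratio
    odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
    print(odds_ratio)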

How do you analyze the association between quantitative variables: regression analysis? - BulletPoints 12

  • A regression line is a straight line that predicts the value of a response variable y from the value of an explanatory variable x. The correlation, denoted by the letter r, is a summary measure of the association that falls between -1 and +1. You'll learn how to make inferences about the regression line for a population and how the variability of data points around the regression line helps us predict how far from the line a value of y is likely to fall. 
  • The first step of a regression analysis is to identify the response and explanatory variables. The first step in checking whether the two variables are associated is to look at a scatterplot: for each observation, a point represents its values on the two variables, shown relative to the x (horizontal) and y (vertical) axes. The scatterplot shows whether there is a straight-line trend between x and y, that is, a relationship that can be called linear.
  • When the scatterplot shows a linear trend between x and y, you can use the notation ŷ = a + bx. This is called the regression line. Here ŷ represents the predicted value of the response variable y, a is the y-intercept, and b is the slope. The regression line lets you describe the nature of the linear relationship, going beyond merely stating the strength of the association with the correlation r.
  • The regression equation is often called a prediction equation, because substituting a particular value of x into the equation provides a prediction for y at that value of x. The difference between the observed y and the predicted value ŷ is the prediction error, or residual.
  • At a given value of x the equation will predict a single value of the ŷ of the response variable. But we should not expect all subjects at that value of x to have the same value of y. The regression line connects the estimated means of y at the various x values. A similar equation describes the relationship between x and the means of y in the population. This is called the population regression equation and the formula for this is: μy = α + βx.
  • The alpha and beta are parameters: alpha is the population y-intercept and beta is the slope. Because these are parameters, their values are unknown; in practice, you use a to estimate alpha, b to estimate beta, and a + bx to estimate μy. A straight line is the simplest way to describe a relationship between two quantitative variables, but in practice relationships are rarely exactly linear, so μy = α + βx approximates the true relationship rather than capturing it perfectly. That is why it is called a model. If the true relationship is far from a straight line, the regression model may be a very poor model.
  • The null hypothesis stating that x and y are statistically independent is H0: β = 0. This significance test has the same purpose as the chi-squared test of independence has for categorical variables: the smaller the p-value, the greater the evidence against independence. The alternative hypothesis is usually two-sided, Ha: β ≠ 0, but can also be one-sided.
  • Both the correlation r and its square r2 describe the strength of the association, but they have different interpretations. The correlation falls between -1 and +1; it represents how many standard deviations y is predicted to change when x changes by one standard deviation, and it governs the extent of 'regression toward the mean'. The r2 measure falls between 0 and 1 and summarizes the reduction in prediction error when using the regression equation rather than ȳ to predict y; it indicates how much of the variability in y can be explained by x. See the sketch below.
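
A sketch with made-up, roughly linear data (numpy and scipy assumed), estimating a, b, r, r2, and the p-value for H0: β = 0:

    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 40)
    y = 1.0 + 0.8 * x + rng.normal(0, 1, 40)   # made-up linear trend plus noise

    fit = linregress(x, y)
    print(fit.intercept, fit.slope)    # a and b in ŷ = a + bx
    print(fit.rvalue, fit.rvalue**2)   # correlation r and r-squared
    print(fit.pvalue)                  # two-sided test of H0: β = 0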

What is multiple regression? - BulletPoints 13

  • When you have several explanatory variables, you can make better predictions by using all of them at once; that is the idea behind multiple regression. Besides helping you predict the response variable better, multiple regression lets you analyze the association between two variables while controlling for (keeping fixed) other variables. That is important because the effect of an explanatory variable can change considerably after you take a potential lurking variable into account.
  • With two explanatory variables, denoted x1 and x2, you get, instead of a simple regression model, a multiple regression model: μy = α + β1x1 + β2x2. The alpha and both betas are parameters. The corresponding sample prediction equation is ŷ = a + b1x1 + b2x2.
  • The simplest way to interpret a multiple regression equation is to look at it in two dimensions as a function of a single explanatory variable, fixing the values of the other explanatory variables. Keep in mind that the slopes of different explanatory variables cannot be compared when their units of measurement differ. The multiple regression model states that each explanatory variable has a straight-line relationship with the mean of y, given fixed values of the other explanatory variables, and it assumes that the slope for a particular explanatory variable is identical for all fixed values of the other explanatory variables.
  • To see how well a multiple regression model predicts y, you analyze how well the observed y values correlate with the predicted ŷ values. The explanatory variables are strongly associated with y if this correlation is strong. The correlation between the observed y values and the predicted ŷ values from the multiple regression model is called the multiple correlation, denoted by R. The larger the multiple correlation, the better the predictions of y by the set of explanatory variables. The predicted values ŷ cannot correlate negatively with y, since then the predictions would be worse than simply using ȳ to predict y; so R falls between 0 and 1. In this way, the multiple correlation R differs from the correlation r, which falls between -1 and +1.
  • The sampling distribution of the F test statistic is called the F distribution. The F distribution can assume only nonnegative values and, like the chi-squared distribution, is skewed to the right. Its precise shape is determined by two degrees of freedom terms, denoted df1 and df2.
  • Regression models handle categorical explanatory variables using artificial variables called indicator variables. The indicator variable for a particular category is binary: it equals 1 if the observation falls in that category and 0 otherwise.
  • Denote the possible outcomes by y = 0 for failure and y = 1 for success. The mean μy = p is then the population proportion of successes, which also represents the probability that a randomly selected subject has a success outcome. The model describes how p depends on the values of the explanatory variables. With a straight-line model, however, the predicted p can exceed 1 when the x values are extremely large and fall below 0 when they are extremely small, while a probability must always fall between 0 and 1. A realistic model therefore follows an S-shape rather than a straight-line trend; the regression equation that models this S-curve is the logistic regression equation.
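
A sketch of a multiple regression fit using plain numpy least squares on made-up data, including the multiple correlation R between y and ŷ:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100
    x1 = rng.normal(0, 1, n)                  # made-up explanatory variables
    x2 = rng.normal(0, 1, n)
    y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

    # Least squares estimates of alpha, beta1, beta2 in μy = α + β1x1 + β2x2
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)                               # a, b1, b2

    # Multiple correlation R: correlation between observed y and predicted ŷ
    y_hat = X @ coef
    R = np.corrcoef(y, y_hat)[0, 1]
    print(R, R**2)                            # R falls between 0 and 1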

How do you compare groups: analysis of variance methods? - BulletPoints 14

  • The methods in this chapter apply when a quantitative response variable has a categorical explanatory variable. The categories of the explanatory variable identify the groups to be compared on their means of the response variable. The inferential method for comparing means of several groups is called analysis of variance, or ANOVA; the name refers to the significance test focusing on two types of variability in the data. The categorical explanatory variables in multiple regression and in ANOVA are often referred to as factors. An ANOVA with one factor is called a one-way ANOVA; with two factors, a two-way ANOVA.
  • The analysis of variance test is a significance test with null hypothesis H0: μ1 = μ2 = ... = μg, stating that the population means of all g groups are the same. The alternative hypothesis Ha states that at least two of the population means are unequal. If the null hypothesis is false, perhaps all the population means differ, but perhaps only one mean differs from the others. The test analyzes whether the differences in the sample means could reasonably have occurred by chance if the null hypothesis of equal population means were true.
  • ANOVA compares population means, so why is it called analysis of variance? Because the test statistic uses evidence about two types of variability: between groups and within groups. Two data sets can have identical variability between their group means while one shows much smaller variability within each group; the smaller the within-group variability, the stronger the evidence against H0. The evidence against H0 also grows as the variability between group means increases and as the sample sizes increase. This is why the ANOVA test statistic compares the variability between groups to the variability within groups.
  • When the sample sizes of the groups are not equal, the within-groups estimate is a weighted average of the sample variances, with greater weight given to samples with larger sample sizes. This estimate is unbiased in either case: its sampling distribution has σ2 as its mean, regardless of whether the null hypothesis is true. The estimate of σ2 in the numerator of the F test statistic uses the variability between each sample mean and the overall sample mean for all the data.
  • The sum of the between-groups sum of squares and the within-groups sum of squares is the total sum of squares. The analysis of variance partitions the total sum of squares into two independent parts: the between-groups SS and the within-groups SS.
  • With g groups, there are g(g - 1)/2 pairs of groups to compare; with g = 3, there are 3(2)/2 = 3 comparisons: group 1 with 2, group 2 with 3, and group 1 with 3. With a 95% confidence interval, the probability is 0.95 that any particular interval we plan to construct will contain its parameter, but the confidence that all the intervals will simultaneously contain their parameters is considerably smaller. Methods that control the probability that all the confidence intervals contain the true differences in means are called multiple comparison methods: all intervals are designed to contain the true parameters simultaneously with an overall fixed probability, so the confidence level applies to the entire set of comparisons rather than to each separate comparison.
  • Suppose you want confidence 0.95 that all intervals will be simultaneously correct and you plan to construct five confidence intervals comparing means. The method then uses error probability 0.05/5 = 0.01 for each comparison, giving a 99% confidence level for each interval; this ensures that the overall confidence is at least 0.95. This is called the Bonferroni method. Often, though, the Tukey method is used: it is designed to give an overall confidence level very close to the desired value, and its intervals are slightly narrower than the Bonferroni intervals. The Tukey method is a bit complicated, but software handles its formula.
  • ANOVA can be seen as a special case of multiple regression. The factor defining the groups enters the regression model through indicator variables. Each indicator variable takes the value 0 or 1, indicating whether an observation falls in a particular group. With three groups you need two indicator variables:

x1 = 1 for observations from the first group, = 0 otherwise

x2 = 1 for observations from the second group, = 0 otherwise

So: Group 1: x1 = 1, x2 = 0; Group 2: x1 = 0, x2 = 1; Group 3: x1 = 0, x2 = 0.

  • Why perform a two-way ANOVA rather than two separate one-way ANOVAs? The main reason is that you learn about interaction: it is more informative to compare levels of one factor separately at each level of the other factor, which lets you investigate how the effect of one factor depends on the level of the other. As in an experiment, it is better to carry out a single study of both factors at once. Also, the residual variability, which affects the MS error and the denominators of the F test statistics, tends to decrease in a two-way ANOVA: using two factors to predict a response variable gives better predictions than using one. With less residual variability, you get larger test statistics and greater power for rejecting the null hypothesis. A sketch follows below.
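
A sketch with made-up samples for g = 3 groups (numpy and scipy assumed), running the one-way ANOVA F test and reproducing it as a regression with the indicator variables shown above:

    import numpy as np
    from scipy.stats import f_oneway

    # Made-up samples for three groups
    g1 = [23, 25, 28, 22, 26]
    g2 = [30, 31, 27, 33, 29]
    g3 = [24, 26, 25, 27, 23]

    f_stat, p_value = f_oneway(g1, g2, g3)
    print(f_stat, p_value)   # small p-value: at least two population means differ

    # The same comparison as regression with indicator variables x1 and x2
    y = np.array(g1 + g2 + g3)
    x1 = np.array([1]*5 + [0]*10)           # 1 for group 1 observations
    x2 = np.array([0]*5 + [1]*5 + [0]*5)    # 1 for group 2 observations
    X = np.column_stack([np.ones(15), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)   # intercept = mean of group 3; slopes = differences from it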

What does nonparametric statistics mean? - BulletPoints 15

  • Nonparametric statistics provide an alternative way to compare two groups without having to assume a normal distribution for the response variable; they use only the ranking of the subjects on the response variable. They are especially useful in two cases: 1) when the data are ranks for the subjects rather than quantitative measurements, and 2) when it is inappropriate to assume normality and the ordinary statistical method is not robust to violations of the normality assumption. We might prefer not to assume normality because we expect the distribution to be skewed, or because we have no idea about the distribution's shape and the sample size is too small for the central limit theorem to help.
  • For example, consider an experiment comparing a tanning lotion with a tanning studio. Suppose the two treatments had identical effects; then each of the ten possible outcomes is equally likely, each with probability 1/10. Using these 10 possible outcomes, you can construct a sampling distribution for the difference between the mean ranks. The null hypothesis states H0: the treatments are identical in tanning quality, and the alternative hypothesis states Ha: the tanning studio gives better tanning quality; this alternative hypothesis is one-sided. To find the p-value, you presume that the null hypothesis is true. The p-value is the probability of a difference between the sample mean ranks like the observed difference or even more extreme, in the direction giving more evidence in favor of Ha. The test comparing two groups based on the sampling distribution of the difference between sample mean ranks is called the Wilcoxon test.
  • The test for comparing mean ranks of several groups is called the Kruskal-Wallis test. The test assumes independent random samples, and its null hypothesis states that the groups have identical population distributions for the response variable. The test determines the ranks for the entire sample and then finds the sample mean rank for each group. The test statistic is based on the between-groups variability in sample mean ranks, with a constant chosen so that the test statistic has approximately a chi-squared sampling distribution. This sampling distribution indicates whether the variability among the sample mean ranks is large compared to what is expected under the null hypothesis of identical population distributions.
  • With matched pairs, the sign test merely observes for each pair which treatment does better, not how much better. The Wilcoxon signed-rank test is a nonparametric test designed for cases in which the paired comparisons themselves can be ranked. For each matched pair, it measures the difference between the responses. The null hypothesis states H0: the population median of the difference scores is 0. The test statistic is the sum of the ranks for those differences that are positive.
  • Statisticians have shown that nonparametric tests are often very nearly as good as parametric tests, even in the cases for which the parametric tests are designed. On the other hand, nonparametric tests have not been developed for some multivariate procedures, such as multiple regression. And recall that in many cases the parametric methods are robust, working well even if the assumptions are somewhat violated.
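
A sketch of the three tests with made-up scores (scipy assumed; scipy's mannwhitneyu is an equivalent form of the Wilcoxon rank-sum test):

    from scipy.stats import kruskal, mannwhitneyu, wilcoxon

    lotion = [4, 5, 6, 7, 5]        # made-up tanning-quality scores
    studio = [6, 8, 9, 10, 7]

    # Wilcoxon rank-sum / Mann-Whitney test, one-sided Ha: studio does better
    print(mannwhitneyu(studio, lotion, alternative="greater"))

    # Kruskal-Wallis test for several independent groups
    third = [3, 5, 4, 6, 5]
    print(kruskal(lotion, studio, third))

    # Wilcoxon signed-rank test for matched pairs (difference scores)
    before = [10, 12, 9, 11, 13]
    after = [12, 14, 10, 13, 15]
    diffs = [a - b for a, b in zip(after, before)]
    print(wilcoxon(diffs))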
