Summary of Statistical Methods for the Social Sciences by Agresti - 6th edition
- What are statistical methods? – Chapter 1
- What kind of samples and variables are possible? – Chapter 2
- What are the main measures and graphs of descriptive statistics? - Chapter 3
- What role do probability distributions play in statistical inference? – Chapter 4
- How can estimates for statistical inference be made? – Chapter 5
- How do you perform significance tests? – Chapter 6
- How do you compare two groups in statistics? - Chapter 7
- How do you analyze the association between categorical variables? – Chapter 8
- How do linear regression and correlation work? – Chapter 9
- What type of multivariate relationships exist? – Chapter 10
- What is multiple regression? – Chapter 11
- What is ANOVA? – Chapter 12
- How does multiple regression with both quantitative and categorical predictors work? – Chapter 13
- How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14
- What is logistic regression? – Chapter 15
- What advanced methodologies are there? - Chapter 16
What are statistical methods? – Chapter 1
What is statistics and how can you learn it?
Statistics is used more and more often to study the behavior of people, not only by the social sciences but also by companies. Everyone can learn how to use statistics, even without much knowledge of mathematics and even by those who fear statistics. Most important are logical thinking and perseverance.
The first step in using statistical methods is to collect data. Data are collected observations of characteristics of interest, for instance the opinion of 1000 people on whether marijuana should be legal. Data can be obtained through questionnaires, experiments, observations or existing databases.
But statistics aren't only numbers obtained from data. A broader definition of statistics entails all methods to obtain and analyze data.
What is the difference between descriptive and inferential statistics?
Before data can be analyzed, a design is made for how to obtain the data. Next there are two sorts of statistical analyses: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information obtained from a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind of statistics is used depends on the goal of the research (summarize or predict).
To understand the differences better, a number of basic terms are important. The subjects are the entities that are observed in a research study, most often people but sometimes families, schools, cities etc. The population is the entire set of subjects that you want to study (for instance foreign students). The sample is a limited number of selected subjects on which you will collect data (for instance 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it's usually impossible to research the entire population, a sample is drawn.
Descriptive statistics can be used both when data is available for the entire population and when it is only available for a sample. Inferential statistics is only applicable to samples, because it draws conclusions about something that hasn't been fully observed. Hence the definition of inferential statistics is making predictions about a population, based on data gathered from a sample.
The goal of statistics is to learn more about the parameter. The parameter is the numerical summary of the population, the unknown value that can tell something about the condition of the whole. So it's not about the sample but about the population. This is why an important part of inferential statistics is measuring and assessing how representative a sample is.
A population can be real (for instance foreign students) or conceptual (for instance the foreign students that will pass their statistics course this year).
What part does software play in statistics?
Software enables an easy application of complex methods. The most used software for statistics are SPSS, R, SAS and Stata.
What kind of samples and variables are possible? – Chapter 2
What kind of variables can be measured?
All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). The use of variables is to indicate the variability of a value, for example the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to distinguish variables, are possible.
The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, numbers of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status, or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex).
There are four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales:
- The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.
- The ordinal scale on the other hand assumes a certain order. For instance happiness. If the possible values are unhappy, considerably unhappy, neutral, considerably happy, and ecstatic, then there is a certain order. If a respondent indicates to be neutral, this is happier than considerably unhappy, which in turn is happier than unhappy. The important point is that the distances between the values cannot be measured; this is what distinguishes ordinal from interval scales.
Quantitative variables have an interval or ratio scale:
- Interval means that there are measurable differences between the values. For instance temperature in Celsius: there is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent. An interval scale has no true zero point (0 °C does not mean an absence of temperature).
- A ratio scale does have a true zero point. The ratio scale has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage or income.
Furthermore there are discrete and continuous variables:
- A variable is discrete when the possible values can only be limited, separate numbers. For instance the number of brothers and sisters is discrete, because it's not possible to have 2.43 siblings.
- A variable is continuous when the values can be anything possible. Weight is an example of a continuous variable, as it's possible to weigh 70 kilo but also 70.52 kilo.
Categorical variables (nominal or ordinal) are always discrete because they have a limited number of categories. Quantitative variables can be either discrete or continuous. Quantitative variables that can take very many possible values are treated as continuous.
How does randomization work?
Randomization is the mechanism of obtaining a representative sample. In a simple random sample every subject of the population has an equal chance of becoming part of the sample. The randomness is important, because it needs to be guaranteed that the data isn't biased. Biased information would make inferential statistics useless, because then it's impossible to say anything about the population.
For a random sample a sampling frame is necessary; a list of all subjects within the population. Next all subjects are numbered and then random numbers are drawn. Drawing random numbers can be done using software, for instance R. In R the following command can be used:
> sample(1:60, 4)
[1] 22 47 38 44
The symbol > is the R prompt, indicating that the program is ready to execute a command. In this example the goal is to select four random subjects from a list of 60 subjects in total. The program indicates which subjects are chosen: numbers 22, 47, 38 and 44.
Data can be collected using surveys, experiments and observational studies. All these methods can have a degree of randomization.
Different types of surveys are possible; online, offline etc. Every way to gather data has challenges in terms of representing the population accurately.
Experiments are used to measure and compare the reactions from subjects under different conditions. These conditions, so called treatments, are values of a variable that can influence the reaction. It is up to the researcher to decide which subjects will follow which treatments. This is where randomization plays a part; the researcher needs to divide the subjects into groups randomly. In this case an experimental design is used to constitute which subjects will follow which treatments.
In observational studies the researcher measures the values of variables without influencing or manipulating the situation. Who will be observed, is determined at random. The biggest risk of this method is that a variable that influences the results remains unseen.
How do you control variability and bias?
In theory, a measure must be valid, which means that it is clear what it's supposed to measure and that it accurately reflects this concept. A measure must also be reliable, meaning that it's consistent and a respondent would give the same answer again when asked twice. In reality, however, all kinds of factors can influence a study.
Even multiple completely random samples will each differ somewhat from the population. This difference is called the sampling error: how much the statistic that is drawn from the sample differs from the parameter that indicates the value in the population. In other words, the sampling error indicates how far the sample result lies from the actual population value. If in the population 66% agrees with government policy, but in the sample 68%, then the sampling error is 2%. In most cases, for samples of over 1000 subjects the sampling error remains limited to about 3%. This bound is called the margin of error. This concept is often used in statistics because it says something about the quality of a sample.
Apart from sampling error there are other factors that influence the results from random samples, such as sampling bias, response bias and non-response bias.
In probability sampling the chance of every possible sample is known. In nonprobability sampling this is not known, the reliability is unknown and sampling bias can occur. Sampling bias occurs when it's not possible to guarantee that all members of the population have an equal chance of becoming part of the sample. This happens for instance when only volunteers take part in a study. Volunteers can be different from people who choose not to participate. The resulting distortion on certain variables is called selection bias.
When questions in a survey or interview are asked in a certain fashion or sequence, response bias can occur. The interviewers may want to get socially desirable answers, with questions such as “Do you agree that...?” The respondents prefer not to disagree with the interviewer and are more inclined to agree, even if they might not want to. Also the general inclination to give answers that people think the interviewer favors, is part of response bias.
Non-response bias happens when people quit during research or other factors result in missing data. Some people choose not to answer certain questions, for various reasons. When people decide to quit, they may have different values on important variables compared to the respondents that remain. This can influence the data, even in a random sample.
Which methods can be used for probability sampling?
Apart from simple random samples, there are other possible methods. There are cases when a simple random sample isn't possible, and sometimes it's desirable or easier not to use one. Other methods still use probability sampling (so that the chance of every possible sample is known) and randomization (so that a representative sample remains the goal).
In a systematic random sample the subjects are chosen in a systematic manner, by consistently skipping a certain number of subjects. An example is selecting every tenth house in a street. The formula for this method is:
\[k=\frac{N}{n}\]
- k is the skip number: every kth subject is selected
- N is the population size
- n is the sample size
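As a small illustration (not from the book, with arbitrary example values), the skip number and a systematic selection could be computed in R along these lines:
> N <- 60; n <- 4                             # population size and sample size (example values)
> k <- N / n                                  # skip number: select every k-th subject
> start <- sample(1:k, 1)                     # random starting point within the first k subjects
> seq(from = start, by = k, length.out = n)   # the selected subject numbers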
A stratified sample divides the population in groups, also called strata. From each stratum a number of subjects is chosen at random to form the sample. This can be proportional or disproportional. In a proportional stratified sample the proportions in the strata are equal to the proportions in the population. If for instance 60% of the population is male and 40% is female, then this needs to be the same in the sample. Sometimes it may be better to use a disproportional stratified sample. If only 10% of the population is female, a proportional sample of 100 subjects would contain just 10 women. A group that small is too small to be representative, so no conclusions could be drawn about that part of the population. In that case it's better to choose a disproportional stratified sample.
Most samples require access to the entire population, but in reality this may not be given. In that case cluster sampling may be an option. This requires dividing the population in clusters (for instance city districts) and randomly choosing one cluster. The difference with stratified samples is that not every cluster is represented.
Another option is multistage sampling; several layered samples. For instance first provinces are selected, then cities within those provinces and then streets within those cities.
What are the main measures and graphs of descriptive statistics? - Chapter 3
Which tables and graphs display data?
Descriptive statistics serve to create an overview or summary of data. There are two kinds of data, quantitative and categorical, each has different descriptive statistics.
To create an overview of categorical data, it's easiest to list the categories including the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category compared to the whole sample. This can be calculated as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations, multiplied by 100. Calculating a proportion works the same way, except the number isn't multiplied by 100. The sum of all proportions should be 1.00, the sum of all percentages should be 100.
Frequencies can be shown using a frequency distribution, a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows each value's share of the sample.
Example (relative) frequency distribution:
| Gender | Frequency | Proportion | Percentage |
|--------|-----------|------------|------------|
| Male | 150 | 0.43 | 43% |
| Female | 200 | 0.57 | 57% |
| Total | 350 (= n) | 1.00 | 100% |
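A (relative) frequency distribution like the one above can be produced in R; the data below are hypothetical raw values constructed to match the table:
> gender <- c(rep("Male", 150), rep("Female", 200))   # hypothetical raw data
> counts <- table(gender)                             # frequency per category
> counts
> round(prop.table(counts), 2)                        # proportions; multiply by 100 for percentages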
Aside from tables also other visual displays are used, such as bar graphs, pie charts, histograms and stem-and-leaf plots.
- A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.
- A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.
Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.
- A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then it's easier to divide them into intervals.
- A stem-and-leaf plot represents each observation using a stem and a leaf; two numbers that form an observation if you put them together. This kind of graph is only useful when there is little data and you want to display it quickly.
When visual displays are given for a population, then they're called population distributions. When they're given for samples, they're called sample distributions.
The data can be shown using a curve in a graph. The bigger the sample and the more data, the more similarities between the sample graph and the curve of the population. The shape of a graph contains information on the distribution of the data. Most used is the normal distribution, a bell shape. This shape is symmetrical. If the x-axis indicates the value of a variable, then the y-axis indicates the relative frequency of the value. The highest point is in the middle, so the value in the middle is the most prevalent.
Another possibility is a U-shaped graph. The most prevalent values are then the lowest and the highest scores, which indicates polarization.
The two ends of a curve are called tails. If one tail is longer than the other and the distribution isn't symmetrical, then the distribution must be skewed either to the right or to the left.
How do you describe the center of data using mean, median and mode?
The average is the most well-known measure to describe the center of data for a frequency distribution of a quantitative variable. The average is also called the mean and it is calculated as the sum of the observations divided by the total number of observations. For example, if a variable (y) has the values 34 (y1), 55 (y2) and 64 (y3), then the mean (ȳ) is (34 + 55 + 64)/3 = 51. The symbol ȳ is pronounced 'y-bar'.
The formula for calculating the mean is:
\[\bar{y}=\frac{\sum{y_i}}{n}\]
The symbol ∑ is the Greek letter sigma and means the sum of what follows. The index i runs from 1 to n (the sample size). So ∑yi means y1 + y2 + … + yn (the sum of all observations).
The mean can only be used for quantitative data and is very sensitive to outliers; exceptionally high or low values.
For multiple samples (n1 and n2), multiple means can be found (ȳ1 and ȳ2).
Another way to describe the center is the median. The median is the observation that falls in the middle of the ordered sample. If a variable has values 1, 3, 5, 8 and 10, then the median is 5. In case of an even number of observations, such as 1, 3, 8 and 10, the median is (3 + 8)/2 = 5.5.
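These calculations can be checked in R, using the example values from the text:
> y <- c(34, 55, 64)
> mean(y)                   # (34 + 55 + 64) / 3 = 51
> median(y)                 # middle value of the ordered data: 55
> median(c(1, 3, 8, 10))    # even number of observations: (3 + 8) / 2 = 5.5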
Important rules about the median are:
- Apart from quantitative data the median can also be found for categorical data on an ordinal scale, because the median only requires a certain order in the observations.
- For completely symmetrical data the median and the mean should be the same.
- The mean lies closer to the tail than the median for a skewed distribution.
- The median is not sensitive to outliers. This is both positive and negative. On the one hand, if there is just one outlier in the data, the median doesn't give a biased portrayal of the data. On the other hand, there can be huge variability while the median still gives the same value.
Compared to the mean, the median represents the sample better in case of outliers. The median gives more information if the distribution is very skewed. However, there are also cases where the median is less favorable for representing the data. When the data is only binary (only 0 or 1), then the median is the proportion of the number of times that 1 is observed. Also in other cases where the data is highly discrete, the mean represents the data better than the median does.
Another measure is the mode: the value that occurs most often. The mode is mainly useful for highly discrete variables, especially categorical data.
How can you measure the variability of data?
The variability of data refers to how spread out the values of a variable are, for instance how much the incomes of the respondents differ. The variability can be displayed in several ways.
First, the range can be calculated: the difference between the lowest and the highest observation. For example, for the values 4, 10, 16 and 20 the range is 20 – 4 = 16.
However, the most used method for showing the variability of data is calculating the standard deviation (s). A deviation is the difference between a measured value (yi) and the mean of the sample (ȳ), so it is (yi – ȳ). Every observation has its own deviation: positive when the observation has a higher value than the mean, negative when it has a lower value. The standard deviation of a variable summarizes these deviations by using the sum of their squares. The formula for the standard deviation is:
\[s=\sqrt{\frac{\sum{(y_i-\bar{y})^2}}{n-1}}\]
The upper part of the formula, ∑(yi – ȳ)², is called the sum of squares; it squares all the deviations of the observations. The information given by the standard deviation is how much an observation typically deviates from the mean, so how much the data varies. When the standard deviation is 0, there is no variability at all.
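As a sketch with a small hypothetical sample, the formula can be applied step by step in R and compared with the built-in sd() function:
> y <- c(4, 10, 16, 20)                 # hypothetical sample
> ybar <- mean(y)
> sum_sq <- sum((y - ybar)^2)           # sum of squares of the deviations
> sqrt(sum_sq / (length(y) - 1))        # standard deviation by the formula (here 7)
> sd(y)                                 # built-in function, same result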
The variance is:
\[s^2=\frac{\sum{(y_i-\bar{y})^2}}{n-1}\]
The variance is the mean of the squares of the deviations. The standard deviation is used more often as an indication of the variability than the variance.
When data is available for the entire population, then instead of n-1 the population size is used for calculating the standard deviation.
For interpreting s, the so-called empirical rule can be used for bell-shaped distributions:
- 68% of data lies between ȳ – s and ȳ + s.
- 95% of data lies between ȳ – 2s and ȳ + 2s.
- All or nearly all observations lie between ȳ – 3s and ȳ + 3s.
Outliers have a big effect on the standard deviation.
How can you measure quartiles and other positions on a distribution?
Distributions can be interpreted with several kinds of positions. One way to divide a distribution in parts, is using percentiles. The pth percentile is the point where p% of the observations fall below or at that point and the rest of observations, (100-p)%, falls above. A percentile indicates a point in a graph, not part of a graph.
Another way is to divide a distribution in four parts. The 25th percentile is then called the lower quartile and the 75th percentile the upper quartile. The distance between them is called the interquartile range (IQR); the middle half of the data falls within it. The lower quartile is the median of the lower half of the data and the upper quartile is the median of the upper half, while the median itself splits the data in two. An advantage of the IQR compared to the range and the standard deviation is that the IQR is insensitive to outliers.
Five positions are often used to give a summary of a distribution: minimum, lower quartile, median, upper quartile and maximum. The positions can be shown in a boxplot, a graph that indicates the variability of data. The box of a boxplot contains the central 50% of the distribution.
The horizontal lines of a boxplot towards the minimum and maximum are called the whiskers. Extreme outliers are indicated with a dot outside of the whiskers. An observation is regarded an outlier when it falls more than 1.5 × IQR below the lower quartile or above the upper quartile. A boxplot makes the outliers very explicit; this should be a trigger for the researcher to check again whether the research methods have been applied properly.
Several sorts of graphs help to compare two or more groups, for instance a relative frequency distribution, histogram or two boxplots next to each other.
Another position is the z-score. This is the number of standard deviations that a value differs from the mean. The formula is: z = (observation – mean) / standard deviation. Contrary to other positions, the z-score can give information about a specific value.
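A rough sketch of these positions in R, using hypothetical data generated for the example:
> y <- rnorm(100, mean = 50, sd = 10)   # 100 hypothetical observations
> quantile(y, c(0.25, 0.50, 0.75))      # lower quartile, median, upper quartile
> IQR(y)                                # interquartile range
> (65 - mean(y)) / sd(y)                # z-score of the value 65
> boxplot(y)                            # boxplot based on the five positions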
What do you call statistics for multiple variables?
Statistics is often about the association between two variables; whether one variable has an influence on another. This is called bivariate analysis.
Most often a research studies the effect of an explanatory variable (also called independent variable) on a response variable (also called dependent variable). The output of the response variable is caused by the explanatory variable.
The influence of one variable on another can be portrayed graphically in several ways. A contingency table lists the results for each combination of the variables. A scatterplot is a graph with the explanatory variable on the x-axis and the response variable on the y-axis; each subject's pair of values is shown as a dot. The strength of an association is called the correlation. Regression analysis predicts the value of y for a given value of x. When an association exists between variables, this doesn't necessarily mean that there is causality. For more than two variables, multivariate analysis is used.
Which letters are used in formulas to mark the difference between the sample and the population?
In statistics it's important not to lose sight of the difference between the statistic that describes only the sample and the parameter that describes the entire population. Greek letters are used for the population parameters, Roman letters for the sample statistics. For a sample, ȳ indicates the mean and s the standard deviation. For a population, μ indicates the population mean and σ the population standard deviation. Sample statistics such as the mean and the standard deviation vary from sample to sample, so they can be regarded as variables; population parameters are fixed, because there is only one population.
What role do probability distributions play in statistical inference? – Chapter 4
What are the basic rules of probability?
Randomization is important for collecting data: the possible observations are known but it's yet unknown which possibility will occur. What will happen depends on probability. The probability is the proportion of times that a certain observation occurs in a long sequence of similar observations. The fact that the sequence is long is important, because the longer the sequence, the more accurate the probability; the sample proportion then becomes more like the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch within statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.
A probability is written like P(A), where P is the probability and A is an outcome. If two outcomes are possible and they exclude each other, then the chance that B happens is 1- P(A).
Imagine research about people's favorite colors, say red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The chance that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).
Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you multiply the chance that someone is married (A) with the chance that someone has kids (B) given that they are married. The formula for this is: P(A and B) = P(A) * P(B given A). Because there is a connection between A and B, this is called a conditional probability.
Now, imagine researching multiple possibilities that are not connected. The chance that a random person likes to wear sweaters (A) and the chance that another random person likes to wear sweaters (B), is P (A and B) = P (A) x P (B). These are independent probabilities.
What is the difference in probability distributions for discrete and continuous variables?
A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has set possible values, a continuous variable can assume any value. Because a probability distribution shows the chances for each value a variable can take, this is different for discrete and continuous variables.
For a discrete variable a probability distribution gives the chances for each possible value. Every probability is a number between 0 and 1. The sum of all probabilities is 1. The probabilities are written P(y), where P is the probability that y has a certain value. In a formula: 0 ≤ P(y) ≤ 1, and ∑all y P(y) = 1.
Because a continuous variable has unlimited possible values, a probability distribution can't show them all. Instead a probability distribution for continuous variables shows intervals of possible values. The probability that a value falls within a certain interval is between 0 and 1. When an interval in a graph contains 20% of data, then the probability that a value falls within that interval is 0.20.
Just like a population distribution, a probability distribution has population parameters that describe the data. The mean describes the center and the standard deviation the variability. The formula for calculating a mean of the population distribution for a discrete variable is:
\[\mu=\sum{yP(y)}\]
This parameter is called the expected value of y and in written form it's E(y).
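A minimal sketch of this calculation in R, assuming a hypothetical discrete variable with four values and their probabilities:
> y <- c(0, 1, 2, 3)            # possible values (hypothetical)
> p <- c(0.2, 0.4, 0.3, 0.1)    # their probabilities, summing to 1
> sum(y * p)                    # expected value E(y) = 1.3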
How does the normal distribution work?
The normal distribution is useful because many variables have a similar distribution and the normal distribution can help to make statistical predictions. The normal distribution is symmetrical, shaped like a bell, and it has a mean (µ) and a standard deviation (σ). The empirical rule is applicable to the normal distribution: 68% falls within 1 standard deviation, 95% within 2 standard deviations and nearly all (99.7%) within 3 standard deviations.
The number of standard deviations is indicated as z. Software such as R, SPSS, Stata and SAS can find probabilities for a normal distribution, usually as cumulative probabilities (the probability of falling below a given value). Because the curve is symmetrical, the tail probability above +z equals the tail probability below –z. The formula for z is:
\[z=\frac{y-\mu}{\sigma}\]
The z-score is the number of standard deviations that a variable y is distanced from the mean. A positive z-score means that y falls above the mean, a negative score means that it falls below.
Conversely, when a probability is known, the corresponding value of y can be found. Software helps to find the z-score that belongs to a given probability. The formula is:
\[y=\mu+z\sigma\]
A special kind of normal distribution is the standard normal distribution, which consists of z-scores. A variable y can be converted to z by subtracting the mean and then dividing it by the standard deviation. Then a distribution is created where µ = 0 and σ = 1.
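In R the functions pnorm() and qnorm() do this work; the mean of 100 and standard deviation of 16 below are hypothetical example values:
> pnorm(1.96)                       # cumulative probability below z = 1.96 (about 0.975)
> qnorm(0.975)                      # z-score with 97.5% of the distribution below it
> z <- (120 - 100) / 16             # z-score of y = 120 when mu = 100 and sigma = 16
> 1 - pnorm(z)                      # probability of a value above 120
> qnorm(0.90, mean = 100, sd = 16)  # y value with 90% of the distribution below it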
A bivariate normal distribution is used for bivariate probabilities. In case of two variables (y and x), there are two means (µy and µx) and two standard deviations (σy and σx). The covariance is the way that y and x vary together: Covariance (x, y) = E[(x – µx)(y – µy)]
What is the difference between sample distributions and sampling distributions?
A simulation can tell whether an outcome of a test such as a poll is a good representation of the population. Software can generate random numbers.
When the characteristics of the population are unknown, samples are used. Statistics from samples give information about the expected parameters for the population. A sampling distribution shows the probabilities for sample measures (this is not the same as a sample distribution that shows the outcome of the data). For every statistic there is a sampling distribution, such as for the sample median, sample mean etc. This kind of distribution shows the probabilities that certain outcomes of that statistic may happen.
A sampling distribution serves to estimate how close a statistic lies to its parameter. A sampling distribution for a statistic based on n observations is the relative frequency distribution of that statistic, that in turn is the result of repeated samples of n. A sampling distribution can be formed using repeated samples but generally its form is known already. The sampling distribution allows to find probabilities for the values of a statistic of a sample with n observations.
How do you create the sampling distribution for a sample mean?
When the sample mean is known, its proximity to the population mean may still be a mystery; it's unknown whether ȳ = µ. However, the sampling distribution gives indications, for instance a high probability that ȳ falls within a certain distance of µ. When many samples are drawn, the mean of the sampling distribution equals the mean of the population.
The variability of the sampling distribution of ȳ is described by the standard deviation of ȳ, called the standard error of ȳ. This is written as σȳ. The formula for finding the standard error is:
\[\sigma_{\bar{y}}=\frac{\sigma}{\sqrt{n}}\]
The standard error indicates how much the mean varies per sample, this says something about how valuable the samples are.
For a random sample of size n, the standard error of ȳ depends on the standard deviation of the population (σ). When n gets bigger, the standard error becomes smaller. This means that a bigger sample represents the population better. The fact that the sample mean and the population mean are different, is called the sampling error.
The standard error and the sampling error are two different things. The sampling error indicates that the sample and the population are different in terms of the mean. The standard error measures how much samples differ from each other in terms of the mean.
For a sufficiently large random sample, the sampling distribution of ȳ is approximately a normal distribution, no matter how the population distribution is shaped. This is called the Central Limit Theorem. Even if the population distribution has very discrete values, the sampling distribution is approximately normal. However, when the population is very skewed, the sample needs to be bigger before the sampling distribution takes on the normal shape. For small samples the Central Limit Theorem can't necessarily be used.
Just like the standard error, the Central Limit Theorem is useful for finding information about the sampling distribution and the sample mean ȳ. Because it has a normal distribution, the Empirical Rule can be applied.
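The Central Limit Theorem can be illustrated with a small simulation in R (a sketch, using an arbitrary skewed population):
> population <- rexp(100000, rate = 1)                      # strongly skewed population (mean 1, sd 1)
> means <- replicate(5000, mean(sample(population, 30)))    # 5000 sample means with n = 30
> mean(means)                                               # close to the population mean
> sd(means)                                                 # close to sigma / sqrt(n) = 1 / sqrt(30)
> hist(means)                                               # roughly bell-shaped despite the skewness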
What is the connection between the population, the sample data and the sampling distribution?
To understand sampling, distinguishing between three distributions is important:
- The population distribution describes the entirety from which the sample is drawn. The parameters µ and σ denote the population mean and the population standard deviation.
- The sample data distribution portrays the variability of the observations made in the sample. The sample mean ȳ and the sample standard deviation s describe the curve.
- The sampling distribution shows the probabilities that a statistic from the sample, such as the sample mean, has certain values. It tells how much samples can differ.
The Central Limit Theorem says that the sampling distribution is shaped like a normal distribution. Information can be deduced just from this shape. The possibility to retrieve information from the shape is the reason that the normal distribution is so important to statistics.
How can estimates for statistical inference be made? – Chapter 5
How do you make point estimates and interval estimates?
Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).
Two kinds of parameter estimates exist;
- A point estimate is a number that is the best prediction.
- An interval estimate is an interval surrounding a point estimate, which you think contains the population parameter.
There is a difference between the estimator (the rule or formula by which estimates are made) and the estimate (the resulting number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.
A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.
An estimator is unbiased when its sampling distribution is centered around the parameter. The sample mean is an unbiased estimator of the population mean µ, because the mean of its sampling distribution equals µ; ȳ is therefore regarded a good estimator for µ.
A biased estimator systematically under- or overestimates the parameter. The sample range, for instance, tends to underestimate the population range, because the most extreme values in a sample can never be more extreme than those in the population, so the variability in the sample tends to underestimate the variability in the population.
An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Imagine a normal distribution: the standard error of the median is 25% bigger than the standard error of the mean, so the sample mean tends to be closer to the population mean than the sample median is. The sample mean is then the more efficient estimator.
A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).
Usually the sample mean serves as an estimator for the population mean, the sample standard deviation as an estimator for the population standard deviation, etc. This is indicated by a hat on a symbol, for instance:
\[\hat{\mu}\]
- means an estimate of the population mean µ.
A confidence interval is an interval estimate for a parameter: an interval of plausible values for the parameter. To find this interval, look at the sampling distribution, which is approximately normal. For a 95% confidence interval, the parameter estimate lies within roughly two standard errors of the point estimate. To calculate this, multiply the standard error by the z-score, then add and subtract the outcome from the point estimate; the two resulting numbers form the confidence interval. We can then be 95% confident that the population parameter lies between these two numbers. The z-score multiplied by the standard error is also called the margin of error.
So a confidence interval is: point estimate ± margin of error. The confidence level is the chance that the parameter really falls within the confidence interval. This is a number close to 1, like 0.95 or 0.99.
How do you calculate the confidence interval for a proportion?
Nominal and ordinal variables create categorical data (for instance 'agree' and 'not agree'). For this kind of data, means are useless. Instead, proportions or percentages are used. A proportion is between 0 and 1, a percentage between 0 and 100.
The unknown population proportion is written π. The sample proportion is the point estimate of the population proportion, meaning the sample is used to estimate the population proportion. The sample proportion is indicated by the symbol:
\[\hat{\pi}\]
A sample proportion is a special case of a sample mean (for data coded 0 and 1), so for large samples its sampling distribution has the normal shape and the Central Limit Theorem applies. Because it is a normal distribution, 95% falls within two standard errors of the mean; this gives the confidence interval. Calculating a confidence interval requires the standard error, but because the population standard error is unknown, the estimated standard error from the sample is used instead. This is indicated as se. The formula for estimating the standard error is:
\[se=\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}\]
The standard error needs to be multiplied by the z-score. For a normal distribution, the probability of falling within z standard errors of the mean equals the confidence level. For confidence intervals of 95% and 99%, z equals 1.96 and 2.58. A 95% confidence interval for the proportion π is:
\[\hat{\pi}\pm 1.96(se)\]
The general formula for a confidence interval is:
\[\hat{\pi}\pm z(se)\]
Confidence intervals are usually rounded to two decimal places.
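A minimal sketch of this calculation in R, with a hypothetical sample proportion of 0.73 from 1000 respondents:
> pi_hat <- 0.73; n <- 1000                 # hypothetical sample proportion and sample size
> se <- sqrt(pi_hat * (1 - pi_hat) / n)     # estimated standard error
> pi_hat + c(-1, 1) * 1.96 * se             # 95% confidence interval
> pi_hat + c(-1, 1) * qnorm(0.995) * se     # 99% confidence interval (z = 2.58)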
A bigger sample produces a smaller standard error and a more precise confidence interval. Specifically, the sample size needs to be quadrupled to double the precision (halve the margin of error).
The error probability is the chance that the parameter is outside of the estimated confidence interval. This is indicated as α (the Greek letter alpha), it is calculated as 1 – confidence level. If the confidence level is 0.98, then the error probability is 0.02.
When the sample is too small, the confidence interval doesn't say much because the error probability is too big. As a rule, at least 15 observations should fall within a category and at least 15 outside.
How do you calculate the confidence interval for a mean?
Finding the confidence interval for a mean works roughly the same way as for a proportion. For a mean the confidence interval is the point estimate ± the margin of error. In this case the margin of error consists of a t-score (instead of a z-score) multiplied by the standard error. The t-score comes from the t-distribution, which can be used for any sample size, even a tiny one, when the population standard deviation is unknown. The standard error is found by dividing the sample standard deviation s by the square root of the sample size n. The point estimate is the sample mean ȳ.
The formula for a 95% confidence interval for a population mean µ using the t-distribution is:
\[\bar{y}\pm t_{.025}(se)\]
where
\[se=\frac{s}{\sqrt{n}}\]
and df = n – 1
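A sketch of the same calculation in R with hypothetical data; t.test() gives the interval directly:
> y <- c(8, 10, 7, 12, 9, 11, 6, 10)                 # hypothetical sample
> n <- length(y)
> se <- sd(y) / sqrt(n)                              # standard error of the mean
> mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * se    # 95% confidence interval
> t.test(y)$conf.int                                 # same interval from the built-in t test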
With t-scores the confidence interval is a little wider than it would be with z-scores. The t-distribution looks like a normal distribution but it rises less high in the middle and its tails are a bit thicker. It's symmetrical around its mean of 0.
The standard deviation of the t-distribution is dependent on the degrees of freedom (df). With that, the standard deviation of the t-distribution is a bit bigger than 1. The formula for the degrees of freedom is: df = n – 1.
The bigger the degrees of freedom, the more the t-distribution looks like a normal distribution. It gets pointier. For df > 30 they are practically identical.
The t-scores can be found on the internet or in books about statistics. For instance, a 95% confidence interval has a t-score t0.025.
Robust means that a statistical method will hold even when a certain assumption is violated. Even for a distribution that isn't normal, the t-distribution can give a mean for a confidence level. However, for extreme outliers or very skewed distributions, this method doesn't work properly.
The t-distribution with infinite degrees of freedom is the standard normal distribution.
The t-distribution was discovered by Gosset while doing research for a brewery. He secretly published articles using Student as a name. Now, sometimes the t-distribution is named Student's t.
How do you choose the sample size?
For determining sample size, the desired margin of error and the desired confidence level need to be decided upon. The desired margin of error is indicated as M.
The formula for finding the right sample size to estimate a population proportion is:
\[n=\pi(1-\pi)(\frac{z}{M})^2\]
The z-score corresponds with the one for the chosen confidence level, like 1.96. The z-score is determined by the chance that the margin of error isn't bigger than M. The population proportion π can be guessed, or 0.50 can be used as a safe choice.
The formula for finding the right sample size to estimate a population mean is:
\[n=\sigma^2(\frac{z}{M})^2\]
Here the z-score also belongs to the chosen confidence level, like z = 1.96 for 0.95. The standard deviation of the population σ needs to be guessed.
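A worked sketch of both formulas in R; the margins of error, the guessed proportion and the guessed standard deviation are arbitrary example values:
> z <- 1.96                            # z-score for 95% confidence
> M <- 0.04                            # desired margin of error for a proportion
> 0.50 * (1 - 0.50) * (z / M)^2        # required n with the safe guess pi = 0.50 (about 600)
> sigma <- 15; M <- 2                  # guessed sigma and desired margin of error for a mean
> sigma^2 * (z / M)^2                  # required n for the mean (about 217 after rounding up)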
The desired sample size depends on the margin of error and on the confidence level, but also on variability. Data with high variability requires a bigger sample size.
Other factors influence choosing a sample size as well. The more complex the analysis and the more variables are relevant, the bigger the sample needs to be. Time and money also influence things. If it's unavoidable for a sample to be small, then for each category two fake observations are added, so that the formulas for the confidence interval remain useful.
What do maximum likelihood and bootstrap methods do?
Besides means and proportions, other statistics can describe data too. To make point estimates for other statistics as well, R.A. Fisher developed a method called maximum likelihood. This method chooses as estimate the parameter value for which the likelihood is maximal. The likelihood can be portrayed as a curve, so visually it immediately becomes clear where the highest point of likelihood is located. The likelihood of a parameter value is the probability of observing the given sample outcome if the parameter had that value.
This method has three advantages, especially for big samples: 1) efficiency: no other estimator has a smaller standard error or lies closer to the parameter, 2) little or no bias, and 3) a sampling distribution that is usually approximately normal.
Fisher showed that the sample mean, not the median, is the maximum likelihood estimator for normally distributed data. Only in exceptional cases, such as very skewed data, is the median the better estimator.
When even the shape of a population distribution is unknown, the bootstrap method can help. Software then treats the sample as if it were the population distribution and generates a new 'sample', this process is repeated many times. In this way, the bootstrap method can find the standard error and the confidence interval.
How do you perform significance tests? – Chapter 6
What are the five components of a significance test?
A hypothesis is a prediction that a parameter within the population has a certain value or falls within a certain interval. A distinction can be made between two kinds of hypotheses. A null hypothesis (H0) is the assumption that a parameter takes a certain value. Opposite is the alternative hypothesis (Ha), the assumption that the parameter falls in a range outside of that value. Usually the null hypothesis means no effect. A significance test (also called hypothesis test or test) checks whether enough evidence exists to support the alternative hypothesis. A significance test compares point estimates of parameters with the values expected under the null hypothesis.
Significance tests consist of five parts:
- Assumptions. Each test makes assumptions about the type of data (quantitative/categorical), the required randomization, the population distribution (for instance the normal distribution) and the sample size.
- Hypotheses. Each test has a null hypothesis and an alternative hypothesis.
- Test statistic. This indicates how far the estimate lies from the parameter value of H0. Often, this is shown by the number of standard errors between the estimate and the value of H0.
- P-value. This gives the weight of evidence against H0. The smaller the P-value is, the more evidence that H0 is incorrect and that Ha is correct.
- Conclusion. This is an interpretation of the P-value and a decision on whether H0 should be rejected or not.
How do you perform a significance test for a mean?
Significance tests for quantitative variables usually research the population mean µ. The five parts of a significance test come to play here.
It is assumed that the data come from a random sample and that the population distribution is approximately normal.
Usually the null hypothesis is H0: µ = µ0, in which µ0 is a particular value for the population mean. This hypothesis says that there is no effect. The test is usually two-sided, meaning that the alternative hypothesis contains all other values in both directions: Ha: µ ≠ µ0.
The test statistic is the t-score. The formula is as follows:
\[t=\frac{\bar{y}-\mu_0}{se}\]
where
\[se=\frac{s}{\sqrt{n}}\]
The sample mean ȳ estimates the population mean μ. If H0 is true, then the mean of the sampling distribution of ȳ equals the value µ0 (and lies in the middle of the distribution of ȳ). A value of ȳ far in the tail of the distribution gives strong evidence against H0: the further ȳ is from µ0, the bigger the t-score and the stronger the evidence against H0.
The P-value indicates how extreme the existing data would be if H0 would be true. The probability that this happens, is located in the two tails of the t-distribution. Software can find the P-value.
To draw conclusions, the P-value needs to be interpreted. If the P-value is smaller, the evidence against H0 is stronger.
For two-sided significance tests the conclusions from the confidence interval and the significance test agree. When a 95% confidence interval for µ contains the value µ0, the P-value is bigger than 0.05; when the interval doesn't contain µ0, the P-value is smaller than 0.05.
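As a minimal sketch with hypothetical data, t.test() in R carries out this test and also reports the 95% confidence interval, so the correspondence between test and interval can be checked directly:
> y <- c(2.1, 3.4, 2.8, 3.9, 2.5, 3.0, 3.6, 2.2)   # hypothetical sample
> t.test(y, mu = 3)                                 # two-sided test of H0: mu = 3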
In two-sided tests the region of rejection is in both tails of the normal distribution. In most cases a two-sided test is performed. However, in some cases the researcher already senses in which direction the effect will go, for instance that a particular type of meat will cause people to gain weight. Sometimes it's physically impossible that the effect will take the opposite direction. In these cases a one-sided test can be used, this is an easier way to test a specific idea. A one-sided test has the region of rejection in only one of its tails, which depends on the alternative hypothesis. If the alternative hypothesis says that there will be weight gain after consumption of a certain product, then the region of rejection is in the right tail. For two-sided tests the alternative hypothesis is Ha: µ ≠ µ0 (so the population mean can be anything but a certain value), for one-sided tests it is Ha: µ > µ0 or Ha: µ < µ0 (so the population mean needs to be either bigger or smaller than a certain value).
Researchers disagree about when one-sided tests are appropriate. Some researchers prefer a two-sided test, because it provides more substantial evidence to reject the null hypothesis. Other researchers prefer one-sided tests because they test a very specific hypothesis and are more sensitive: a small effect is detected sooner by a one-sided test than by a two-sided test. Generally, if the direction of the effect is unknown, two-sided tests are used.
The hypotheses are expressed in parameters for the population (such as µ), never in statistics about the sample (such as ȳ), because retrieving information about the population is the end goal.
Usually H0 is rejected when P is smaller or equal to 0.05 or 0.01. This demarcation is called the alpha level or significance level and it is indicated as α. If the alpha level decreases, the research should be more careful and the evidence that the null hypothesis is wrong should be stronger.
Two-sided tests are robust: even when the population distribution isn't normal, confidence intervals and tests using the t-distribution still function. However, significance tests don't work well for one-sided tests with a small sample and a very skewed population.
How do you perform a significance test for a proportion?
Significance tests for proportions work roughly the same way as significance tests for means. For categorical variables the sample proportion is used to test the population proportion.
In terms of assumptions, the data need to come from a random sample and the sample needs to be large enough for the sampling distribution to be approximately normal. If the H0 value is π0 = 0.50 (meaning the population is divided exactly in half, 50-50%), then the sample size needs to be at least 20.
The null hypothesis says that there is no effect, so H0: π = π0. The alternative hypothesis of a two-sided test contains all other values, Ha: π ≠ π0.
The test statistic for proportions is the z-score. The formula for the z-score used as a test statistic for a significance test of a proportion is:
\[z=\frac{\hat{\pi}-\pi_0}{se_0}\]
Here se0 = √(π0(1 – π0)/n) is the standard error computed under the assumption that H0 is true. The z-score measures how many standard errors the sample proportion lies from the value of the null hypothesis; it indicates how big the deviation is, how much of the expected effect is observed.
The P-value can be found with software or in a table; internet apps can also find it. The P-value indicates how probable the observed proportion (or a more extreme one) would be if H0 were true. For one-sided tests the tail probability of the z value is used directly; for two-sided tests this probability needs to be doubled.
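A sketch of the calculation in R, with hypothetical values for the sample proportion, the H0 value and the sample size:
> pi_hat <- 0.56; pi_0 <- 0.50; n <- 400    # hypothetical values
> se0 <- sqrt(pi_0 * (1 - pi_0) / n)        # standard error under H0
> z <- (pi_hat - pi_0) / se0                # test statistic (here z = 2.4)
> 2 * (1 - pnorm(abs(z)))                   # two-sided P-value (about 0.016)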
Drawing conclusions works the same way for proportions as for means. The smaller the P-value, the stronger the evidence against H0. The null hypothesis is rejected when P is smaller than or equal to α, for an alpha level of around 0.05. Even in case of strong evidence in favor of H0, many researchers will not 'accept' it; they avoid drawing conclusions that are too strong and will just 'not reject' H0.
Which errors can be made in significance tests?
To give people more insight into the findings of a significance test, it's better to give the P-value than to state merely whether the alternative hypothesis was accepted. This is an idea of Fisher. The collection of values for which the null hypothesis is rejected is called the rejection region.
Testing hypotheses is an inferential process. This means that a limited amount of information serves to draw a general conclusion. It's possible that a researcher thinks the null hypothesis should be rejected when the treatment doesn't really have effect. The cause is that samples aren't identical to populations. There can be many parts of a research where an error is created, for instance if an extreme sample happens to be selected. This is called a type I error; when the null hypothesis is rejected while it is true. This can have big consequences. However, there is only a small chance that a type I error occurs. The alpha level shows how big the probability is that type I error occurs, usually not exceeding 5%, sometimes limited to 2.5% or 1%. But smaller alpha levels also create the need to find more evidence to reject the null hypothesis.
A type II error occurs when a researcher doesn't reject the null hypothesis while it is wrong, type I error when the null hypothesis is rejected but it is true. If the probability of type I error decreases, the probability of type II error increases.
If P is smaller than 0.05, then H0 is rejected at α = 0.05. The values of µ0 that would not lead to rejection of H0 are the values inside the 95% confidence interval; for values in Ha that are not rejected, a type II error occurs.
Which limitations do significance tests have?
It is important to notice that statistical significance and practical significance are not the same. Finding a significant effect doesn't mean that it's an important find. The size of P simply indicates how much evidence exists against H0, and not how far the parameter lies from H0.
It's misleading to only report research that found significant effects. The same research may have been done 20 times, but only once with a significant effect, which may have been found by coincidence.
A significant effect doesn't say whether a treatment has a big effect. To get a better appreciation of the size of a significant effect, the effect size can be calculated: the difference between the sample mean and the null hypothesis value of the population mean (ȳ – µ0) is divided by the standard deviation. An effect size of about 0.2 or less isn't practically significant.
For interpreting the practical consequences of a research the confidence interval is more important than a significance test. Often H0 is only one value while other values might be plausible too. That's why a confidence interval with a spectrum of values gives more information.
Other ways that significance tests can mislead:
- Sometimes results are only reported when they are regarded as statistically significant.
- Statistical significance can be coincidence.
- The P-value is not the probability that H0 is true because it can either be true or false, not something in between.
- Real effects usually are smaller than the effects in research that gets a lot of attention.
Publication bias is when research with small effects isn't even published.
How can you calculate the probability of type II error?
A type II error can occur for any parameter value in the range of Ha. Every value within Ha has its own P(type II error), the probability that a type II error occurs at that value. This probability is usually calculated with software: the software creates sampling distributions for the null hypothesis value and for the alternative parameter value and compares the area where they overlap. The probability of a type II error decreases when the parameter value is further away from the null hypothesis, when the sample gets bigger and when the probability of a type I error increases.
The power of a test is the probability that the test rejects the null hypothesis when it is false. So the power is about finding an effect that is real. The formula for the power at a certain parameter value is: power = 1 – P(type II error). If the probability of a type II error decreases, the power increases.
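As a sketch of how such a calculation can work, assuming a one-sided one-sample z-test and hypothetical values for µ0, the true mean, σ, n and α:

```python
# Sketch: P(type II error) and power for a one-sided one-sample z-test.
# Hypothetical numbers: H0: mu = 100, Ha: mu > 100, true mean 104.
import math
from scipy.stats import norm

mu0, mu_true, sigma, n, alpha = 100.0, 104.0, 15.0, 50, 0.05
se = sigma / math.sqrt(n)
cutoff = mu0 + norm.ppf(1 - alpha) * se         # smallest sample mean that rejects H0
beta = norm.cdf(cutoff, loc=mu_true, scale=se)  # P(type II error) at the true mean
power = 1 - beta
print(round(beta, 3), round(power, 3))
```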
How is the binomial distribution used in significance tests for small samples?
Estimating proportions with small samples is difficult. For the outcome of a small sample with categorical discrete variables, like tossing a coin, a sampling distribution can be made. This is called the binomial distribution. A binomial distribution is only applicable when:
- Every observation falls within one of two categories.
- The probabilities are the same for every category.
- The observations are independent.
The symbol π is the probability of category 1, the symbol x in this case is the binomial variable. The probability of x observations in category 1 is:
\[P(x)=\frac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x}\]
The symbol n! is called n factorial; this is the product 1 × 2 × 3 × … × n. The binomial distribution is only symmetrical for π = 0.50. The mean is µ = nπ and the standard deviation is:
\[\sigma=\sqrt{n\pi(1-\pi)}\]
So even for tiny samples with fewer than 10 observations in each category a significance test can be done; the binomial distribution itself then serves as the sampling distribution. For example, H0 is π = 0.50 and Ha is π < 0.50.
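A minimal sketch of such a binomial test in Python, with hypothetical counts (2 observations in category 1 out of 10):

```python
# Sketch: binomial significance test for a small sample (hypothetical data).
# H0: pi = 0.50, Ha: pi < 0.50, with x = 2 observations in category 1 out of n = 10.
from scipy.stats import binom

n, x, pi0 = 10, 2, 0.50
p_value = binom.cdf(x, n, pi0)   # P(X <= 2) when H0 is true
print(round(p_value, 4))         # about 0.0547
```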
How do you compare two groups in statistics? - Chapter 7
What are the basic rules for comparing two groups?
In social science often two groups are compared. For quantitative variables means are compared, for categorical variables proportions. When comparing two groups, a binary variable is created: a variable with two categories (also called dichotomous). For instance for sex as a variable the results are men and women. This is an example of bivariate statistics.
Two groups can be dependent or independent. They are dependent when the respondents naturally match with each other. An example is longitudinal research, where the same group is measured in two moments in time. For an independent sample the groups don't match, for instance in cross-sectional research, where people are randomly selected from the population.
Imagine comparing two independent groups: men and women and the time they spend sleeping. Men and women are two different groups, with two population means, two estimates and two standard errors. The standard error indicates how much the sample mean varies from sample to sample. Because we want to investigate the difference, this difference also has a standard error. The population difference is estimated by the sample difference: what you want to know is µ₂ – µ₁, and this is estimated by ȳ2 – ȳ1. This can be shown in a sampling distribution. The standard error of ȳ2 – ȳ1 indicates how much this difference varies between samples. The formula is:
\[\text{estimated standard error}=\sqrt{(se_1)^2+(se_2)^2}\]
In this case se1 is the standard error of group 1 (men) and se2 the standard error of group 2 (women).
Instead of the difference also the ratio can be given. This is especially useful in case of very small proportions.
How do you compare two proportions of categorical data?
The difference between the proportions of two populations (π2 – π1) is estimated by the difference between the sampling proportions:
\[(\hat{\pi_2}-\hat{\pi_1})\]
When the samples are very large, the standard error of this difference is small and the estimate is precise.
The confidence interval is the point estimate of the difference ± the z-score multiplied by the standard error. The formula for the group difference is:
\[\text{confidence interval} = (\hat{\pi_2}-\hat{\pi_1})\pm z(se)\]
where
\[se=\sqrt{\frac{\hat{\pi_1}(1-\hat{\pi_1})}{n_1}+\frac{\hat{\pi_2}(1-\hat{\pi_2})}{n_2}}\]
When the confidence interval contains only positive values, π2 – π1 is positive and π2 is bigger than π1. If the confidence interval contains only negative values, π2 is smaller than π1. When the interval contains only values close to zero, the groups differ little.
For a significance test to compare the proportions of two groups, H0 : π2 = π1. This would mean that the proportion is exactly equal in each group. Another possible H0 is π2 – π1 = 0, which also says that there is no difference. Calculating the z-score and the P-value works in roughly the same way as for one group, but here a single estimate of the proportion is used for both groups of the sample. This is called a pooled estimate: the successes of both samples are combined and divided by the combined sample size. With this the standard error can be calculated. For se0, the standard error in case the null hypothesis is true, another formula is used:
\[se_0=\sqrt{\hat{\pi}(1-\hat{\pi})(\frac{1}{n_1}+\frac{1}{n_2})}\]
This can be calculated with software. A clear way to present results is in a contingency table. In a contingency table the categories of the explanatory variable are placed in the rows and the categories of the response variable in the columns. The cells indicate the combinations of findings.
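A minimal Python sketch of this two-proportion test with hypothetical counts; the pooled estimate and both standard error formulas follow the equations above:

```python
# Sketch: comparing two proportions with a pooled estimate (hypothetical counts).
import math
from scipy.stats import norm

x1, n1 = 60, 200    # successes and sample size, group 1
x2, n2 = 90, 250    # successes and sample size, group 2
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                       # pooled estimate under H0
se0 = math.sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
z = (p2 - p1) / se0
p_value = 2 * norm.sf(abs(z))                        # two-sided P-value
# the 95% confidence interval uses the unpooled standard error
se = math.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
ci = ((p2 - p1) - 1.96*se, (p2 - p1) + 1.96*se)
print(round(z, 2), round(p_value, 3), [round(v, 3) for v in ci])
```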
How do you compare two means of quantitative data?
For the difference between two population means (µ₂ – µ₁) a confidence interval can be calculated using the sampling distribution of (ȳ2 – ȳ1), the formula is:
\[(\bar{y_2}-\bar{y_1})\pm t(se)\]
where
\[se=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
The t-score is the one that fits the chosen confidence level. The degrees of freedom df are usually calculated by software. When the standard deviations and sample sizes are equal for each group, then a simplified formula for the degrees of freedom is: df = (n1 + n2 – 2). The outcome is positive or negative and indicates which of the two groups has a higher mean.
For a significance test for comparing two means, H0 : µ1 = µ2, which is equivalent to H0 : µ₂ – µ₁ = 0.
The formula is:
\[t=\frac{(\bar{y_2}-\bar{y_1})-0}{se}\]
The standard error and the degrees of freedom are the same as for a confidence interval for two means. Researchers are more often interested in the difference between two groups than in a single group, so significance tests for just one group are used less often.
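A small sketch of such a two-sample t-test in Python, using hypothetical measurements and scipy's Welch version (which does not assume equal standard deviations):

```python
# Sketch: comparing two means with a t-test (hypothetical hours of sleep).
from scipy import stats

men =   [6.5, 7.0, 7.2, 6.8, 7.5, 6.9]
women = [7.1, 7.4, 7.8, 7.2, 7.6, 7.9]
t, p = stats.ttest_ind(women, men, equal_var=False)   # Welch test, unequal variances
print(round(t, 2), round(p, 3))
```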
How do you compare the means of dependent samples?
Dependent samples compare matched-pairs data. In longitudinal research, where the same subjects are measured at different moments in time, repeated measures are taken. An example is a crossover study, in which a subject gets one treatment and later another treatment.
When matched pairs are compared, for each pair a difference variable is created (named yd): difference = observation in sample 2 – observation in sample 1. The sample mean of these differences is ȳd. A rule for matched pairs is that the difference between the means equals the mean of the difference scores.
The significance test is:
\[t=\frac{\bar{y_d}-0}{se}\]
When a significance test is performed over different observations for dependent pairs, it is called the paired difference t-test.
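A minimal sketch of the paired difference t-test with hypothetical before/after scores:

```python
# Sketch: paired-difference t-test for dependent samples (hypothetical before/after scores).
from scipy import stats

before = [12, 15, 11, 14, 13, 16]
after  = [14, 16, 13, 15, 15, 18]
t, p = stats.ttest_rel(after, before)   # works on the difference scores per pair
print(round(t, 2), round(p, 3))
```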
The advantages of dependent samples are:
- Other variables influence both the first and the second sample because the same subjects are used.
- The variability and the standard error are smaller.
What complex methods can be used for comparing means?
Beside the t-test, there are other methods for comparing means. Examples are the assumption of identical standard deviations, randomized block design, effect size and a model.
Using the assumption of identical standard deviations to compare means
For independent samples, the null hypothesis can entail that the distributions of the response variable are identical in both groups. In that case the standard deviations and the means are also identical. The estimate of the common standard deviation is:
\[s=\sqrt{\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2}}=\sqrt{\frac{\sum{(y_{i1}-\bar{y_1})^2}+\sum{(y_{i2}-\bar{y_2})^2}}{n_1+n_2-2}}\]
The confidence interval is:
\[(\bar{y_2}-\bar{y_1})\pm t(se)\]
in which
\[se=\sqrt{\frac{s^2}{n_1}+\frac{s^2}{n_2}}=s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]
The degrees of freedom are the combined number of observations minus the number of estimated parameters (µ1 and µ2), so df = n1 + n2 – 2.
Using a randomized block design to compare means
Another method is the randomized block design. This means that subjects that are alike, are regarded as a pair and only one of the two gets treatment.
Software makes inferences both for the case in which the variability is assumed equal for the two groups and for the case in which equal variance is not assumed. So it can be assumed that the population standard deviations are the same (σ1 = σ2), but this isn't necessary. When the sample sizes are (nearly) the same, the test statistics are practically identical whether or not equal variance is assumed. However, when hugely different standard deviations are suspected, this method isn't appropriate. It's better not to use the F test that software offers for testing whether the standard deviations are equal, because it isn't robust to non-normal distributions.
Using the effect size to compare means
Another method is using the effect size, using the formula:
\[\text{Effect size} = \frac{\bar{y_1}-\bar{y_2}}{s}\]
The outcome is regarded big when it is 1 or bigger. This method is specifically useful if the difference would vary a lot depending on units of measure (like kilometers or miles).
Using a model to compare means
Another way to compare means, is by using a model: a simple approach of the real association between two (or more) variables in the population. A normal distribution with a mean and a standard deviation is written as N(µ, σ). y1 is an observation from group 1 and y2 is an observation from group 2. A model can be:
H0 : y1 has a distribution N(µ, σ1) and y2 has a distribution N(µ, σ2)
Ha : y1 has a distribution N(µ1, σ1) and y2 has a distribution N(µ2, σ2) and µ1 ≠ µ2
This investigates whether the means differ. The standard deviations aren't assumed equal, because that would simplify reality too much, allowing for big mistakes.
What complex methods can be used for comparing proportions?
Even for dependent or very small samples, methods exist to compare proportions. For dependent samples a z-score that compares proportions can be used, or McNemar's test, or a confidence interval. For small samples Fisher's exact test applies.
The z-score measures the number of standard errors between the estimate and the value of the null hypothesis. The formula in this case is: (sample proportion – null hypothesis proportion) / standard error.
For paired proportions, McNemar's test applies. The test statistic is:
\[z=\frac{n_{12}-n_{21}}{\sqrt{n_{12}+n_{21}}}\]
Besides a significance test, a confidence interval can also be used to examine the difference between dependent proportions. The formula is:
\[(\hat{\pi_2}-\hat{\pi_1})\pm z(se)\]
in which
\[se=\frac{1}{n}\sqrt{\frac{(n_{12}+n_{21})-(n_{12}-n_{21})^2}{n}}\]
Fisher's exact test is a complex test, but it can be performed with software to compare very small samples.
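A small sketch of both ideas with hypothetical counts: the McNemar z statistic computed from the discordant counts n12 and n21, and Fisher's exact test on a tiny 2x2 table via scipy:

```python
# Sketch: McNemar z for dependent proportions and Fisher's exact test for a small 2x2 table.
# All counts are hypothetical; n12 and n21 are the two kinds of discordant pairs.
import math
from scipy.stats import norm, fisher_exact

n12, n21 = 25, 12
z = (n12 - n21) / math.sqrt(n12 + n21)      # McNemar test statistic
print(round(z, 2), round(2 * norm.sf(abs(z)), 3))

table = [[8, 2],                            # small independent-samples 2x2 table
         [1, 5]]
odds_ratio, p = fisher_exact(table)
print(round(odds_ratio, 2), round(p, 3))
```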
Which nonparametric methods exist for comparing groups?
Parametric methods assume a certain distribution shape, like the normal distribution. Nonparametric methods don't make assumptions about distribution shape.
Nonparametric methods for comparing groups are mostly used for very small samples or very skewed distributions. Examples are the Wilcoxon test, Mann-Whitney test and nonparametric measure of effect size.
Some nonparametric tests assume that the shapes of the population distributions are identical. The model for this is:
H0 : y1 and y2 have the same distribution.
Ha : The distributions of y1 and y2 have the same shape, but the distribution of y1 is shifted higher than that of y2.
The Wilcoxon test uses an ordinal scale, it assigns ranking to the observations.
The Mann-Whitney test compares every observation from one group with every observation from the other group, for instance two sets of weather forecasts made by different forecasting companies.
The effect size can be applied to nonparametric distributions. The observations from one group are compared to see if they are for instance higher than the observations from another group.
Another option is treating ordinal variables as quantitative variables. This gives a score to each category. This can be an easier method than treating rankings as ordinal variables.
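As an illustration, a minimal sketch of a nonparametric two-group comparison on hypothetical scores, using scipy's Mann-Whitney test:

```python
# Sketch: nonparametric comparison of two small groups (hypothetical scores).
from scipy import stats

group1 = [3, 5, 6, 8, 9]
group2 = [4, 7, 10, 12, 15]
u, p = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(u, round(p, 3))
```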
How do you analyze the association between categorical variables? – Chapter 8
How do you create and interpret a contingency table?
A contingency table contains the outcomes of all possible combinations of categorical data. A 4x5 contingency table has 4 rows and 5 columns. It often shows percentages; this is called relative data.
A conditional distribution means that the data is dependent on a certain condition and shown as percentages of a subtotal, like the percentage of women that have a cold. A marginal distribution contains only the row or column totals. A simultaneous (joint) distribution shows the percentages with respect to the entire sample.
Two categorical variables are statistically independent when the probability that one occurs is unrelated to the probability that the other occurs. So this is when the probability distribution of one variable is not influenced by the outcome of the other variable. If this does happen, they are statistically dependent.
What is a chi-squared test?
When two variables are independent, this gives information about variables in the population. Probably the sample will be similarly distributed, but not necessarily. The variability can be high. A significance test tells whether it's plausible that the variables really are independent in the population. The hypotheses for this test are:
H0: the variables are statistically independent
Ha: the variables are statistically dependent
A cell in a contingency table shows the observed frequency (fo), the number of times that a combination is observed. The expected frequency (fe) is the number that is expected if the null hypothesis is true, so when the variables are independent. The expected frequency is calculated by multiplying the row total by the column total and dividing this product by the total sample size.
A significance test for independence uses a special test statistic. X² says how close the expected frequencies are to the observed frequencies. The test that is performed is called the chi-squared test (of independence). The formula for this test is:
\[X^2=\sum{\frac{(f_o-f_e)^2}{f_e}}\]
This method was developed by Karl Pearson. When X² is small, the expected and observed frequencies are close together. The bigger X², the further they are apart. So this test statistic indicates how much the observed data deviate from what independence would predict.
A binomial distribution shows the probabilities of outcomes of a small sample with categorical discrete variables, like tossing a coin. This is not a distribution of observations or a sample but a distribution of probabilities. A multinomial distribution is the same, except that it has more than two categories.
For large samples, the sampling distribution of X² is approximately the chi-squared probability distribution. The symbol χ² for the chi-squared distribution resembles the symbol X² for the test statistic.
The most important characteristics of the chi-squared distribution are:
- The distribution is always positive, X² can never be negative.
- The distribution is skewed to the right.
- The exact shape of the distribution depends on the degrees of freedom (df). For the chi-squared distribution, µ = df and σ = √(2df). The curve gets flatter when df gets bigger.
- If r is the number of rows and c the columns, df = (r – 1)(c – 1).
- When the contingency table becomes bigger, so do the degrees of freedom, and larger values of X² are needed for the same strength of evidence.
- The larger X² (for given degrees of freedom), the stronger the evidence against H0.
X² is used both for means and proportions. For proportions, research results (such as 'yes' and 'no') can be divided into success and failure. π1 is the proportion of success in group 1, π2 the proportion of success in group 2. When the response variable is independent of the populations, then π1 = π2. This is called a homogeneity hypothesis, and the chi-squared test is then also called a homogeneity test. The test statistic is:
\[z=\frac{\hat{\pi_2}-\hat{\pi_1}}{se_0}\]
in which X² = z².
The test statistics z-score and X² are used in different cases. Z-score is applicable for instance for one-sided alternative hypotheses. But for a contingency table larger than 2x2, the X² is better because it can handle multiple parameters. Df can be interpreted as the number of parameters required to describe the contingency table.
The chi-squared test does have limitations. It only works for large samples, with an expected frequency higher than 5 per cell. For small samples Fisher's exact test is better. The chi-squared test works best for nominal scales. For ordinal scales other tests are preferred.
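A minimal sketch of the chi-squared test of independence on a hypothetical contingency table, using scipy:

```python
# Sketch: chi-squared test of independence on a hypothetical contingency table.
from scipy.stats import chi2_contingency

observed = [[30, 20, 10],    # rows: categories of the explanatory variable
            [20, 25, 15]]    # columns: categories of the response variable
x2, p, df, expected = chi2_contingency(observed)
print(round(x2, 2), df, round(p, 3))
print(expected)              # the fe values under independence
```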
In which way do residuals help to analyze the association between variables?
When the P-value of a chi-squared test is very small, there is strong evidence of an association between the variables. This says nothing about in which way the variables are connected or how strong this association is. That's why residuals are important. A residual is the difference between the observed and expected frequency of a cell: fo – fe. When a residual is positive, the observed frequency is bigger than expected. A standardized residual indicates how many standard errors fo falls from fe when H0 (independence) is true. The formula for a standardized residual is:
\[z=\frac{f_o-f_e}{se}=\frac{f_o-f_e}{\sqrt{f_e(1-\text{row proportion})(1-\text{column proportion})}}\]
A big standardized residual is the evidence against independence in a certain cell. When the null hypothesis is true, the probability is only 5% that a standardized residual has a value higher than 2. So a residual of under -3 or above 3 is very convincing evidence. Software gives both the test statistic X² and the residuals. In a 2x2 contingency table the standardized residual is the same as the z test statistic for comparing two proportions.
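A small sketch that computes the standardized residuals of a hypothetical table directly from the formula above:

```python
# Sketch: standardized residuals per cell, following the formula above (hypothetical counts).
import numpy as np

observed = np.array([[30, 20, 10],
                     [20, 25, 15]], dtype=float)
n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n
expected = row_prop * col_prop * n                     # row total * column total / n
se = np.sqrt(expected * (1 - row_prop) * (1 - col_prop))
std_resid = (observed - expected) / se
print(np.round(std_resid, 2))   # values beyond about +/-3 are strong evidence in that cell
```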
How can the association in a contingency table be measured?
In analyzing a contingency table, research hopes to find out:
- Whether there is an association (measured by chi-squared test)
- How the data differ from independence (measured by standardized residuals)
- How strong the association is between variables
Several measures of association size up the connection between variables. They compare the most extreme possible association with the complete absence of association and determine where the data fall between these two extremes.
The least strong association is for instance in a sample of 60% students and 40% non-students, where 30% of students say they love beer and 30% of non-students say they love beer. This is not a real situation. The most extreme association would be if 100% of students love beer and 0% of non-students. In reality the percentage lies in between.
In a simple binary 2x2 contingency table it's easy to compare proportions. If the association is strong, the absolute difference between the proportions is large.
The chi-squared test measures only how much evidence there is of an association; it does not measure how strong the association is. For instance, a large sample can provide strong evidence that a weak association exists.
When the outcome of a binary response variable is labelled success or failure, the odds can be calculated: odds of success = probability of success / probability of failure. When the odds are 3, success is three times as likely as failure. The probability of a certain outcome is odds / (odds + 1). The odds ratio of a 2x2 contingency table compares the odds of one group with the odds of another group: odds of row 1 / odds of row 2. The odds ratio is denoted θ.
The odds ratio has the following characteristics:
- The value doesn't depend on which variable is chosen as a response variable.
- The odds ratio equals the ratio of the products of the diagonally opposite cell counts, hence it's also called the cross-product ratio.
- The odds ratio can have any non-negative number.
- When the probability of success is the same for two rows, then the odds ratio is 1.
- An odds ratio smaller than 1 means that the odds of success are smaller for row 1 than for row 2.
- The further the odds ratio is from 1, the stronger the association.
- The odds ratio can be computed in two directions; the two values are each other's reciprocal and describe the same strength of association.
When a contingency table is more complex than 2x2, then the odds ratio is divided in smaller 2x2 contingency tables. Sometimes it's possible to capture a complex collection of data in a single number, but it's better to present multiple comparisons instead (like multiple odds ratios), to better reflect the data.
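A minimal sketch of the odds and the odds ratio for a hypothetical 2x2 table:

```python
# Sketch: odds and odds ratio for a hypothetical 2x2 table (rows: groups, columns: success/failure).
table = [[40, 10],    # group 1: 40 successes, 10 failures
         [25, 25]]    # group 2: 25 successes, 25 failures
odds1 = table[0][0] / table[0][1]
odds2 = table[1][0] / table[1][1]
odds_ratio = odds1 / odds2                  # equals the cross-product ratio (40*25)/(10*25)
print(odds1, odds2, round(odds_ratio, 2))   # 4.0, 1.0, 4.0
```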
How do you measure the association between ordinal variables?
An association between ordinal variables can be positive or negative. A positive association means that a higher score on x goes along with a higher score on y. A negative association means that a higher score on x entails a lower score on y.
A pair of observations can be concordant (C) or discordant (D). A pair of observations is concordant when the subject that scores higher on one variable also scores higher on the other variable (evidence of a positive association). A pair is discordant when the subject that scores higher on one variable scores lower on the other (evidence of a negative association).
Because bigger samples have more pairs, the difference is standardized, which gives gamma; the sample value is denoted γ̂. Gamma measures the association between ordinal variables. Its formula is: γ̂ = (C – D) / (C + D).
The gamma value lies between -1 and +1. It indicates whether the association is positive or negative and how strong the association is. The larger the absolute value of gamma, the stronger the association. For instance, a gamma value of 0.17 indicates a positive but weak association. Gamma is a difference between ordinal proportions: the difference between the proportions of concordant and discordant pairs.
Other measures of association are Kendall's tau-b, Spearman's rho-b, and Somers' d. Like gamma, these are ordinal measures that work similarly to the correlation for quantitative variables.
A confidence interval can also be calculated for gamma. In that case γ̂ denotes sample gamma and γ population gamma, and γ̂ ± z(se) is the confidence interval, with z = (γ̂ – 0) / se for the corresponding test. This works best if C and D are both higher than 50.
If two variables are ordinal, then an ordinal measure is preferable over chi-squared test, because chi-squared test ignores rankings.
Other ordinal methods work in ways similar to gamma. An alternative is a test of linear-by-linear association, in which each category of each variable is assigned a score and the correlation is analyzed with a z-test. This is a method to detect a trend.
For a mix of ordinal and nominal variables, especially if the nominal variable has more than two categories, it's better not to use gamma.
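As an illustration, a small sketch that counts concordant and discordant pairs for two hypothetical ordinal variables and computes gamma:

```python
# Sketch: gamma from concordant and discordant pairs (hypothetical ordinal scores).
x = [1, 1, 2, 2, 3, 3, 3, 4]
y = [2, 1, 2, 3, 3, 2, 4, 4]

C = D = 0
n = len(x)
for i in range(n):
    for j in range(i + 1, n):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx * dy > 0:
            C += 1            # concordant: both variables order the pair the same way
        elif dx * dy < 0:
            D += 1            # discordant: the orderings disagree (ties are skipped)
gamma = (C - D) / (C + D)
print(C, D, round(gamma, 2))
```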
How do linear regression and correlation work? – Chapter 9
What are linear associations?
Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable using the explanatory variable.
The response variable is denoted as y and the explanatory variable as x. A linear function means that a straight line runs through the data points in a graph. A linear function is: y = α + β(x). In this function alpha (α) is the y-intercept and beta (β) is the slope.
The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.
The y-intercept is the value of y when x = 0. In that case β(x) equals 0 and only y = α remains. The y-intercept is the point where the line crosses the y-axis.
The slope (β) indicates the change in y for an increase of 1 in x. So the slope is an indication of how steep the line is: the larger the absolute value of β, the steeper the line.
When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.
What is the least squares prediction equation?
In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.
The variable y is estimated by ŷ. The equation is estimated by the prediction equation: ŷ = a + b(x). This line is the best line, the line closest to all data points. In the prediction equation, a = ȳ – bx̄ and:
\[b=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sum{(x-\bar{x})^2}}\]
A regression outlier is a data point far outside the trend of the other data points. It's called influential when removing it would cause a big change for the prediction equation. The effect is smaller for large datasets. Sometimes it's better for the prediction equation to leave the outlier out and explain this when reporting the results.
The prediction equation estimates the values of y, but they won't completely match the actual observed values. Studying the differences indicates the quality of the prediction equation. The difference between an observed value (y) and the predicted value (ŷ) is called a residual, this is y – ŷ. When the observed value is bigger, the residual is positive. When the observed value is smaller, the residual is negative. The smaller the absolute value of the residual, the better the prediction is.
The best prediction equation has the smallest residuals. To find it, the SSE (sum of squared errors) is used. SSE tells how good or bad ŷ is in predicting y. The formula of the SSE is:
\[SSE=\sum{(y-\hat{y})^2}\]
The least squares estimates a and b in the least squares line ŷ = a + b(x) have the values for which SSE is as small as possible. This gives the best possible line that can be drawn. In most software SSE is called the residual sum of squares.
The best regression line has both negative and positive residuals (which all become positive when they are squared in the SSE); the residuals themselves sum to 0 and have mean 0. The best line passes through the center of the data, the point of the means (x̄, ȳ).
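A minimal sketch of the least squares formulas above on hypothetical data points:

```python
# Sketch: least squares slope and intercept computed from the formulas above (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x
sse = np.sum((y - y_hat)**2)                 # residual sum of squares
print(round(a, 3), round(b, 3), round(sse, 3))
```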
What is a linear regression model?
In y = a + b(x) there is the same sort of y-value for every x-value. This is a deterministic model. Usually this isn't how reality works. For instance when age (x) predicts the number of relationships someone has been in (y), then not everybody has had the same number at age 22. In that case a probabilistic model is better; a model that allows variability in the y-value. The data can then be visualized in a conditional distribution, a distribution that has the extra condition that x has a certain value.
A probabilistic model shows the mean of the y-values, not the actual values. The formula of a conditional distribution is E(y) = α + β (x). The symbol E means the expected value. When for instance people aged 22 have had different numbers of relationships, the probabilistic model can predict the mean number of relationships.
A regression function is a mathematical equation that describes how the mean of the response variable changes when the value of the explanatory variable changes.
Another parameter of the linear regression model is the standard deviation of a conditional distribution, σ. This parameter measures the variability of the y-values for all subjects with a certain x-value. This is called the conditional standard deviation.
Because the real standard deviation is unknown, the sample standard deviation is used:
\[s=\sqrt{\frac{SSE}{n-2}}\]
The assumption is made that the standard deviation is the same for every x-value. If the variability differed per value of x, then s would indicate an average variability. The Mean Square Error (MSE) is s squared. In software the conditional standard deviation has several names: Standard error of the estimate (SPSS), Residual standard error (R), Root MSE (Stata and SAS).
The degrees of freedom for a regression function are df = n – p, in which p is the number of unknown parameters. In E(y) = α + β (x) there are two unknown parameters (α and β) so df = n – 2.
The conditional standard deviation depends both on y and on x and is written as σy|x (for the population) and sy|x (for the sample), shortened σ and s. In a marginal distribution the standard deviation only depends on y, so this is written as σy (for the population) and sy (for the sample). The formula of a point estimate of the standard deviation is:
\[\sqrt{\frac{\sum{(y-\bar{y})^2}}{n-1}}\]
The upper part in the root, Σ (y – ȳ)2, is the total sum of squares. The marginal standard deviation (independent of x) and the conditional standard deviation (dependent on a certain x) can be different.
How does the correlation measure the association of a linear function?
The slope tells how steep a line is and whether the association is negative or positive, but it doesn't tell how strong the association between two variables is.
The association is measured by the correlation (r). This is a standardized version of the slope. It is also called the standardized regression coefficient or Pearson correlation. The correlation is the value that the slope would have if the two variables had equal variability. The formula is:
\[r=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{[\sum{(x-\bar{x})^2}][\sum{(y-\bar{y})^2}]}}\]
In regard to the slope (b), the r is: r = (sx / sy) b, in which sx is the standard deviation of x and sy is the standard deviation of y.
The correlation has the following characteristics:
- It can only be used if a straight line makes sense.
- It lies between 1 and -1.
- It is positive/negative, the same as b.
- If b is 0, then r is 0, because then there is no slope and no association.
- If r increases, then the linear association is stronger. If r is exactly -1 or 1, then the linear association is perfectly negative or perfectly positive, without errors.
- The r does not depend on units of measurement.
The correlation implies regression toward the mean: when x changes by one standard deviation, the predicted value of y changes by only r standard deviations.
The coefficient of determination r2 is r-squared and it indicates how good x can predict y. It measures how good the least squares line ŷ = a + b(x) predicts y compared to the prediction of ȳ.
The r2 has four elements;
- Rule 1: y is predicted, no matter what x is. The best prediction then is the sample mean ȳ.
- Rule 2: y is predicted by x. The prediction equation ŷ = a + b(x) predicts y.
- E1 are the errors of rule 1 and E2 are the errors of rule 2.
- The proportional reduction in error is the coefficient of determination: r2 = (E1 - E2) / E1, in which E1 = Σ (y – ȳ)2, the total sum of squares (TSS), and E2 = Σ (y – ŷ)2, the SSE.
R-squared has a number of characteristics similar to r:
- Because r is between 1 and -1, the r2 needs to be between 0 and 1.
- When SSE = 0, then r2 = 1. All points are on the line.
- When b = 0, then r2 = 0.
- The closer r2 is to 1, the stronger the linear association is.
- The units of measurement and which variable is the explanatory one (x or y), don't matter for r2.
The TSS describes the variability in the observations of y. The SSE describes the variability around the prediction equation. The coefficient of determination indicates by what proportion the variance of a conditional distribution is smaller than that of the marginal distribution. Because the coefficient of determination doesn't use the original scale but a squared version, some researchers prefer the standard deviation and the correlation, because the information they give is easier to interpret.
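A small sketch that computes r and the coefficient of determination from TSS and SSE for the same kind of hypothetical data:

```python
# Sketch: correlation and coefficient of determination via TSS and SSE (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
tss = np.sum((y - y.mean())**2)              # errors of rule 1 (predict with y-bar)
sse = np.sum((y - (a + b * x))**2)           # errors of rule 2 (predict with the line)
r_squared = (tss - sse) / tss
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3), round(r**2, 3), round(r_squared, 3))   # r**2 equals (TSS - SSE) / TSS
```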
How do you predict the slope and the correlation?
For categorical variables, the chi-squared test is used to test for independence. For quantitative variables, a significance test for the slope or for the correlation provides a test of independence.
The assumptions for inference applied to regression are:
- Randomization
- The mean of y is approximated by E(y) = α + β (x)
- The conditional standard deviation σ is equal for every x-value
- The conditional distribution of y for every x-value is a normal distribution
The null hypothesis is H0 : β = 0 (in that case there is no slope and the variables are independent), the alternative hypothesis is Ha : β ≠ 0.
The t-score is found by dividing the sample slope (b) by the standard error of b. The formula is t = b / se. This formula is similar to the formula for every t-score; the estimate minus the null hypothesis (0 in this case), divided by the standard error of the estimate. You can find the P-value for df = n – 2. The standard error of b is:
\[se=\frac{s}{\sqrt{\sum{(x-\bar{x})^2}}}\]
where
\[s=\sqrt{\frac{SSE}{n-2}}\]
The smaller the standard deviation s, the more precise b estimates β.
The correlation is denoted by the Greek letter ρ. The ρ is 0 in the same situations in which β = 0. A test whether H0 : ρ = 0 is performed in the same way as a test for the slope. For the correlation the formula is:
\[t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}\]
When many variables possibly influence a response variable, these can be portrayed in a correlation matrix. For each variable the correlation can be calculated.
A confidence interval gives more information about a slope than an independence test. The confidence interval of the slope β is: b ± t(se).
Calculating a confidence interval for a correlation is more difficult, because the sampling distribution isn't symmetrical unless ρ = 0.
R2 indicates how good x predicts y and it depends on TSS (the variability of the observations of y) and SSE (the variability of the prediction equation). The difference, TSS – SSE, is called the regression sum of squares or the model sum of squares. This difference is the total variability in y that is explained by x using the least squares line.
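A minimal sketch of slope inference on hypothetical data, using scipy's linregress (which reports the slope, its standard error and the two-sided P-value for H0: β = 0):

```python
# Sketch: inference for the slope on hypothetical data.
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.8, 3.9, 4.1, 5.2, 5.8]
result = stats.linregress(x, y)
t = result.slope / result.stderr           # t = b / se
print(round(result.slope, 3), round(result.stderr, 3), round(t, 2), round(result.pvalue, 4))
```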
What happens when the assumptions of a linear model are violated?
Often the assumption is made that a linear association exists. It's important to check the data in a scatterplot first to see whether a linear model makes sense. If the data is U-shaped, then a straight line doesn't make sense. Making this error could cause the result of an independence test of the slope to be wrong.
Other assumptions are that the distribution is normal and that σ is identical for every x-value. Even when the distribution isn't normal, then the least squares line, the correlation and the coefficient of determination are still useful. But if the standard deviation isn't equal, then other methods are more efficient than the least squares line.
Some outliers have big effects on the regression lines and the correlations. Sometimes outliers need to be taken out. Even one point can have a big influence, particularly for a small sample.
The assumption of randomization, both for x and y, is important for the correlation. If the sample covers only a restricted range of x-values, the variability is small and the sample correlation will tend to underestimate the population correlation. For other aspects of regression, like the slope, this is less of a problem.
The prediction equation shouldn't be extrapolated and used for (non-existent) data points outside of the range of the observed data. This could have absurd results, like things that are physically impossible.
The theoretical risk exists that the mean of y for a certain value of x doesn't estimate the actual individual observation properly. The Greek letter epsilon (ε) denotes the error term; how much y differs from the mean. The population model is y = α + β x + ε and the sample prediction equation is y = a + bx + e. The ε is also called the population residual.
A model is only an approximation of reality. It shouldn't be too simple. If a model is too simple, it should be adjusted.
What type of multivariate relationships exist? – Chapter 10
How does causality relate to associations?
Many scientific studies research more than two variables, requiring multivariate methods. A lot of research is focused on the causal relationship between variables, but finding proof of causality is difficult. A relationship that appears causal may be caused by another variable. Statistical control is the method of checking whether an association between variables changes or disappears when the influence of other variables is removed. In a causal relationship, x → y, the explanatory variable x causes the response variable y. This is asymmetrical, because y does not need to cause x.
There are three criteria for a causal relationship:
- Association between the variables
- Appropriate time order
- Elimination of alternative explanations
An association is required for a causal relationship, but an association alone does not establish one. Usually it immediately becomes clear what a logical time order is, such as an explanatory variable preceding a response variable. Apart from x and y, extra variables may provide an alternative explanation. In observational studies it can almost never be proved that one variable causes another. Sometimes there are outliers or anecdotes that contradict causality, but usually a single anecdote isn't enough proof to contradict causality. It's easier to find causality with randomized experiments than with observational studies, because randomization assigns subjects to the groups randomly and the time frame is set before the experiment starts.
How do you control whether other variables influence a causal relationship?
Eliminating alternative explanations is often tricky. A method of testing the influence of other variables is controlling them; eliminating them or keeping them on a constant value. Controlling means taking care that the control variables (the other variables) don't have an influence anymore on the association between x and y. A random experiment in a way also uses control variables; the subjects are selected randomly and the other variables manifest themselves randomly in the subjects.
Statistical control is different from experimental control. In statistical control, subjects with certain characteristics are grouped together. Observational studies in social science often form groups based on socio-economic status, education or income.
The association between two quantitative variables is shown in a scatter plot. Controlling this association for a categorical variable is done by comparing the means.
The association between two categorical variables is shown in a contingency table. Controlling this association for a third variable is done by showing each value of the third variable in a separate contingency table, called a partial table.
Usually the effect of a control variable isn't completely absent, it's just minimal.
A lurking variable is a variable that isn't measured, but that does influence the causal relationship. Sometimes researchers don't know about the existence of a variable.
Which types of multivariate relationships exist?
In multivariate relationships, the response variable y has multiple explanatory variables and control variables, written as x1, x2, etc.
In spurious associations, both the explanatory variable x1 and the response variable y depend on a third variable (x2). The association between x1 and y disappears when x2 is controlled. There is no causal relationship between x1 and y.
In chain relationships the explanatory variable (x1) causes a third variable (x2), that in turn causes the response variable (y). The third variable (x2) is also called the intervening variable or the mediator. Also in chain relationships the association disappears when x2 is controlled:
\[x_1\rightarrow x_2\rightarrow y\]
The difference between a spurious relationship and a chain relationship is the causal order. In a spurious relationship x2 precedes both x1 and y. In a chain relationship x2 intervenes between x1 and y.
In reality, response variables often have more than one cause. Then y is said to have multiple causes. Sometimes these causes are independent, but usually they are connected. That means that for instance x1 has a direct effect on y but also an indirect effect on y via x2.
In case of a suppressor variable, there seems to be no association between x1 and y until x2 is controlled; only then does an association appear. In that case x2 is a suppressor variable. This happens when for example x2 is positively correlated with y and negatively correlated with x1. So even when there seems to be no association between two variables, it's wise to control for other variables.
Statistical interaction happens between x1 and x2 and their effect on y when the actual effect of x1 on y changes for different values of x2. The explanatory variables, x1 and x2, are also called predictors.
Lots of structures are possible for multivariate associations. One of the possibilities is even an association that assumes the opposite direction (positive versus negative) when a variable is controlled, this is called Simpson's paradox.
Confounding happens when two explanatory variables both affect a response variable and are also associated with each other. Omitted variable bias is a risk when a confounding variable is overlooked. Finding confounding variables is a big challenge for social science.
What are the consequences of statistical control for inference?
When x2 is controlled for the x1y association, this may have consequences for inference. A certain value of x2 can shrink the sample size. The confidence interval becomes wider and the test statistics smaller. The chi squared test can result in a smaller value, caused by the smaller sample size.
When a categorical variable is controlled, separate contingency tables need to be constructed for the different categories. It is usual for an ordinal variable to require at least three or four tables.
Often the parameter values are measured for several values of the control variable. Instead of the usual confidence interval to analyze the difference between either proportions or means, a confidence interval can be calculated for the difference in parameters for several values of the control variables. The formula for measuring the effect of statistical control through a confidence interval is:
\[(Estimate_2-Estimate_1)\pm z\sqrt{(se_1)^2+(se_2)^2}\]
When 0 isn't within the interval, the parameter values are different. When the x1y association is roughly equal across the partial analyses, a single measure of the strength of the association, controlling for the control variable, can be reported. This is called a partial association.
What is multiple regression? – Chapter 11
What does a multiple regression model look like?
A multiple regression model has more than one explanatory variable and sometimes also one or more control variables: E(y) = α + β1x1 + β2x2. The explanatory variables are numbered: x1, x2, etc. When an explanatory variable is added, the equation is extended with β2x2. The parameters are α, β1 and β2. The y-axis is vertical, x1 is horizontal and x2 is perpendicular to x1. In this three-dimensional graph the multiple regression equation describes a flat surface, called a plane.
A partial regression equation describes only part of the possible observations, only those with a certain value.
In multiple regression a coefficient indicates the effect of an explanatory variable on a response variable, while controlling for the other variables. Bivariate regression completely ignores the other variables; multiple regression holds them constant. This is the basic difference between bivariate and multiple regression. The coefficient (like β1) of a predictor (like x1) gives the change in the mean of y when that predictor increases by one unit, controlling for the other variables (like x2). In that case, β1 is a partial regression coefficient. The parameter α is the mean of y when all explanatory variables are 0.
The multiple regression model has its limitations. An association doesn't automatically mean that there is a causal relationship, there may be other factors. Some researchers are more careful and call statistical control 'adjustment'. The regular multiple regression model assumes that there is no statistical interaction and that the slope β doesn't depend on which combination of explanatory variables is formed.
The multiple regression relationship in the population is estimated by the prediction equation: ŷ = a + b1x1 + b2x2 + … + bpxp, in which p is the number of explanatory variables.
Just like the bivariate model, the multiple regression model uses residuals to measure prediction errors. For a predicted response ŷ and a measured response y, the residual is the difference between them: y – ŷ. The SSE (Sum of Squared Errors/Residual Sum of Squares) is similar as for bivariate models: SSE = Σ (y – ŷ)2, the only difference is the fact that the estimate ŷ is shaped by multiple explanatory variables. Multivariate models also use the least squares line, with the smallest possible SSE (which indicates how good or bad ŷ is in estimating y).
To check for linearity, multiple regression is plotted in a scatterplot matrix, a mosaic with scatterplots of the data points of several pairs of variables. Another option is to mark the different pairs in a single scatterplot. Software can create a partial regression plot, also called added-variable plot. This graph compares the residuals of different pairs and shows the relationship between the response variable and the explanatory variable after removing the effects of the other predictors.
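A minimal sketch of a multiple regression fit by least squares on hypothetical data with two predictors, using numpy:

```python
# Sketch: least squares for a multiple regression with two predictors (hypothetical data).
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y  = np.array([3.1, 3.0, 5.2, 5.0, 7.3, 6.9])

X = np.column_stack([np.ones_like(x1), x1, x2])      # column of 1s gives the intercept a
coef, *_ = np.linalg.lstsq(X, y, rcond=None)         # minimizes SSE
y_hat = X @ coef
sse = np.sum((y - y_hat)**2)
tss = np.sum((y - y.mean())**2)
r_squared = (tss - sse) / tss                        # multiple coefficient of determination
print(np.round(coef, 3), round(r_squared, 3))
```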
How do you interpret the coefficient of determination for multiple regression?
For multiple regression, the sample multiple correlation, R, is the correlation between the observed and predicted y-values. R is between 0 and 1. When the correlation increases, so does the strength of the association between y and the explanatory variables. Its square, the multiple coefficient of determination, R2, measures the proportion of the variance in y that is explained by the predictive power of all explanatory variables. It has elements similar to the bivariate coefficient of determination:
- Rule 1: y is predicted no matter what xp is. Then the best prediction is the sample mean ȳ.
- Rule 2: y is predicted by xp. The prediction equation ŷ = a + b1x1 + b2x2 + … + bpxp predicts y.
- The multiple coefficient of determination is the proportional reduction in error: R2 = (TSS – SSE) / TSS, in which TSS = Σ (y – ȳ)2 and SSE = Σ (y – ŷ)2.
Software like SPSS shows the output in an ANOVA table. The TSS is listed behind Total, under Sum of Squares and the SSE behind Residual, under Sum of Squares.
Characteristics of R-squared are:
- R2 is between 0 and 1.
- When SSE = 0, then R2 = 1 and the predictions are perfect.
- When b1, b2, …, bp = 0 then R2 = 0.
- When R2 increases, the explanatory variables predict y better.
- R2 can't decrease when explanatory variables are added.
- R2 is at least as big as the r2-values for the separate bivariate models.
- R2 usually overestimates the population value, so software also offers an adjusted R2.
In case there are already a lot of strongly correlated explanatory variables, then R² changes little for adding another explanatory variable. This is called multicollinearity. Problems with multicollinearity are smaller for larger samples. Ideally the sample is at least ten times the size of the number of explanatory variables.
How do you predict the values of multiple regression coefficients?
Significance tests for multiple regression can either check whether the collective of explanatory variables is related to y, or check whether an individual explanatory variable significantly affects y. In a collective significance test H0 : β1 = β2 = … = βp = 0 and Ha : at least one βi ≠ 0. This test measures whether the multiple correlation in the population is 0 or something else. The F-distribution is used for this significance test, resulting in the test statistic F:
\[F=\frac{\frac{R^2}{p}}{\frac{(1-R^2)}{[n-(p+1)]}}\]
In this formula p is the number of predictors (explanatory variables). The F-distribution only has positive values, is skewed to the right, and has a mean of approximately 1. The bigger R², the bigger F and the bigger the evidence against H0.
The F-distribution depends on two kinds of degrees of freedom: df1 = p (the number of predictors) and df2 = n – (p + 1). SPSS indicates F separately in the ANOVA table and P under Sig. (in R under p-value, in Stata under Prob > F and in SAS under Pr > F).
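A small sketch of the collective F test computed directly from R², with hypothetical values for R², n and p:

```python
# Sketch: the collective F test computed from R-squared (hypothetical values).
from scipy.stats import f

r_squared, n, p = 0.34, 100, 3                  # hypothetical R^2, sample size, number of predictors
df1, df2 = p, n - (p + 1)
F = (r_squared / p) / ((1 - r_squared) / df2)
p_value = f.sf(F, df1, df2)                     # right-tail probability of the F-distribution
print(round(F, 2), round(p_value, 5))
```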
A significance test whether an individual explanatory variable (xi) has a partial effect on y, tests whether H0 : β i = 0 or Ha : βi ≠ 0. The confidence interval for βi is bi ± t(se) in which t = bi / se. In case of multicollinearity the separate P-values may not indicate correlations, while a collective significance test would clearly indicate a correlation.
For controlled explanatory variables, the conditional standard deviation is estimated by:
\[s=\sqrt{\frac{\sum{(y-\hat{y})^2}}{n-(p+1)}}=\sqrt{\frac{SSE}{df}}\]
Software also calculates the conditional variance, called the error mean square (MSE) or residual mean square.
An alternative calculation for F uses the mean squares from the ANOVA table in SPSS. Then F = regression mean square / MSE in which regression mean square = regression sum of squares (in SPSS) / df1.
The t-distribution and the F-distribution are related, but F lacks information about the direction of an association and F is not appropriate for one-sided alternative hypotheses.
How does a statistical model represent interaction effects?
Statistical interaction often happens in multiple regression: the effect of x1 on y changes for different values of x2. A model with a cross-product term shows this interaction: E(y) = α + β1x1 + β2x2 + β3x1x2. A significance test with null hypothesis H0 : β3 = 0 shows whether there is interaction. When there is little interaction, the cross-product term is better left out. When there is much interaction, it no longer makes sense to do separate significance tests for the other explanatory variables.
Coefficients often have limited use because they only indicate the effect of a variable when the other variables are constant. Coefficients become more useful by centering them around 0 by subtracting the mean. It is indicated by the symbol C:
\[E(y)=\alpha + \beta_1 x_1^C + \beta_2 x_2^C + \beta_3 x_1^C x_2^C = \alpha + \beta_1(x_1-\mu_{x_1})+\beta_2(x_2-\mu_{x_2})+\beta_3(x_1-\mu_{x_1})(x_2-\mu_{x_2})\]
Now the coefficient of x1 (so β1) shows the effect of x1 when x2 is at its mean. These effects are similar to the effects in a model without interaction. The advantages of centering are that the estimates of x1 and x2 give more information and that the standard errors are similar to those of a model without interaction.
How do you compare possible regression models?
Reduced models (showing only some variables) can be better than complete models (showing all variables). For a complete model E(y) = α + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3 + β6x2x3 , the reduced version is: E(y) = α + β1x1 + β2x2 + β3x3. The null hypothesis says that the models are identical: H0 : β4 = β5 = β6 = 0.
A comparison method is to subtract the complete model SSE (SSEc) from the reduced model SSE (SSEr). Because the reduced model is more limited, its SSE will always be bigger and be a less accurate estimate of reality. Another comparison method subtracts the different R2-values. The equations are:
\[F=\frac{\frac{(SSE_r-SSE_c)}{df_1}}{\frac{SSE_c}{df_2}}=\frac{\frac{(R_c^2-R^2_r)}{df_1}}{\frac{(1-R_c^2)}{df_2}}\]
df1 is the number of extra terms in the complete model and df2 is the error degrees of freedom of the complete model. A big difference in SSE or in R² means a bigger F and a smaller P, so more evidence against H0.
How do you calculate the partial correlation?
The partial correlation is the strength of the association between y and the explanatory variable x1 while controlling for x2 :
\[r_{yx_1*x_2}=\frac{r_{yx_1}-r_{yx_2}r_{x_1x_2}}{\sqrt{(1-r_{yx_2}^2)(1-r_{x_1x_2}^2)}}\]
In the partial correlation ryx1.x2 , the variable on the right side of the dot is the control variable. A first order partial correlation has one control variable, a second order partial correlation has two. The characteristics are similar to regular correlations; the value is between -1 and 1 and the bigger it is, the stronger the association.
The partial correlation also has a squared version:
\[r_{yx_2*x_1}^2=\frac{R^2-r_{yx_1}^2}{1-r_{yx_1}^2}=\frac{\text{Additional proportion explained by }x_2}{\text{Proportion not explained by }x_1}\]
The squared partial correlation r²yx2.x1 is the proportion of the variance in y not explained by x1 that is explained by x2. The variance in y consists of a part explained by x1, a part explained by x2, and a part that is not explained by these variables. The combination of the parts explained by x1 and x2 is R2. Also when more variables are added, R2 is the part of the variance in y that is explained.
How do you compare the coefficients of variables with different units of measurement by using standardized regression coefficients?
The standardized regression coefficient (β*1, β*2, etc.) is the change in the mean of y, measured in standard deviations of y, for an increase of one standard deviation in the explanatory variable, controlling for the other explanatory variables. This makes it possible to compare whether an increase in x1 has a bigger effect on y than an increase in x2. The standardized regression coefficient is estimated by standardizing the regular coefficients:
\[b_1^*=b_1(\frac{s_{x_1}}{s_y}), \quad b_2^*=b_2(\frac{s_{x_2}}{s_y}), ...\]
In this, sy is the sample standard deviation of y and sx1 is the sample standard deviation of an explanatory variable. In SPSS and other software, the standardized regression coefficients are sometimes called BETA (beta weights). Just like the correlation, they indicate the strength of an association, but in a comparative way. When the value exceeds 1, the explanatory variables are highly correlated.
For a variable y the zy is the standardized version; the version expressed in the number of standard deviations. When zy = (y – ȳ) / sy, then its estimate is: ẑy = (ŷ – ȳ) / sy. The prediction equation estimates how far an observation falls from the mean, measured in standard deviations.
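A minimal sketch of the standardization of coefficients, with hypothetical values for the coefficients and standard deviations:

```python
# Sketch: turning unstandardized coefficients into standardized ones (hypothetical values).
import numpy as np

b = np.array([2.5, -0.8])          # hypothetical unstandardized coefficients b1, b2
s_x = np.array([1.2, 4.0])         # sample standard deviations of x1 and x2
s_y = 6.0                          # sample standard deviation of y
b_star = b * (s_x / s_y)           # b*_i = b_i * (s_xi / s_y)
print(np.round(b_star, 3))
```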
What is ANOVA? – Chapter 12
How do dummy variables replace categories?
For analyzing categorical variables without assigning a ranking, dummy variables are an option. This means that artificial indicator variables are created from the observations:
- z1 = 1 and z2 = 0 : observations of category 1 (men)
- z1 = 0 and z2 = 1 : observations of category 2 (women)
- z1 = 0 and z2 = 0 : observations of category 3 (transgender and other identities)
The model is: E(y) = α + β1z1 + β2z2. The means can be derived from the model: μ1 = α + β1 and μ2 = α + β2 and μ3 = α. Three categories only require two dummy variables, because whatever remains falls in category 3.
A significance test using the F-distribution tests whether the means are the same. The null hypothesis H0 : μ1 = μ2 = μ3 is the same as H0 : β1 = β2 = 0. A large F means a small P and strong evidence against the null hypothesis.
The F-test is robust against small violations of normality and differences in the standard deviations. However, it can't handle very skewed data. This is why randomization is important.
How do you make multiple comparisons of means?
A small P doesn't say which means differ or how much. Confidence intervals give more information. For every mean a confidence interval can be constructed, or for the difference between two means. An estimate of the difference in population means is:
\[(\bar{y_i}-\bar{y_j})\pm ts\sqrt{\frac{1}{n_i}+\frac{1}{n_j}}\]
The degrees of freedom of the t-score are df = N – g, in which g is the number of categories and N is the combined sample size (n1 + n2 + … + ng). When the confidence interval doesn't contain 0, this is proof of difference between the means.
In case of lots of groups with equal population means, it might happen that a confidence interval finds a difference anyway, due to the increase in errors that comes with the increase in the number of comparisons. Multiple comparison methods control the probability that all intervals in a set of comparisons contain the true differences. The probability that at least one interval is in error is the multiple comparison error rate. One such method is the Bonferroni method, which divides the desired error rate by the number of comparisons (5% / 4 comparisons = 1.25% per comparison). Another option is Tukey's method, which can be calculated with software and uses the so-called Studentized range, a special kind of distribution. The advantage of Tukey's method is that it gives narrower confidence intervals than the Bonferroni method.
What is one-way ANOVA?
Analysis of variance (ANOVA) is an inferential method to compare the means of multiple groups. This is an independence test between a quantitative response variable and a categorical explanatory variable. The categorical explanatory variables are called factors in ANOVA. The test is basically an F-test. The assumptions are the same: normal distribution, equal standard deviations for the groups and independent random samples. The null hypothesis is H0 : μ1 = μ2 = … = μg and the alternative hypothesis is Ha : at least two means differ.
The F-test uses two measures of variance. The between-groups estimate is the variability between each sample mean ȳi and the overall mean ȳ. The within-groups estimate is the variability within each group; within ȳ1, ȳ2, etc. This is an estimate of the variance σ2. Generally, the bigger the variability between the sample means and the smaller the variability within the groups, the more evidence that the population means are unequal. The F statistic is: F = between-groups estimate / within-groups estimate. When F increases, P decreases.
In an ANOVA table the mean squares (MS) are the between-groups estimate and the within-groups estimate, these are estimates of the population variance σ2. The between-groups estimate is the sum of squares between the groups (the regression SS) divided by df1. The within-groups estimate is the sum of squares within the groups (the remaining SS, or SSE) divided by df2. Together the SS between the groups and the SSE are the TSS; total sum of squares.
The degrees of freedom of the within-groups estimate are: df2 = N (total sample size) – g (number of groups). The estimate of variance by the within-groups sum of squares is:
\[s^2=\frac{\text{Within-groups sum of squares}}{df}=\frac{\text{Within-groups}SS}{N-g}=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2+...+(n_g-1)s_g^2}{N-g}\]
The degrees of freedom of the between-groups estimate are: df1 = g – 1. The variance by the between-groups sum of squares is:
\[\text{Between-groups estimate}=\frac{\sum_i{n_i(\bar{y_i}-\bar{y})^2}}{g-1}=\frac{n_1(\bar{y_1}-\bar{y})^2+...+n_g(\bar{y_g}-\bar{y})^2}{g-1}\]
The larger this value, the more the sample means differ from each other and the stronger the evidence against the null hypothesis.
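A minimal sketch (hypothetical samples) that computes the F statistic from the between-groups and within-groups estimates above and cross-checks it with SciPy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical samples for g = 3 groups
groups = [np.array([5.0, 6.2, 7.1, 6.6]),
          np.array([4.1, 3.9, 5.0, 4.4]),
          np.array([6.8, 7.5, 7.0, 8.1])]

N = sum(len(y) for y in groups)
g = len(groups)
grand_mean = np.concatenate(groups).mean()

# Between-groups and within-groups sums of squares, as in the formulas above
ss_between = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
ss_within = sum(((y - y.mean()) ** 2).sum() for y in groups)
F = (ss_between / (g - 1)) / (ss_within / (N - g))
p = stats.f.sf(F, g - 1, N - g)

# Cross-check with SciPy's built-in one-way ANOVA
print(F, p, stats.f_oneway(*groups))
```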
For a distribution very different from the normal distribution, the nonparametric Kruskal-Wallis test is an option; it ranks the data and does not require normality.
What is two-way ANOVA?
One-way ANOVA works for a quantitative dependent variable and the categories of a single explanatory variable. Two-way ANOVA works for multiple categorical explanatory variables. Each factor has a null hypothesis to test the main effect of that individual factor on the response variable, while controlling for the other variable. The F statistic for a main effect is the factor's MS divided by the residual MS. The MS is calculated by dividing the sum of squares by the degrees of freedom. Because two-way ANOVA is complex, software is used that shows the MS and the degrees of freedom in an ANOVA table.
ANOVA can be done by creating dummy variables. For instance in research about the groceries spendings of vegetarians, taking into account how someone identifies:
v1 = 1 if the subject is vegetarian, 0 if the subject isn't
v2 = 1 if the subject is vegan, 0 if the subject isn't
If someone is neither vegetarian nor vegan, they fall in the remaining category (meat eaters).
k = 1 if the subject identifies as budget-minded, 0 if the subject doesn't
Then the model is: E(y) = α + β1v1 + β2v2 + β3k. The prediction equation can be deduced. A confidence interval indicates the difference between the effects.
In reality, two-way ANOVA needs to be checked for interaction effects first, using an expanded model: E(y) = α + β1v1 + β2v2 + β3k + β4(v1 × k) + β5(v2 × k).
The sum of squares of one of the (dummy) variables is called the partial sum of squares or Type III sum of squares. This is the variability in y that is explained by a certain variable when the other aspects are already in the model.
ANOVA with multiple factors is factorial ANOVA. The advantage of factorial ANOVA and two-way ANOVA compared to one-way ANOVA is that it's possible to study the interaction of effects.
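A minimal two-way ANOVA sketch with statsmodels (hypothetical data frame and column names); the * in the formula adds main effects plus the interaction (cross-product) terms, and Type III (partial) sums of squares are requested with sum-to-zero contrasts:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: spending y, diet (vegetarian/vegan/meat) and budget (yes/no)
df = pd.DataFrame({
    "y":      [52, 48, 61, 44, 39, 58, 55, 41, 47, 60, 43, 50],
    "diet":   ["veg", "vegan", "meat", "veg", "vegan", "meat"] * 2,
    "budget": ["yes"] * 6 + ["no"] * 6,
})

# The * operator adds main effects plus the interaction terms
model = smf.ols("y ~ C(diet, Sum) * C(budget, Sum)", data=df).fit()
print(sm.stats.anova_lm(model, typ=3))  # partial (Type III) sums of squares
```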
How does ANOVA with repeated measures work?
Within research, samples sometimes depend on each other, as with repeated measures at different moments in time on the same subjects. The subjects themselves then form a factor. This may result in three pairs of means (for instance before, during and after treatment), requiring multiple comparison methods. The Bonferroni method divides the margin of error over the several confidence intervals.
An assumption of ANOVA with repeated measures is sphericity. This means that the variances of the differences between all possible pairs of repeated measurements (levels of the within-subjects factor) are the same. If, in addition, the variances and correlations of the repeated measurements themselves are all equal, there is compound symmetry. Software tests for sphericity with Mauchly's test. If sphericity is lacking, software uses the Greenhouse-Geisser adjustment of the degrees of freedom to still allow an F-test.
The advantage of using the same subjects is that certain factors are constant, this is called blocking.
Factors with a selected number of outcomes are fixed effects. Random effects are the randomly happening output of factors, like the characteristics of random people that happen to become research subjects.
How does two-way ANOVA with repeated measures of a factor work?
In research with repeated measures, more fixed effects can be involved. An example of a within-subjects factor is time (before/during/after treatment), because it requires the same subjects. The subjects are crossed with the factor. Something else is a between-subjects factor, for example the kind of treatment, because it compares the experiences of different subjects. Then subjects are nested in the factor.
Due to these two kinds of factors, the SSE consists of two kinds of errors. To analyze every difference between two categories, a confidence interval is required. With the two kinds of errors, residuals can't be used. What can be used instead, are multiple one-way ANOVA F-tests with Bonferroni's method.
Multivariate analysis of variance (MANOVA) is a method that can handle multivariate responses and that makes fewer assumptions. The disadvantage of making fewer assumptions is that it has lower power.
A disadvantage of repeated measures in general is that it requires data from all subjects in all moments. A model that has both fixed effects and random effects is called a mixed model.
How does multiple regression with both quantitative and categorical predictors work? – Chapter 13
What do models with both quantitative and categorical predictors look like?
Multiple regression is also feasible for a combination of quantitative and categorical predictors. In a lot of research it makes sense to control for a quantitative variable. A quantitative control variable is called a covariate and it is studied using analysis of covariance (ANCOVA).
A graph helps to research the effect of quantitative predictor x on the response y, while controlling for the categorical predictor z. For two categories, z can be the dummy variable, else more dummy variables are required (like z1 and z2). The values of z can be 1 ('agree') or 0 ('don't agree'). If there is no interaction, the lines that fit the data best are parallel and the slopes are the same. It's even possible that the regression lines are exactly the same. But if they aren't parallel, there is interaction.
The predictor can be quantitative and the control variable categorical, but it can also be the other way around. Software compares the means. A regression model with three categories is: E(y) = α + βx + β1z1 + β2z2, in which β is the effect of x on y for all groups z. For every additional quantitative variable a βx term is added. For every additional categorical variable a dummy variable is added (or several, depending on the number of categories). Cross-product terms are added in case of interaction.
Which inferential methods are available for regression with quantitative and categorical predictors?
The first step to making predictions is testing whether a model needs to include interaction. An F-test compares a model with cross-product terms to a model without. For this the F-test uses the partial sum of squares; the variability in y that is explained by a certain variable when the other aspects are already accounted for. The null hypothesis says that the slopes of the cross-product terms are 0, the alternative hypothesis says that there is interaction.
Another F-test checks whether a complete or a reduced model is better. To compare a complete model (E(y) = α + βx + β1z1 + β2z2) with a reduced model (E(y) = α + βx), the null hypothesis is that the slopes β1 and β2 both are 0. The complete model consists of three parallel lines, the reduced model only has one line. When P is small, then there is much evidence against the null hypothesis and then the complete model fits the data significantly better. The multiple coefficient of determination R2 indicates how well the possible regression lines predict y and helps compare the complete with the reduced model.
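A minimal sketch (hypothetical data) of the F-test comparing a reduced model with only x to a complete model that also contains the dummy variables, using statsmodels:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: response y, quantitative x, categorical group with 3 levels
df = pd.DataFrame({
    "y": [10, 12, 14, 9, 11, 13, 15, 17, 19, 8, 10, 12],
    "x": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c", "a", "b", "c"],
})

reduced = smf.ols("y ~ x", data=df).fit()                 # one line
complete = smf.ols("y ~ x + C(group)", data=df).fit()     # parallel lines

# F-test of H0: the dummy coefficients are 0 (the reduced model is adequate)
print(sm.stats.anova_lm(reduced, complete))
```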
In what kind of case studies is multiple regression analysis required?
Case studies often start with the desire to research the effect of an explanatory variable on a response variable. Throughout the research, predictors are added, sometimes confounding predictors, sometimes mediating predictors.
How do you use adjusted means?
An adjusted mean or least squares mean is the mean of y for a group while controlling for the other variables in the model. The other variables are kept at a mean, so the value of the adjusted mean can be researched. When an outlier has too big of an influence on the mean, this outlier can be left out and the adjusted mean can be calculated.
The adjusted mean is indicated with an accent. The adjusted sample mean of group i is:
\[\bar{y_i'}\]
The coefficients equal the differences between the adjusted means. Due to the adjusted mean, the regression line of the sample mean shifts upward or downward. The Bonferroni method allows multiple comparisons of adjusted means using confidence intervals with a shared error rate.
Adjusted means are less appropriate if the means for x are very different. Using adjusted means only should be done if it makes sense that certain groups would be distributed in a certain way and if the linear shape is unchanged.
What does a linear mixed model look like?
Factors with a limited number of outcomes (like vegetarians, vegans and meat eaters) are fixed effects. Random effects on the other hand are factors of which the outcomes happen randomly (like the characteristics of research subjects). Linear mixed models contain explanatory terms with both fixed effects and random effects.
A regular regression model can express the equation per subject, for instance with the value xi1 of variable x for subject i: yi = α + β1xi1 + β2xi2 + … + βpxip + ϵi. The error term ϵ is the variability of the responses of subjects for certain values of the explanatory variables. The sample value of this is the residual for subject i. Because the error term is expected to be 0, it is removed from the equation of E(yi).
A linear mixed model can handle multiple correlated observations per subject: yij = α + β1xij1 + β2xij2 + … + βpxijp + si + ϵij. In this, yij is observation j (at a certain time) of subject i. For variable x1, observation j of subject i is written as xij1, and si is the random effect of subject i. A subject with a high positive si has relatively high responses for each j. The fixed effects are the parameters (β1, etc.).
The structure gives information about the character of the correlation in the model. When the correlations between all possible pairs of observations of the explanatory variables are equal, there is compound symmetry. When in longitudinal research the observations are more correlated around the start, it's an autoregressive structure. When assumptions about the pattern of correlation are best avoided, it's called unstructured. An intraclass correlation means that subjects within a group are alike. The random effects aren't just subjects, they can also be clusters of similar subjects.
The advantages of linear mixed models compared to repeated measures ANOVA are that they make fewer assumptions and that the consequences of missing data are less severe. When data are missing at random, bias need not occur. Linear mixed models can be extended in many ways, even for special kinds of correlation structures.
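A minimal sketch (hypothetical longitudinal data) of a linear mixed model with a fixed effect for time and a random intercept per subject, using statsmodels:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: repeated observations y per subject
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":    [0, 1, 2] * 4,
    "y":       [5.1, 5.9, 6.8, 4.2, 4.9, 5.3, 6.0, 6.7, 7.5, 3.8, 4.1, 4.6],
})

# Fixed effect for time, random intercept s_i per subject
model = smf.mixedlm("y ~ time", data=df, groups=df["subject"]).fit()
print(model.summary())
```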
How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14
What strategies are available when selecting a model?
Three basic rules for selecting variables to add to a model are:
- Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables
- Add enough variables for a good predictive power
- Keep the model simple
The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all candidate variables and repeatedly removes the least significant one until only significant variables remain. Forward selection starts from scratch, adding at each step the variable with the smallest P-value. Stepwise regression is a variation of forward selection that also removes variables that become redundant when new variables are added.
Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.
Several criteria are indications of a good model. To find a model with big power but without an overabundance of variables, the adjusted R2 is used:
\[R_{adj}^2=\frac{s_y^2-s^2}{s_y^2}=1-\frac{s^2}{s_y^2}\]
The adjusted R2 decreases when an unnecessary variable is added.
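A small helper (not from the book) that computes the adjusted R² from R², the sample size n and the number of predictors p; this is algebraically the same as 1 – s²/s_y² when s² uses n – p – 1 degrees of freedom and s_y² uses n – 1:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Adding an unnecessary variable raises p without raising R^2 much,
# so the adjusted value can go down:
print(adjusted_r2(0.50, n=30, p=3))   # ~0.442
print(adjusted_r2(0.505, n=30, p=4))  # ~0.426
```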
Cross-validation continuously checks whether the predicted values are as close as possible to the observed values. The result is the predicted residual sum of squares (PRESS):
\[PRESS=\sum{(y_i-\hat{y_{(i)}})^2}\]
If PRESS decreases, the predictions get better. However, this criterion assumes a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷi is as close as possible to E(yi). If AIC decreases, the predictions get better.
How can you tell when a statistical model doesn't fit?
Inference of parameters in a regression model has the following assumptions:
- The model fits the shape of the data
- The conditional distribution of y is normal
- The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)
- It's a random sample
Large violations of these assumptions can seriously distort the conclusions.
When y has a normal distribution, the residuals do too. A studentized residual is a standardized version: the residual divided by its standard error. It expresses how many standard errors a residual falls from 0, taking the variability of the sampling into account. A studentized residual exceeding about 3 in absolute value may indicate an outlier.
The randomization in longitudinal research may be limited when the observations within a certain time frame are strongly correlated. A scatterplot of the residuals for the entire time frame can check this. This kind of correlation has a bad influence on most statistics. In longitudinal research, often conducted within social science and in a relatively short time frame, a linear mixed model is used. However, when research involves time series and a longer time frame, then econometric methods are more appropriate.
Lots of statistics measure the effects of outliers. The residual measures how far y falls from the trend. The leverage (h) measures how far the explanatory variables fall from their means. Observations with a high residual and high leverage have a big influence.
DFBETA describes the effect of an observation on the estimates of the parameters. DFFIT and Cook's distance describe the effect on how a graph fits the data when a certain observation is omitted.
How do you detect multicollinearity and what are its consequences?
In case of many strongly correlated explanatory variables, R² increases only slightly when more variables are added. This is called multicollinearity. It inflates the variance of the estimated coefficients and thereby their standard errors, which leads to wider confidence intervals. The inflation is measured by the variance inflation factor (VIF), the multiplicative increase in variance that is caused by the correlation among the explanatory variables:
\[VIF=\frac{1}{(1-R_j^2)}\]
Indications of multicollinearity can also be visible without the VIF, for instance in unstable or implausible coefficients. Remedies include selecting only some of the variables, combining variables, or centering them. With factor analysis new, artificial variables are created from the existing variables to avoid correlation, but usually factor analysis isn't necessary.
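A minimal sketch (simulated predictors) of computing VIF values with statsmodels; x2 is constructed to be nearly identical to x1, so large VIFs are expected for both:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x2 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other predictors
for j, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, j))
```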
What are the characteristics of generalized linear models?
Generalized linear models (GLMs) form a broad class that includes regression models with a normal distribution, alternative models for continuous variables without a normal distribution, and models with discrete variables.
The outcome of a GLM is often binary or a count. For positively skewed continuous data, a GLM can use the gamma distribution.
A GLM has a link function; an equation that connects the mean of the response variable to the explanatory variables: g(μ) = α + β1x1 + β2x2 + … + βpxp. When the data can't be negative, the log link is used for loglinear models: log(μ) = α + β1x1 + β2x2 + … + βpxp. A logistic regression model uses the logit link: g(μ) = log[μ /(1-μ)]. This is useful when μ is between 0 and 1. Most simple is the identity link: g(μ) = μ.
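A minimal sketch (simulated count data) of fitting a GLM with statsmodels; the Poisson family uses the log link by default, and other families correspond to the other links mentioned above:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical count response and two predictors
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))
mu = np.exp(X @ np.array([0.5, 0.3, -0.2]))     # log link: log(mu) = a + b1*x1 + b2*x2
y = rng.poisson(mu)

# Poisson family uses the log link by default; Binomial would use the logit link
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params, model.aic)
```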
Because a GLM uses the maximum likelihood method, the data doesn't need to have a normal distribution. The maximum likelihood method uses weighted least squares, this method gives more weight to observations with a smaller variability.
A gamma distribution allows the standard deviation to change with the mean. This is called heteroscedasticity: the standard deviation increases when the mean increases. In that case the variance is φμ² and
\[\text{standard deviation}=\sqrt{\phi}\:\mu\]
φ is the scale parameter; together with the mean it determines the shape of the distribution.
What is polynomial regression?
When a graph is very nonlinear, for instance curvilinear, a polynomial regression function is used: E(y) = α + β1x + β2x², in which the highest power is called the degree. A polynomial regression function of degree two expresses a quadratic regression model, a parabola.
A cubic function is a polynomial function of degree three, but usually a function of degree two suffices. For a straight line the slope stays the same, but in a polynomial function it changes. When the coefficient of x² is positive, the curve is U-shaped; when it is negative, the curve has the shape of an inverted U. The highest or lowest point of the parabola lies at x = –β1 / (2β2).
In these kind of models R² is the proportional decrease of the error estimates by using a quadratic function instead of a linear function. A comparison of R² and r² indicates how much better of a fit the quadratic function is. The null hypothesis can be tested that a quadratic function doesn't add to the model: H0: β2 = 0.
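A minimal sketch (simulated curvilinear data) comparing the fit of a linear and a quadratic polynomial with NumPy; the difference between r² and R² shows how much the quadratic term adds:

```python
import numpy as np

# Hypothetical curvilinear data
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 3 + 1.5 * x - 0.12 * x**2 + rng.normal(scale=0.5, size=50)

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

linear = np.polyval(np.polyfit(x, y, 1), x)      # degree 1
quadratic = np.polyval(np.polyfit(x, y, 2), x)   # degree 2

# r^2 for the linear fit versus R^2 for the quadratic fit
print(r_squared(y, linear), r_squared(y, quadratic))
```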
Conclusions should be made carefully, sometimes other shapes are possible too. Parsimony should be the goal, models shouldn't have more parameters than necessary.
What do exponential regression and log transforms look like?
An exponential regression function is E(y) = αβ^x. It only takes positive values and either increases or decreases continuously. Taking logarithms gives a linear form: log(μ) = log α + (log β)x. In this model, β is the multiplicative change in the mean of y for an increase of 1 in x. When a relationship needs to be transformed into a linear one, log transforms can be used; they linearize the relationship.
What are robust variance and nonparametric regression?
Robust variance estimation adjusts regression inference so it can handle violations of assumptions. The method still uses the least squares line but does not assume constant variance when finding standard errors. Instead, the standard errors are adjusted to the variability in the sample data. This is called the sandwich estimate or robust standard error estimate. Software can calculate these standard errors so they can be compared to the regular standard errors; if they differ a lot, the assumptions are badly violated. Robust variance methods can also be applied to strongly correlated data such as clusters. Then generalized estimating equations (GEE) are used: estimating equations in the spirit of maximum likelihood, but without a full parametric probability distribution for the correlated responses.
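A minimal sketch (simulated heteroscedastic data) of robust (sandwich) standard errors with statsmodels; HC3 is one common robust estimator, chosen here as an assumption rather than prescribed by the book:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical heteroscedastic data: the spread of y grows with x
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)
df = pd.DataFrame({"x": x, "y": y})

ordinary = smf.ols("y ~ x", data=df).fit()                 # classical standard errors
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC3")     # sandwich (robust) standard errors

# A large difference between the two sets of standard errors signals violated assumptions
print(ordinary.bse)
print(robust.bse)
```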
A recently developed nonparametric method is generalized additive modelling. This is a generalization of the generalized linear model. Smoothing a curve exposes larger trends. Popular smoothers are LOESS and kernel.
What is logistic regression? – Chapter 15
What are the basics of logistic regression?
A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). It's also possible for logistic regression models to have ordinal or nominal response variables. The mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model often is too simple.
The model is instead built on the odds: P(y=1)/[1-P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit), gives the logistic regression model: logit[P(y=1)] = α + βx.
To find the outcome for a certain value of a predictor, the following formula is used:
\[P(y=1)=\frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\]
The e to a certain power is the antilog of that number.
To interpret the curve of a logistic graph, its slope can be examined: the curve is steepest where P(y=1) = ½, and there the slope equals β/4. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is:
\[\frac{P(y=1)}{1-P(y=1)}=e^{\alpha+\beta x}=e^{\alpha}(e^{\beta})^x\]
With this the odds ratio can be calculated.
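A minimal sketch (hypothetical fitted coefficients) showing how the probability and the odds ratio follow from α and β:

```python
import math

# Hypothetical fitted coefficients: logit[P(y=1)] = alpha + beta * x
alpha, beta = -3.0, 0.5

def prob(x: float) -> float:
    """P(y=1) = e^(a+bx) / (1 + e^(a+bx))."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

print(prob(4), prob(5))          # probabilities at x = 4 and x = 5
print(math.exp(beta))            # odds ratio: the odds multiply by e^beta per unit of x
print((prob(5) / (1 - prob(5))) / (prob(4) / (1 - prob(4))))  # same value, via the odds
```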
There are two ways to present the data. With ungrouped data, each row contains the observation for a single subject. With grouped data, each row contains the counts for a combination of predictor values, for instance the number of subjects who agreed followed by the total number of subjects.
An alternative of the logit is the probit. This link assumes a hidden, underlying continuous variable y* that is 1 above a certain value T (threshold) and that is 0 below T. Because y* is hidden, it's called a latent variable. However, it can be used to make a probit model: probit[P(y=1)] = α + βx.
Logistic regression with repeated measures and random effects is analyzed with a generalized linear mixed model: logit[P(yij = 1)] = α + βxij + si.
What does multiple logistic regression look like?
The multiple logistic regression model is: logit[P(y = 1)] = α + β1x1 + … + βpxp. The further βi is from 0, the stronger the effect of xi is and the further the odds ratio is from 1. If needed, cross-product terms and dummy variables can be added.
Research results are often expressed in terms of odds instead of the log odds scale, because the odds are easier to interpret. The antilog of a coefficient, e^βi, is the multiplicative effect on the odds. To present the results even more clearly, they can be expressed as probabilities, for instance the probability of a certain outcome while controlling for the other variables. The estimated probability is:
\[P(y=1)=\frac{e^{\alpha+\beta_1x_1+...+\beta_px_p}}{1+e^{\alpha+\beta_1x_1+...+\beta_px_p}}\]
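A minimal fitting sketch (simulated data and hypothetical variable names) of multiple logistic regression with statsmodels, reporting coefficients, odds ratios and an estimated probability:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical binary response with two predictors
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
logit_true = -0.5 + 1.2 * df["x1"] - 0.8 * df["x2"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))

fit = smf.logit("y ~ x1 + x2", data=df).fit()
print(fit.params)                   # alpha, beta1, beta2 on the log odds scale
print(np.exp(fit.params))           # odds ratios
print(fit.predict(pd.DataFrame({"x1": [1.0], "x2": [0.0]})))  # estimated P(y=1)
```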
The standardized estimate allows to compare the effects of explanatory variables using different units of measurement:
\[\hat{\beta_j*} = \hat{\beta_j}s_{x_j}\]
The sxj is the standard deviation of the variable xj.
To help prevent selection bias in observational studies, the propensity score is used: the probability that a subject ends up in a certain group. By matching or adjusting on this probability, researchers can control for the kind of people that find themselves in a certain situation. However, this only handles observed confounding variables; variables unknown to the researchers remain hidden.
How does inference with logistic regression models work?
A logistic regression model assumes the binomial distribution and is shaped like this: logit[P(y = 1)] = α + β1x1 + … + βpxp. The general null hypothesis is H0 : β1 = … = βp = 0 and is tested by the likelihood-ratio test. This inferential test compares a complete model to a reduced model. The likelihood function (ℓ) is the probability that the observed data result from the parameter values. Here ℓ0 is the maximum of the likelihood function when the null hypothesis is true and ℓ1 is the maximum without that restriction. The test statistic is: -2 log(ℓ0/ℓ1) = (-2 log ℓ0) – (-2 log ℓ1).
Alternative test statistics are z and z squared (called the Wald statistic):
\[z=\frac{\hat{\beta}}{se}\]
But for small samples or extreme effects the likelihood ratio test works better.
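A minimal sketch (simulated data) of the likelihood-ratio test comparing a reduced model to a complete model, using the log-likelihoods reported by statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical data: binary y with predictors x1 and x2
rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * df["x1"]))))

null_fit = smf.logit("y ~ 1", data=df).fit(disp=0)          # reduced model (H0)
full_fit = smf.logit("y ~ x1 + x2", data=df).fit(disp=0)    # complete model

# Likelihood-ratio statistic: (-2 log l0) - (-2 log l1), chi-squared with 2 df
lr = -2 * (null_fit.llf - full_fit.llf)
p_value = stats.chi2.sf(lr, df=2)
print(lr, p_value)
```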
How is logistic regression performed for ordinal variables?
Ordinal variables assume a certain order in the categories. The cumulative probability is the probability that a response falls in a certain category j or below: P(y ≤ j). Each cumulative probability can be transformed to odds, for instance the odds that a response falls in category j or below: P(y ≤ j) / P(y > j).
Cumulative logits are popular; these divide the responses into a binary scale: logit[P(y ≤ j)] = αj – βx, in which j = 1, 2, …, c – 1 and c is the number of categories. Beware, some software puts + instead of – in front of the slope.
A proportional odds model is a cumulative logit model in which the slope is the same for every cumulative probability, so β doesn't vary. The slope indicates the steepness of the graph, so in a proportional odds model the lines of the different categories are equally steep.
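A minimal sketch (hypothetical fitted intercepts and slope) showing how a proportional odds model turns cumulative probabilities into category probabilities:

```python
import math

# Hypothetical fitted proportional odds model with c = 4 categories:
# logit[P(y <= j)] = alpha_j - beta * x, for j = 1, 2, 3
alphas = [-1.0, 0.5, 2.0]
beta = 0.8

def cumulative_probs(x: float) -> list[float]:
    cum = [math.exp(a - beta * x) / (1 + math.exp(a - beta * x)) for a in alphas]
    return cum + [1.0]                       # P(y <= c) = 1

def category_probs(x: float) -> list[float]:
    cum = cumulative_probs(x)
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

print(category_probs(0.0))   # probabilities of the four categories at x = 0
print(category_probs(2.0))   # larger x shifts probability toward higher categories
```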
Cumulative logit models can have multiple explanatory variables. H0 : β = 0 tests whether the variables are independent. An independence test for logistic regression with ordinal variables uses the order in the data and therefore gives a more informative result than tests that ignore the order, like the chi-squared test. A confidence interval is also an option.
An advantage of the cumulative logit model is invariance towards the scale of responses. If a researcher uses a different number of categories, he/she will still reach the same conclusions.
What do logistic models with nominal responses look like?
For nominal variables (without order) a model exists that specifies the probabilities that a certain outcome happens instead of another outcome. This model calculates these probabilities simultaneously and it presumes independent observations. This is the baseline-category logit model:
\[log[\frac{P(y=1)}{P(y=3)}]=\alpha_1+\beta_1x\]
and
\[log[\frac{P(y=2)}{P(y=3)}]=\alpha_2+\beta_2x\]
It doesn't matter which category is chosen as the baseline. Inference works similarly to logistic regression, but to test the effect of an explanatory variable, all parameters of the comparisons are involved. The likelihood ratio test examines whether the model fits the data better with or without a certain variable.
How do loglinear models describe the associations between categorical variables?
Most models study the effect of an explanatory variable on a response variable. Loglinear models are different, they study the associations between (categorical) variables, for instance in a contingency table. These models are more alike correlations.
A loglinear model assumes the Poisson distribution, which describes non-negative discrete variables such as counts; with a fixed total count it is closely related to the multinomial distribution.
A contingency table can show multiple categorical response variables. A conditional association is an association between two variables, while a third variable is controlled for. When variables are conditionally independent, they are independent of each category of the third variable. A hierarchy of dependence is the following (accompanied by symbols for the response variables x, y and z):
- All three are conditionally independent (x, y, z)
- Two pairs are conditionally independent (xy, z)
- One pair is conditionally independent (xy, yz)
- There is no conditional independence, but there is a homogeneous association, meaning the association for each possible pair is the same for each category of the third variable (xy, yz, xz)
- All pairs are associated and there is interaction, this is a saturated model (xyz)
Loglinear models can also be interpreted using odds ratios.
How do goodness-of-fit tests work for contingency tables?
A goodness-of-fit test investigates the null hypothesis that a model really fits a certain population. It measures whether the estimated frequencies fe are close to the observed frequencies fo. Larger test statistics give stronger evidence that the model is incorrect. This is measured by the Pearson chi-squared test:
\[X^2=\sum{\frac{(f_o-f_e)^2}{f_e}}\]
Another version is the likelihood ratio chi-squared test:
\[G^2=2\sum{f_o\,log(\frac{f_o}{f_e})}\]
When the model fits reality perfectly, both X2 and G2 are 0. The likelihood ratio test is better in case of large samples. The Pearson test is better for expected frequencies that average between 1 and 10. Both tests only work well for contingency tables with categorical predictors and relatively big counts.
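A minimal sketch (hypothetical observed and expected frequencies, with an assumed number of degrees of freedom) of computing X² and G²:

```python
import numpy as np
from scipy import stats

# Hypothetical observed and model-estimated (expected) cell frequencies
f_o = np.array([30, 45, 25, 20, 50, 30])
f_e = np.array([27.5, 48.0, 24.5, 22.5, 47.0, 30.5])

X2 = np.sum((f_o - f_e) ** 2 / f_e)          # Pearson statistic
G2 = 2 * np.sum(f_o * np.log(f_o / f_e))     # likelihood-ratio statistic

df = 2                                        # hypothetical residual degrees of freedom
print(X2, stats.chi2.sf(X2, df))
print(G2, stats.chi2.sf(G2, df))
```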
To see what exactly doesn't fit, the standardized residuals can be calculated per cell: (fo – fe) / (standard error of (fo – fe)). When a standardized residual exceeds 3, the model doesn't fit the data for that cell.
Goodness-of-fit tests and standardized residuals can also be applied to loglinear models.
To see if a complete or a reduced model fits better, the likelihood ratios can be compared.
What advanced methodologies are there? - Chapter 16
This chapter gives a short introduction to some advanced statistical methods, focusing on their purpose, the type of results that can occur, and their interpretation.
- Multiple imputation deals with missing data
- Multilevel models handle hierarchically structured observations
- Event history models deal with how long it takes until an event occurs
- Factor analysis is a method to reduce a high number of possibly highly correlated variables to a smaller number of statistically uncorrelated variables
- Structural equation models combine elements of both path analysis and factor analysis
- Markov chain models provide a simple dependence structure for sequences of observations
- The Bayesian approach applies probability distributions to parameters and variables
How does multiple imputation work?
An issue in many data analyses is that some data are incomplete: there are missing data. For statistical analyses, some software deletes all subjects for whom data is missing on at least one variable. This is called listwise deletion. Other software deletes a subject only from the analyses that need the missing observation. This is called pairwise deletion. Both approaches can cause problems.
Are data missing at random?
Missing data are missing completely at random (MCAR) if the probability that an observation is missing is independent of the observation's value as well as the values of the other variables in the data set. Data are missing at random (MAR) if, given the observed data, the probability of missingness does not depend on the values of the missing data themselves. In practice it is not possible to test whether MAR or MCAR is satisfied, as the values of the missing data are unknown. Often missing data are neither MAR nor MCAR; then more complex analyses are needed that require a joint probability distribution for the data and the missingness.
A better approach to dealing with missing data uses multiple imputation. An imputation fills in a plausible set of values for the missing data. 'Multiple' means that this process is repeated several times; the results are then combined to estimate what we would have found without missing data. Multiple imputation produces more efficient results than analyses using listwise deletion. Also, the results based on multiple imputation are not biased when data are missing at random.
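As an illustration only (the book does not prescribe a specific tool), one way to draw several imputed data sets in Python is scikit-learn's IterativeImputer with posterior sampling:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data frame with some missing values
df = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "x2": [2.1, np.nan, 6.2, 8.1, 9.9]})

# Draw several imputed data sets; analyses are run on each and the results combined
imputations = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputations.append(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))

print(imputations[0])
```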
When there's much missingness, analyses should be made with caution, as in practice we cannot know whether the missing data are missing at random.
What are Multilevel (hierarchical) Models?
According to hierarchical models observations have a nested nature: Units at one level are contained within units of another level. Models with a hierarchical structure are called multilevel models. For example, performance on exams is contained within a student, where students in turn then are contained within a school. Observations of students within a school might tend to be more alike than observations of students in different schools. Multilevel models have terms for the different levels of units. There's often a large number of terms, so the model treats terms for the sampled units on which there are multiple observations as random effects, rather than fixed effects.
What are Event History Models?
Some studies have the objective of observing how long it takes before a certain event occurs. Like in ordinary regression, models for the time to an event include effects of explanatory variables. For example, a model for the length of time before rearrest might use predictors like number of previous arrests, employment, marital status, etc. This is called event history analysis.
In event history analysis two complicating factors occur that are not an issue in ordinary regression modeling. First, for some subjects the event does not occur before the end of the observation period of the study. For example, a study on retirement age may use a sample of adults aged at least 65. If a 68-year-old person has not yet retired, we only know that the response variable (retirement age) is at least 68. Such an observation is said to be censored. Ignoring censored data can lead to severe bias in parameter estimation.
Second, some explanatory values for predicting the time to the event may change value over time. For example, when observing whether a subject has been rearrested, the value of explanatory variables such as whether the subject is working or living with a partner can differ over time. This type of variable is called a time-dependent covariate.
What is Path Analysis?
Path analysis uses regression models to represent theories of causal relationships among a set of variables. The primary advantage of path analysis is that the researcher must explicitly specify the presumed causal relationships among the variables. This can help contribute to sensible theories of relationships. Theoretical explanations of cause-effect relationships often model a system of relationships in which some variables, caused by other variables, in turn affect yet others. Path analysis uses all necessary regression models to include all proposed relationships in the theoretical explanation. Path coefficients show the direction and relative size of effects of explanatory variables, controlling for other variables in the sequence.
Most path models have intervening variables. These variables depend on some variables and are, in turn, causes of other variables. Variables can have an indirect effect, through an intervening variable, or a direct effect. For example, child intelligence can have a direct effect on the child's educational attainment. But it can also have an indirect effect by affecting the child's achievement motivation, which in turn affects the child's educational attainment. The regression analyses that are part of path analysis can reveal whether significant evidence exists of the various effects. When a nonsignificant path is found, it can be erased and the coefficients of the remaining paths re-estimated.
The basic steps in path analysis are:
- Set up a theory to be tested, drawing the path diagram without the path coefficients.
- Conduct the necessary regression modeling to estimate the path coefficients and the residual coefficients.
- Evaluate the model, checking with the sample results. Then reformulate, erasing nonsignificant paths.
What is Factor Analysis?
Factor analysis is used for a wide array of purposes. Such as:
- Revealing patterns of interrelationships among variables.
- Detecting clusters of variables each of which are intercorrelated and hence somewhat redundant.
- Reducing a large number of variables to a smaller number of statistically uncorrelated variables: the factors.
The model of factor analysis expresses the expected values of observable variables as linear functions of unobservable variables, called factors. In statistics these are called latent variables. Factors in factor analysis are summaries of the observed variables. The correlation of a variable with a factor is the loading of the variable on that factor. The sum of squared loadings for a variable is called its communality. An exploratory form of factor analysis looks for the appropriate amount of factors guided by eigenvalues. Confirmatory analysis preselects a particular value for the number of factors. Results are more believable when used in a confirmatory mode, as this forces researchers to think more carefully about reasonable factor structure before performing the analysis.
What are Structural Equation Models?
The covariance structure model combines path analysis and factor analysis to attempt to explain the variances and correlations among the observed variables. Covariance structure models have two components. First the measurement model, which resembles a factor analysis, then the structural equation model which resembles a path analysis.
The measurement model specifies how observed variables relate to a set of latent variables. This resembles the factor analysis, but has a more highly specified structure. The structural equation model uses regression models to specify causal relationships among the latent variables. One or more latent variables are identified as response variables, the rest as explanatory variables. This approach allows for the fitting of models with two-way causation, in which latent variables may be regressed on each other.
Covariance structure models have the features of flexibility and generality. A regression parameter can be fixed, by being forced to take a fixed value, such as 0. It can be forced to equal another parameter in the system, then it is called a constrained parameter. Or it can be completely unknown, a free parameter. Good aspects of the covariance structure models are that the models force researchers to provide theoretical underpinnings to their analyses and inferential methods check the fit of the theoretical model to the data. However, the model is complex and may require a large sample size to obtain good estimates of effects.
What are Markov Chains?
Sometimes researchers are interested in sequences of response observations (usually over time). A sequence of observations that varies randomly is called a stochastic process. The possible values at each step are the states of the process. One of the simplest stochastic processes is the Markov chain. This is appropriate if, given the behavior of the process at times t, t-1, t-2, ..., 1, the probability distribution of the outcome at time t+1 depends only on the outcome at time t.
A common probability is the transition probability, for this the Markov chain studies questions such as:
- What is the probability of moving from one state to another, within a particular amount of time?
- How long, on average, does it take to move from one state to another?
- Are the transition probabilities between each pair of states constant over time? if so, the process has stationary transition probabilities.
- Is the process a Markov chain, or is the dependence structure more complex?
Usually the Markov chain model is too simplistic by itself to have practical use, but it often forms a component of a more complex and realistic model.
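A minimal sketch (hypothetical transition probabilities) of the kind of questions listed above: a multi-step transition probability and, assuming stationary transition probabilities, the long-run distribution:

```python
import numpy as np

# Hypothetical transition probabilities between three states (rows sum to 1)
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

# Probability of moving from state 0 to state 2 within 4 steps
P4 = np.linalg.matrix_power(P, 4)
print(P4[0, 2])

# Long-run (stationary) distribution, if the transition probabilities are constant over time
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(stationary / stationary.sum())
```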
What is the Bayesian approach?
The Bayesian approach applies probability distributions to parameters as well as data. The prior distribution describes knowledge about the parameters before seeing the data. The Bayesian method generates a posterior distribution, which combines the prior information with the information in the observed data.