How do linear regression and correlation work? – Chapter 9

9.1 What are linear associations?
9.2 What is the least squares prediction equation?
9.3 What is a linear regression model?
9.4 How does the correlation measure the association of a linear function?
9.5 How do you predict the slope and the correlation?
9.6 What happens when the assumptions of a linear model are violated?

9.1 What are linear associations?

Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable using the explanatory variable.

The response variable is denoted as y and the explanatory variable as x. A linear function means that the data points in a graph follow a straight line. A linear function has the form y = α + β (x), in which alpha (α) is the y-intercept and beta (β) is the slope.

The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

The y-intercept is the value of y when x = 0. In that case β(x) equals 0 and only y = α remains. The y-intercept is the point where the line crosses the y-axis.

The slope (β) indicates the change in y for a change of 1 in x. So the slope is an indication of how steep the line is. The larger the absolute value of β, the steeper the line.

When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
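The roles of the intercept and the slope can be checked with a few lines of Python. This is a minimal sketch; the function name `linear` and the values α = 2 and β = 0.5 are made up for illustration and do not come from the source.

```python
def linear(x, alpha, beta):
    """Return y = alpha + beta * x."""
    return alpha + beta * x

alpha, beta = 2.0, 0.5  # hypothetical y-intercept and slope

# At x = 0 only the intercept remains.
y_at_zero = linear(0, alpha, beta)

# Increasing x by 1 changes y by exactly the slope.
change_per_unit = linear(4, alpha, beta) - linear(3, alpha, beta)

# With a slope of 0 the line is horizontal: y does not depend on x.
flat_values = {linear(x, alpha, 0.0) for x in range(5)}

print(y_at_zero, change_per_unit, flat_values)
```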

A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.

9.2 What is the least squares prediction equation?

In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each observation is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check of whether a linear function makes sense. If the data is U-shaped, a straight line doesn't make sense.

The variable y is estimated by ŷ. The linear function is estimated by the prediction equation: ŷ = a + b(x). This line is the best line: the line closest to all data points. In the prediction equation, a = ȳ – bx̄ and:

$b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x - \bar{x})^2}$
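The two formulas for a and b translate directly into code. The following is a minimal sketch in plain Python; the helper name `least_squares` and the five data points are invented for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) of the least squares line y-hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
print(f"y-hat = {a:.1f} + {b:.1f}x")  # prints: y-hat = 2.2 + 0.6x
```

Here x̄ = 3 and ȳ = 4, the numerator of b is 6 and the denominator is 10, so b = 0.6 and a = 4 – 0.6 · 3 = 2.2.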

A regression outlier is a data point far outside the trend of the other data points. It's called influential when removing it would cause a big change for the prediction equation. The effect is smaller for large datasets. Sometimes it's better for the prediction equation to leave the outlier out and explain this when reporting the results.

The prediction equation estimates the values of y, but they won't completely match the actual observed values. Studying the differences indicates the quality of the prediction equation. The difference between an observed value (y) and the predicted value (ŷ) is called a residual: y – ŷ. When the observed value is bigger than the predicted value, the residual is positive. When the observed value is smaller, the residual is negative. The smaller the absolute value of the residual, the better the prediction is.

The best prediction equation has the smallest residuals. To find it, the SSE (sum of squared errors) is used. SSE tells how good or bad ŷ is in predicting y. The formula of the SSE is: Σ (y – ŷ)².

The least squares estimates a and b in the least squares line ŷ = a + b(x) have the values for which SSE is as small as possible. It results in the best possible line that can be drawn. In most software SSE is called the residual sum of squares.

The least squares line has both negative and positive residuals, of which the sum and the mean are 0. In the SSE they all become positive by squaring them. The line passes through the mean of x and the mean of y, so it passes through (x̄, ȳ), the center of the data.
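These properties of the least squares line can be verified numerically. The sketch below uses a small made-up data set; the helper `sse_for` is a hypothetical name introduced only to compare the fitted line with slightly different lines.

```python
# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares estimates.
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

def sse_for(a_try, b_try):
    """Sum of squared errors for the line y-hat = a_try + b_try * x."""
    return sum((y - (a_try + b_try * x)) ** 2 for x, y in zip(xs, ys))

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sse_for(a, b)

print("sum of residuals:", round(sum(residuals), 10))
print("SSE of least squares line:", round(sse, 4))
print("SSE with a slightly different slope:", round(sse_for(a, b + 0.1), 4))
print("line at x-bar equals y-bar:", abs((a + b * x_bar) - y_bar) < 1e-9)
```

Any other choice of intercept or slope gives a larger SSE than the least squares line on the same data.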

9.3 What is a linear regression model?

In y = α + β (x) every x-value corresponds to exactly one y-value. This is a deterministic model. Usually this isn't how reality works. For instance when age (x) predicts the number of relationships someone has been in (y), then not everybody has had the same number at age 22. In that case a probabilistic model is better: a model that allows variability in the y-value. The data can then be described by a conditional distribution, a distribution of y that has the extra condition that x has a certain value.

A probabilistic model describes the mean of the y-values, not the actual values. The mean of a conditional distribution is given by E(y) = α + β (x). The symbol E means the expected value. When for instance people aged 22 have had different numbers of relationships, the probabilistic model can predict the mean number of relationships.

A regression function is a mathematical equation that describes how the mean of the response variable changes when the value of the explanatory variable changes.

Another parameter of the linear regression model is the standard deviation of a conditional distribution, σ. This parameter measures the variability of the y-values for all persons with a certain x-value. This is called the conditional standard deviation.

Because the real standard deviation is unknown, the sample standard deviation is used:

$s = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$

The assumption is made that the standard deviation is the same for every x-value. If the variability differs between the conditional distributions, then s indicates the mean variability. The Mean Square Error (MSE) is s squared. In software the conditional standard deviation has several names: Standard error of the estimate (SPSS), Residual standard error (R), Root MSE (Stata and SAS).

The degrees of freedom for a regression function are df = n – p, in which p is the number of unknown parameters. In E(y) = α + β (x) there are two unknown parameters (α and β) so df = n – 2.

The conditional standard deviation depends both on y and on x and is written as σ_y|x (for the population) and s_y|x (for the sample), shortened to σ and s. In a marginal distribution the standard deviation only depends on y, so this is written as σ_y (for the population) and s_y (for the sample). The point estimate of the marginal standard deviation is:

$s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

The numerator under the root, Σ (y – ȳ)², is the total sum of squares. The marginal standard deviation (independent of x) and the conditional standard deviation (dependent on a certain x) can be different.
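The difference between the two standard deviations is easiest to see side by side. This is a minimal sketch on made-up data: the conditional estimate s divides SSE by n – 2, the marginal estimate s_y divides the total sum of squares by n – 1 (which is what `statistics.stdev` computes).

```python
import math
import statistics

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares line.
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Conditional standard deviation: spread of y around the line, df = n - 2.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Marginal standard deviation: spread of y around y-bar, df = n - 1.
s_y = statistics.stdev(ys)

print(f"s = {s:.3f}, s_y = {s_y:.3f}, MSE = {s ** 2:.3f}")
```

For these points s is smaller than s_y, because the y-values vary less around the line than around their overall mean.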

9.4 How does the correlation measure the association of a linear function?

The slope tells how steep a line is and whether the association is negative or positive, but it doesn't tell how strong the association between two variables is.

The strength of the association is measured by the correlation (r). This is a standardized version of the slope. It is also called the standardized regression coefficient or Pearson correlation. The correlation is the value that the slope would have if the two variables had equal standard deviations. The formula is:

$r = \frac{\sum (x - \bar{x})(y-\bar{y})}{\sqrt{[\sum (x - \bar{x})^2][\sum (y - \bar{y})^2]}}$

In terms of the slope (b), the correlation is: r = (s_x / s_y) b, in which s_x is the standard deviation of x and s_y is the standard deviation of y.
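Both expressions for r give the same number, which can be checked with a short sketch. The data are made up for illustration; `statistics.stdev` is used for the sample standard deviations s_x and s_y.

```python
import math
import statistics

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Correlation from its definition.
r = sxy / math.sqrt(sxx * syy)

# Correlation as the standardized slope.
b = sxy / sxx
r_from_slope = (statistics.stdev(xs) / statistics.stdev(ys)) * b

print(f"r = {r:.4f}, (s_x / s_y) * b = {r_from_slope:.4f}")
```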

The correlation has the following characteristics:

It can only be used if a straight line makes sense.
It lies between -1 and 1.
It has the same sign (positive or negative) as b.
If b is 0, then r is 0, because then there is no slope and no linear association.
The larger the absolute value of r, the stronger the linear association. If r is exactly -1 or 1, then the linear association is perfectly negative or perfectly positive, without errors.
The r does not depend on units of measurement.

The correlation implies regression towards the mean. When x increases by one standard deviation, the predicted value of y changes by r standard deviations. Because r lies between -1 and 1, the predicted y is relatively closer to its mean than x is to its mean.

The coefficient of determination r² is r squared and it indicates how well x can predict y. It measures how well the least squares line ŷ = a + b(x) predicts y compared to the prediction of ȳ.

The r² has four elements:

Rule 1: y is predicted, no matter what x is. The best prediction then is the sample mean ȳ.
Rule 2: y is predicted by x. The prediction equation ŷ = a + b(x) predicts y.
E₁ are the errors of rule 1 and E₂ are the errors of rule 2.
The proportional reduction in error is the coefficient of determination: r² = (E₁ - E₂) / E₁, in which E₁ = Σ (y – ȳ)², the total sum of squares (TSS), and E₂ = Σ (y – ŷ)², the SSE.
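The four elements above can be sketched in a few lines. The example uses made-up data and compares the errors of predicting with ȳ (rule 1) against the errors of predicting with the least squares line (rule 2).

```python
import math

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar

# Rule 1: always predict y-bar. E1 is the total sum of squares (TSS).
e1 = sum((y - y_bar) ** 2 for y in ys)

# Rule 2: predict with the least squares line. E2 is the SSE.
e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = (e1 - e2) / e1

# The same value follows from squaring the correlation.
r = sxy / math.sqrt(sxx * e1)

print(f"TSS = {e1:.1f}, SSE = {e2:.1f}, r squared = {r_squared:.2f}")
```

In this example using x reduces the squared prediction error from 6 to 2.4, a reduction of 60%.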

R-squared has a number of characteristics similar to r:

Because r is between -1 and 1, r² is between 0 and 1.
When SSE = 0, then r² = 1. All points are on the line.
When b = 0, then r² = 0.
The closer r² is to 1, the stronger the linear association is.
The r² does not depend on the units of measurement or on which of the two variables is treated as the explanatory one.

The TSS describes the variability of the observations of y around their mean ȳ. The SSE describes the variability of the observations around the prediction equation. The coefficient of determination indicates approximately by what proportion the variance of a conditional distribution is smaller than the variance of the marginal distribution. Because the coefficient of determination doesn't use the original scale but a squared version, some researchers prefer the standard deviation and the correlation, because the information they give is easier to interpret.

9.5 How do you predict the slope and the correlation?

For categorical variables, the chi-squared test is used to test for independence. For quantitative variables, a significance test of the slope or of the correlation provides a test for independence.

The assumptions for inference applied to regression are:

Randomization
The mean of y is approximated by E(y) = α + β (x)
The conditional standard deviation σ is equal for every x-value
The conditional distribution of y for every x-value is a normal distribution

The null hypothesis is H₀ : β = 0 (in that case there is no slope and the variables are independent), the alternative hypothesis is H_a : β ≠ 0.

The t-score is found by dividing the sample slope (b) by the standard error of b. The formula is t = b / se. This formula is similar to the formula for every t-score; the estimate minus the null hypothesis (0 in this case), divided by the standard error of the estimate. You can find the P-value for df = n – 2. The standard error of b is:

$se = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$ in which $s = \sqrt{\frac{SSE}{n-2}}$

The smaller the standard deviation s, the more precisely b estimates β. The standard error also becomes smaller when the x-values are more spread out, because the denominator Σ (x – x̄)² is then larger.

The population correlation is denoted by the Greek letter ρ. The ρ is 0 in the same situations in which β = 0. A test of H₀ : ρ = 0 is performed in the same way as a test for the slope. For the correlation the formula is:
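The test statistic for the slope can be computed step by step. This is a minimal sketch on made-up data; it stops at the t-score and the degrees of freedom, because looking up the P-value needs a t table or statistical software.

```python
import math

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Conditional standard deviation s = sqrt(SSE / (n - 2)).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t-score for H0: beta = 0.
se = s / math.sqrt(sxx)
t = (b - 0) / se
df = n - 2

print(f"b = {b:.2f}, se = {se:.4f}, t = {t:.3f}, df = {df}")
```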

$t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$
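The t-score based on r is the same number as the t-score based on the slope, so the two tests of independence agree. The sketch below checks this on made-up data; the function name `t_from_r` is hypothetical.

```python
import math

def t_from_r(r, n):
    """t-score for H0: rho = 0, given the sample correlation r and sample size n."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# t-score from the correlation.
r = sxy / math.sqrt(sxx * syy)
t_corr = t_from_r(r, n)

# t-score from the slope: t = b / se.
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t_slope = b / se

print(f"t from r = {t_corr:.3f}, t from slope = {t_slope:.3f}")
```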

When many variables possibly influence a response variable, these can be portrayed in a correlation matrix. The matrix shows the correlation for each pair of variables.

A confidence interval gives more information about a slope than an independence test. The confidence interval of the slope β is: b ± t(se).
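A confidence interval for the slope can be sketched as follows. The data are made up, and the critical value 3.182 is an assumption taken from a standard t table (95% confidence, df = 3); with other data the value has to be looked up for the appropriate df.

```python
import math

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

# Assumed critical value from a t table: 95% confidence, df = n - 2 = 3.
t_crit = 3.182

lower = b - t_crit * se
upper = b + t_crit * se

print(f"95% confidence interval for the slope: ({lower:.2f}, {upper:.2f})")
```

The interval runs from about -0.30 to 1.50. It contains 0, so with only five observations these data do not rule out β = 0, and the interval also shows how imprecise the estimate is.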

Calculating a confidence interval for a correlation is more difficult, because the sampling distribution isn't symmetrical unless ρ = 0.

The r² indicates how well x predicts y and it depends on TSS (the variability of the observations of y around ȳ) and SSE (the variability of the observations around the prediction equation). The difference, TSS – SSE, is called the regression sum of squares or the model sum of squares. This difference is the total variability in y that is explained by x using the least squares line.

9.6 What happens when the assumptions of a linear model are violated?

Often the assumption is made that a linear association exists. It's important to check the data in a scatterplot first to see whether a linear model makes sense. If the data is U-shaped, then a straight line doesn't make sense. Making this error could cause the result of an independence test of the slope to be wrong.

Other assumptions are that the conditional distribution of y is normal and that σ is identical for every x-value. Even when the distribution isn't normal, the least squares line, the correlation and the coefficient of determination are still useful. But if the standard deviation isn't equal, then other methods are more efficient than the least squares line.

Some outliers have big effects on the regression lines and the correlations. Sometimes outliers need to be taken out. Even one point can have a big influence, particularly for a small sample.

The assumption of randomization, both for x and y, is important for the correlation. If there is no randomization and the sample covers only a small range of x-values, then the sample correlation will tend to be small and will underestimate the population correlation. For other aspects of regression, like the slope, the assumption of randomization is less important.

The prediction equation shouldn't be extrapolated and used for (non-existent) data points outside of the range of the observed data. This could give absurd results, like things that are physically impossible.

The mean of y for a certain value of x does not describe each individual observation exactly. The Greek letter epsilon (ε) denotes the error term: how much y differs from the mean. The population model is y = α + β x + ε and the sample prediction equation is y = a + bx + e. The ε is also called the population residual.
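The relation between the population model and the sample equation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parameter values α = 1, β = 2 and σ = 1, the sample size and the uniform range of x are all invented, and ε is drawn from a normal distribution.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population parameters.
alpha, beta, sigma = 1.0, 2.0, 1.0
n = 200

# Population model: y = alpha + beta * x + epsilon.
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Sample prediction equation: y-hat = a + b * x.
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Sample residuals e = y - y-hat estimate the unobserved errors epsilon.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"a = {a:.2f} (alpha = {alpha}), b = {b:.2f} (beta = {beta})")
```

The estimates a and b land close to α and β but not exactly on them, and the individual observations scatter around the line because of ε.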

A model is only an approximation of reality. It shouldn't be too simple. If a model is too simple, it should be adjusted.

This content is related to:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
