Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association, and 3) constructing a regression equation to predict the value of the response variable from the explanatory variable.
The response variable is denoted as y and the explanatory variable as x. A linear function means that the data points in a graph follow a straight line. A linear function is: y = α + βx. Here alpha (α) is the y-intercept and beta (β) is the slope.
The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.
The y-intercept is the value of y when x = 0. In that case βx equals 0, so only y = α remains. The y-intercept is the point where the line crosses the y-axis.
The slope (β) indicates the change in y for an increase of 1 in x, so the slope indicates how steep the line is: the larger the absolute value of β, the steeper the line.
When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
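As an illustration, the linear function and the effect of the slope's sign can be written as a small Python helper (the values of α and β below are arbitrary):

```python
def linear(x, alpha, beta):
    """Evaluate the linear function y = alpha + beta * x."""
    return alpha + beta * x

print(linear(0, 2.0, 0.5))   # x = 0 gives the y-intercept alpha: 2.0
print(linear(1, 2.0, 0.5))   # positive beta: y increases with x -> 2.5
print(linear(1, 2.0, -0.5))  # negative beta: y decreases with x -> 1.5
print(linear(1, 2.0, 0.0))   # beta = 0: y stays constant at alpha -> 2.0
```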
A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.
In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.
The variable y is estimated by ŷ. The equation is estimated by the prediction equation: ŷ = a + bx. This is the best line; the line closest to all data points. In the prediction equation, a = ȳ – bx̄ and:

b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)²
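A minimal sketch of these estimates on made-up data:

```python
import numpy as np

# Made-up data purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

# Least squares slope and intercept, following the formulas above.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"prediction equation: y-hat = {a:.2f} + {b:.2f} x")
```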
A regression outlier is a data point far outside the trend of the other data points. It is called influential when removing it would cause a big change in the prediction equation. The effect is smaller for large datasets. Sometimes it is better for the prediction equation to leave the outlier out and to explain this when reporting the results.
The prediction equation estimates the values of y, but they won't completely match the actual observed values. Studying the differences indicates the quality of the prediction equation. The difference between an observed value (y) and the predicted value (ŷ) is called a residual, this is y – ŷ. When the observed value is bigger, the residual is positive. When the observed value is smaller, the residual is negative. The smaller the absolute value of the residual, the better the prediction is.
The best prediction equation has the smallest residuals. To find it, the SSE (sum of squared errors) is used. The SSE tells how good or bad ŷ is at predicting y. The formula of the SSE is: SSE = Σ(y – ŷ)².
The least squares estimates a and b in the least squares line ŷ = a + bx have the values for which the SSE is as small as possible. This results in the best possible line that can be drawn. In most software the SSE is called the residual sum of squares.
The best regression line has both negative and positive residuals (which all become positive by squaring them in the SSE), and the sum and the mean of the residuals are 0. The best line passes through the mean of x and the mean of y, so it passes through (x̄, ȳ), the center of the data.
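A sketch that checks these properties numerically, with the made-up data repeated so the snippet runs on its own:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)                    # observed minus predicted
print(np.isclose(residuals.sum(), 0))          # residuals sum to 0: True
print(np.isclose(a + b * x.mean(), y.mean()))  # line passes through (x-bar, y-bar): True
print(np.sum(residuals ** 2))                  # the SSE
```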
In y = a + bx there is exactly one y-value for every x-value. This is a deterministic model. Usually this is not how reality works. For instance, when age (x) predicts the number of relationships someone has been in (y), not everybody has had the same number at age 22. In that case a probabilistic model is better: a model that allows variability in the y-values. The data can then be visualized in a conditional distribution, a distribution with the extra condition that x has a certain value.
A probabilistic model describes the mean of the y-values, not the actual values. The formula for the mean of the conditional distribution is E(y) = α + βx, in which the symbol E means the expected value. When for instance people aged 22 have had different numbers of relationships, the probabilistic model predicts the mean number of relationships.
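A small simulation sketch of this idea; the values of α, β and σ below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: E(y) = alpha + beta * x, with normal scatter around that mean.
alpha, beta, sigma = 0.5, 0.2, 1.5
x = 22                                                  # e.g. age 22
y_values = alpha + beta * x + rng.normal(0, sigma, 10_000)

print(alpha + beta * x)   # the model mean E(y) at x = 22: 4.9
print(y_values.mean())    # the simulated mean is close to E(y); individual values vary
```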
A regression function is a mathematical equation that describes how the mean of the response variable changes when the value of the explanatory variable changes.
Another parameter of the linear regression model is the standard deviation of a conditional distribution, σ. This parameter measures the variability of the y-values for all persons with a certain x-value. It is called the conditional standard deviation.
Because the real standard deviation is unknown, the sample standard deviation is used:

s = √(SSE / (n – 2)) = √(Σ(y – ŷ)² / (n – 2))
The assumption is made that the standard deviation is the same for every x-value. If the variability differed per conditional distribution, then s would indicate an average variability. The Mean Square Error (MSE) is s squared. In software the conditional standard deviation has several names: Standard error of the estimate (SPSS), Residual standard error (R), Root MSE (Stata and SAS).
The degrees of freedom for a regression function are df = n – p, in which p is the number of unknown parameters. In E(y) = α + βx there are two unknown parameters (α and β), so df = n – 2.
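A self-contained sketch of s, the MSE and the degrees of freedom on the made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

sse = np.sum((y - (a + b * x)) ** 2)
df = len(y) - 2          # two parameters (alpha and beta) are estimated
s = np.sqrt(sse / df)    # conditional standard deviation
print(df, s, s ** 2)     # s**2 is the Mean Square Error (MSE)
```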
The conditional standard deviation depends both on y and on x and is written as σy|x (for the population) and sy|x (for the sample), shortened to σ and s. In a marginal distribution the standard deviation only depends on y, so it is written as σy (for the population) and sy (for the sample). The formula of a point estimate of the marginal standard deviation is:

sy = √(Σ(y – ȳ)² / (n – 1))
The numerator inside the root, Σ(y – ȳ)², is the total sum of squares. The marginal standard deviation (independent of x) and the conditional standard deviation (dependent on a certain x) can be different.
The slope tells how steep a line is and whether the association is negative or positive, but it doesn't tell how strong the association between two variables is.
The association is measured by the correlation (r). This is a standardized version of the slope; it is also called the standardized regression coefficient or the Pearson correlation. The correlation is the value that the slope would have if the two variables had equal variability. The formula is:

r = Σ(x – x̄)(y – ȳ) / √(Σ(x – x̄)² · Σ(y – ȳ)²)
Expressed in terms of the slope (b): r = (sx / sy) b, in which sx is the standard deviation of x and sy is the standard deviation of y.
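A sketch verifying on made-up data that this standardized slope equals the Pearson correlation computed directly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

r = (x.std(ddof=1) / y.std(ddof=1)) * b  # r as the standardized slope
print(r)
print(np.corrcoef(x, y)[0, 1])           # same value from the Pearson formula
```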
The correlation has the following characteristics:
It can only be used if a straight line makes sense.
It lies between -1 and 1.
It has the same sign (positive or negative) as b.
If b is 0, then r is 0, because then there is no slope and no association.
As the absolute value of r increases, the linear association becomes stronger. If r is exactly -1 or 1, then the linear association is perfectly negative or perfectly positive, without errors.
The r does not depend on units of measurement.
The correlation implies regression toward the mean: when x increases by one standard deviation, the predicted value of y changes by only r standard deviations.
The coefficient of determination r² is r squared and indicates how well x can predict y. It measures how well the least squares line ŷ = a + bx predicts y compared to the prediction ȳ.
The r² has four elements:
Rule 1: y is predicted without using x. The best prediction is then the sample mean ȳ.
Rule 2: y is predicted using x, with the prediction equation ŷ = a + bx.
E1 is the total error of rule 1 and E2 is the total error of rule 2.
The proportional reduction in error is the coefficient of determination: r² = (E1 – E2) / E1, in which E1 = Σ(y – ȳ)², the total sum of squares (TSS), and E2 = Σ(y – ŷ)², the SSE.
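A sketch of this proportional reduction in error on made-up data; the result matches the squared Pearson correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

tss = np.sum((y - y.mean()) ** 2)      # E1: error when predicting with y-bar
sse = np.sum((y - (a + b * x)) ** 2)   # E2: error when predicting with y-hat
print((tss - sse) / tss)               # the coefficient of determination
print(np.corrcoef(x, y)[0, 1] ** 2)    # equals r squared
```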
The r² has a number of characteristics similar to r:
Because r is between -1 and 1, r² lies between 0 and 1.
When SSE = 0, then r² = 1: all points are on the line.
When b = 0, then r² = 0.
The closer r² is to 1, the stronger the linear association is.
The units of measurement, and which variable is the explanatory one (x or y), don't matter for r².
The TSS describes the variability in the observations of y. The SSE describes the variability around the prediction equation. The coefficient of determination indicates by what proportion the variance of a conditional distribution is smaller than that of a marginal distribution. Because the coefficient of determination doesn't use the original scale but a squared version, some researchers prefer the standard deviation and the correlation, because the information they give is easier to interpret.
For categorical variables, the chi-squared test is used to test for independence. For quantitative variables, a significance test of the slope or of the correlation provides a test for independence.
The assumptions for inference applied to regression are:
Randomization
The mean of y is approximated by E(y) = α + βx
The conditional standard deviation σ is equal for every x-value
The conditional distribution of y for every x-value is a normal distribution
The null hypothesis is H0 : β = 0 (in that case there is no slope and the variables are independent), the alternative hypothesis is Ha : β ≠ 0.
The t-score is found by dividing the sample slope (b) by the standard error of b: t = b / se. This follows the general form of a t-score: the estimate minus the null hypothesis value (0 in this case), divided by the standard error of the estimate. The P-value is found for df = n – 2. The standard error of b is:

se = s / √(Σ(x – x̄)²), in which s = √(SSE / (n – 2))
The smaller the conditional standard deviation s, the more precisely b estimates β.
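A minimal sketch of this t-test on the made-up data (SciPy is assumed to be available for the P-value):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
n = len(y)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))  # conditional standard deviation
se = s / np.sqrt(np.sum((x - x.mean()) ** 2))          # standard error of b
t = b / se
p = 2 * stats.t.sf(abs(t), df=n - 2)                   # two-sided P-value
print(t, p)
```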
The population correlation is denoted by the Greek letter ρ (rho). The ρ is 0 in the same situations in which β = 0. A test of H0 : ρ = 0 is performed in the same way as a test for the slope. For the correlation the formula is:

t = r √((n – 2) / (1 – r²))
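A quick sketch showing that, on the same made-up data, this correlation version yields the same t-value as the slope test:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
n = len(y)
r = np.corrcoef(x, y)[0, 1]

t = r * np.sqrt((n - 2) / (1 - r ** 2))  # same t-value as t = b / se
print(t)
```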
When many variables possibly influence a response variable, they can be displayed in a correlation matrix, which contains the correlation for each pair of variables.
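With pandas, a correlation matrix for a hypothetical dataset (variable names made up) can be produced like this:

```python
import pandas as pd

# Hypothetical dataset with several possible explanatory variables.
df = pd.DataFrame({
    "income":    [30, 42, 35, 50, 44],
    "education": [12, 16, 14, 18, 16],
    "age":       [25, 40, 33, 52, 47],
})
print(df.corr())  # pairwise Pearson correlations
```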
A confidence interval gives more information about a slope than an independence test. The confidence interval of the slope β is: b ± t(se).
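A sketch of this interval for the made-up data, at an assumed 95% confidence level:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])
n = len(y)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
se = s / np.sqrt(np.sum((x - x.mean()) ** 2))

t_crit = stats.t.ppf(0.975, df=n - 2)    # critical t-value for 95% confidence
print(b - t_crit * se, b + t_crit * se)  # confidence interval for the slope beta
```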
Calculating a confidence interval for a correlation is more difficult, because the sampling distribution isn't symmetrical unless ρ = 0.
The r² indicates how well x predicts y and it depends on the TSS (the variability of the observations of y) and the SSE (the variability around the prediction equation). The difference, TSS – SSE, is called the regression sum of squares or the model sum of squares. This difference is the part of the total variability in y that is explained by x using the least squares line.
Often the assumption is made that a linear association exists. It's important to check the data in a scatterplot first to see whether a linear model makes sense. If the data is U-shaped, then a straight line doesn't make sense. Making this error could cause the result of an independence test of the slope to be wrong.
Other assumptions are that the conditional distribution is normal and that σ is identical for every x-value. Even when the distribution isn't normal, the least squares line, the correlation and the coefficient of determination are still useful. But if the standard deviation isn't equal for every x-value, other methods are more efficient than least squares.
Some outliers have big effects on the regression lines and the correlations. Sometimes outliers need to be taken out. Even one point can have a big influence, particularly for a small sample.
The assumption of randomization, both for x and y, is important for the correlation. If there is no randomization and the variability is small, then the sample correlation will be small and it will underestimate the population correlation. For other aspects of regression, like the slope, the assumption of randomization is less important.
The prediction equation shouldn't be extrapolated and used for (non-existent) data points outside of the range of the observed data. This could have absurd results, like things that are physically impossible.
The mean of y for a certain value of x doesn't describe the individual observations exactly. The Greek letter epsilon (ε) denotes the error term: how much an observation y differs from the conditional mean. The population model is y = α + βx + ε and the sample prediction equation is y = a + bx + e. The ε is also called the population residual.
A model is only an approximation of reality. If it is too simple to describe the data well, it should be adjusted.