How do linear regression and correlation work? – Chapter 9

9.1 What are linear associations?

Regression analysis is the process of studying associations between a quantitative response variable and an explanatory variable. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) constructing a regression equation to predict the value of the response variable from the explanatory variable.

The response variable is denoted by y and the explanatory variable by x. A linear function describes a straight line through the data points in a graph: y = α + βx. Here alpha (α) is the y-intercept and beta (β) is the slope.

The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

The y-intercept is the value of y when x = 0. In that case βx equals 0, so only y = α remains. The y-intercept is the point where the line crosses the y-axis.

The slope (β) indicates the change in y for an increase of 1 in x, so the slope shows how steep the line is: the larger the absolute value of β, the steeper the line.

When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.

A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.

9.2 What is the least squares prediction equation?

In regression analysis, α and β are regarded as unknown parameters that are estimated from the available data. Each observation is a point in a graph with coordinates (x, y). A scatterplot serves as a visual check of whether a linear function makes sense; if the data is U-shaped, a straight line doesn't.

The variable y is estimated by ŷ. The equation is estimated by the prediction equation ŷ = a + bx. This line is the best line: the line closest to all data points. In the prediction equation, a = ȳ – bx̄ and:

b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)²
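As a sketch of how these estimates are computed, the formulas for a and b can be applied directly to a small dataset (the x and y values below are made up purely for illustration):

```python
# Least squares estimates from scratch, on hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
# a = ȳ - b·x̄
a = y_bar - b * x_bar

print(f"ŷ = {a:.1f} + {b:.1f}x")  # ŷ = 2.2 + 0.6x
```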

A regression outlier is a data point far outside the trend of the other data points. It's called influential when removing it would cause a big change in the prediction equation. The effect is smaller for large datasets. Sometimes it's better for the prediction equation to leave the outlier out and to explain this when reporting the results.

The prediction equation estimates the values of y, but they won't completely match the actual observed values. Studying the differences indicates the quality of the prediction equation. The difference between an observed value (y) and the predicted value (ŷ) is called a residual, this is y – ŷ. When the observed value is bigger, the residual is positive. When the observed value is smaller, the residual is negative. The smaller the absolute value of the residual, the better the prediction is.

The best prediction equation has the smallest residuals. To find it, the SSE (sum of squared errors) is used. SSE indicates how well ŷ predicts y: SSE = Σ(y – ŷ)².

The least squares estimates a and b in the least squares line ŷ = a + bx are the values for which SSE is as small as possible. This yields the best possible line that can be drawn. In most software, SSE is called the residual sum of squares.

The best regression line has both negative and positive residuals (squaring makes every term in the SSE positive); the sum and the mean of the residuals are 0. The best line passes through (x̄, ȳ), the point formed by the mean of x and the mean of y: the center of the data.
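These two properties, residuals summing to zero and the line passing through (x̄, ȳ), can be verified numerically. A minimal sketch with hypothetical data:

```python
# Checking the residual properties of the least squares line
# on a small hypothetical dataset (made-up numbers).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# The residuals sum (and average) to zero...
print(abs(sum(residuals)) < 1e-12)            # True
# ...and the line passes through the center of the data (x̄, ȳ).
print(abs((a + b * x_bar) - y_bar) < 1e-12)   # True
```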

9.3 What is a linear regression model?

In y = a + bx there is exactly one y-value for every x-value. This is a deterministic model. Usually reality doesn't work this way. For instance, when age (x) predicts the number of relationships someone has been in (y), not everybody has had the same number at age 22. In that case a probabilistic model is better: a model that allows variability in the y-values. The data can then be described by a conditional distribution, a distribution with the extra condition that x has a certain value.

A probabilistic model describes the mean of the y-values, not the individual values. The formula for the conditional mean is E(y) = α + βx, where the symbol E denotes the expected value. When, for instance, people aged 22 have had different numbers of relationships, the probabilistic model predicts the mean number of relationships.

A regression function is a mathematical equation that describes how the mean of the response variable changes when the value of the explanatory variable changes.

Another parameter of the linear regression model is the standard deviation of a conditional distribution, σ. This parameter measures the variability of the y-values for all subjects with a certain x-value. It is called the conditional standard deviation.

Because the real standard deviation is unknown, the sample estimate is used:

s = √(SSE / (n – 2)) = √(Σ(y – ŷ)² / (n – 2))

The assumption is made that the standard deviation is the same for every x-value. If the variability differed per value of x, s would indicate the mean variability. The Mean Square Error (MSE) is s squared. In software the conditional standard deviation has several names: Standard error of the estimate (SPSS), Residual standard error (R), Root MSE (Stata and SAS).

The degrees of freedom for a regression function are df = n – p, in which p is the number of unknown parameters. In E(y) = α + βx there are two unknown parameters (α and β), so df = n – 2.
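Putting the pieces together, a short sketch with hypothetical data shows how s, MSE and df are computed from the fitted line:

```python
import math

# Conditional standard deviation s and MSE on a small hypothetical
# dataset (made-up numbers; the fit gives SSE = 2.4 here).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
df = n - 2               # two parameters (α and β) are estimated
s = math.sqrt(sse / df)  # conditional standard deviation
mse = s ** 2             # Mean Square Error

print(df, round(s, 3))   # 3 0.894
```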

The conditional standard deviation depends on both y and x and is written σy|x (for the population) and sy|x (for the sample), shortened to σ and s. In a marginal distribution the standard deviation depends only on y, written σy (for the population) and sy (for the sample). A point estimate of this marginal standard deviation is:

sy = √(Σ(y – ȳ)² / (n – 1))

The numerator under the root, Σ(y – ȳ)², is the total sum of squares. The marginal standard deviation (independent of x) and the conditional standard deviation (given a certain x) can differ.

9.4 How does the correlation measure the association of a linear function?

The slope tells how steep a line is and whether the association is negative or positive, but it doesn't tell how strong the association between two variables is.

The association is measured by the correlation (r), a standardized version of the slope, also called the standardized regression coefficient or Pearson correlation. The correlation is the value the slope would have if the two variables had equal variability. The formula is:

r = Σ(x – x̄)(y – ȳ) / √(Σ(x – x̄)² Σ(y – ȳ)²)

Expressed in terms of the slope (b): r = (sx / sy) b, in which sx is the standard deviation of x and sy is the standard deviation of y.
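Both routes give the same number, which a short sketch with hypothetical data can confirm:

```python
import math

# The correlation computed two ways on made-up illustration data:
# as the standardized slope r = (sx/sy)·b, and directly from sums of squares.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
ssy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / ssx
sx = math.sqrt(ssx / (n - 1))   # sample standard deviation of x
sy = math.sqrt(ssy / (n - 1))   # sample standard deviation of y

r = (sx / sy) * b               # standardized slope
r_direct = sxy / math.sqrt(ssx * ssy)

print(round(r, 4), round(r_direct, 4))  # 0.7746 0.7746
```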

The correlation has the following characteristics:

  • It can only be used if a straight line makes sense.

  • It lies between -1 and 1.

  • It has the same sign (positive or negative) as b.

  • If b is 0, then r is 0, because then there is no slope and no association.

  • The larger the absolute value of r, the stronger the linear association. If r is exactly -1 or 1, the linear association is perfectly negative or perfectly positive, without errors.

  • The r does not depend on units of measurement.

The correlation implies regression toward the mean: when x changes by one standard deviation, the predicted value of y changes by only r standard deviations. Because |r| ≤ 1, the predicted y lies relatively closer to its mean than x does to its mean.

The coefficient of determination, r², is r-squared and indicates how well x predicts y. It measures how well the least squares line ŷ = a + bx predicts y compared to the prediction ȳ.

The r² is built from four elements:

  1. Rule 1: y is predicted without using x. The best prediction is then the sample mean ȳ.

  2. Rule 2: y is predicted by x. The prediction equation ŷ = a + b(x) predicts y.

  3. E1 are the errors of rule 1 and E2 are the errors of rule 2.

  4. The proportional reduction in error is the coefficient of determination: r² = (E1 – E2) / E1, in which E1 = Σ(y – ȳ)², the total sum of squares (TSS), and E2 = Σ(y – ŷ)², the SSE.

R-squared has a number of characteristics similar to r:

  • Because r lies between -1 and 1, r² lies between 0 and 1.

  • When SSE = 0, then r2 = 1. All points are on the line.

  • When b = 0, then r2 = 0.

  • The closer r2 is to 1, the stronger the linear association is.

  • The units of measurement and which variable is the explanatory one (x or y), don't matter for r2.

The TSS describes the variability in the observations of y. The SSE describes the variability around the prediction equation. The coefficient of determination indicates the proportion by which the variance of the conditional distribution is smaller than that of the marginal distribution. Because the coefficient of determination uses a squared scale instead of the original one, some researchers prefer the standard deviation and the correlation, whose information is easier to interpret.
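The proportional-reduction-in-error idea can be sketched directly; hypothetical data, with TSS the errors of rule 1 and SSE the errors of rule 2:

```python
# r² as the proportional reduction in error, on made-up illustration data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

tss = sum((yi - y_bar) ** 2 for yi in y)                     # rule 1: predict ȳ
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # rule 2: predict ŷ
r2 = (tss - sse) / tss

print(round(r2, 2))  # 0.6: x explains 60% of the variability in y
```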

9.5 How do you predict the slope and the correlation?

For categorical variables, the chi-squared test is used to test for independence. For quantitative variables, a significance test of the slope or of the correlation provides a test for independence.

The assumptions for inference applied to regression are:

  • Randomization

  • The mean of y is approximated by E(y) = α + βx

  • The conditional standard deviation σ is equal for every x-value

  • The conditional distribution of y for every x-value is a normal distribution

The null hypothesis is H0 : β = 0 (in that case there is no slope and the variables are independent), the alternative hypothesis is Ha : β ≠ 0.

The t-score is found by dividing the sample slope (b) by the standard error of b: t = b / se. This follows the general form of a t-score: the estimate minus the null hypothesis value (0 in this case), divided by the standard error of the estimate. The P-value is found using df = n – 2. The standard error of b is:

se = s / √(Σ(x – x̄)²), in which s = √(SSE / (n – 2))

The smaller the standard deviation s, the more precise b estimates β.
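The full test statistic can be sketched end to end on hypothetical data:

```python
import math

# Significance test of the slope on made-up illustration data:
# t = b / se, with se = s / √Σ(x - x̄)² and df = n - 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))    # conditional standard deviation
se = s / math.sqrt(ssx)         # standard error of b
t = b / se

print(round(t, 3))  # 2.121; look up the P-value in a t-table with df = 3
```

The same t-value results from the correlation-based formula in the next paragraph, since testing β = 0 and testing ρ = 0 are equivalent.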

The population correlation is denoted by the Greek letter ρ. The ρ is 0 in exactly the situations in which β = 0. A test of H0 : ρ = 0 is performed in the same way as a test for the slope. For the correlation the formula is:

t = r √((n – 2) / (1 – r²))

When many variables possibly influence a response variable, these can be portrayed in a correlation matrix. For each variable the correlation can be calculated.

A confidence interval gives more information about a slope than an independence test. The confidence interval of the slope β is: b ± t(se).
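A sketch of this interval on hypothetical data; the critical value 3.182 (t with df = 3, 95% confidence) is taken from a t-table:

```python
import math

# 95% confidence interval for the slope, b ± t·se, on made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2)) / math.sqrt(ssx)

t_crit = 3.182                        # t-table value for df = n - 2 = 3
lower, upper = b - t_crit * se, b + t_crit * se
print(f"({lower:.2f}, {upper:.2f})")  # (-0.30, 1.50)
```

Because 0 lies inside this interval, the hypothesis of independence (β = 0) can't be rejected for this toy dataset, consistent with the t-test above the 5% threshold.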

Calculating a confidence interval for a correlation is more difficult, because the sampling distribution isn't symmetrical unless ρ = 0.

R² indicates how well x predicts y and depends on TSS (the variability of the observations of y) and SSE (the variability around the prediction equation). The difference, TSS – SSE, is called the regression sum of squares or the model sum of squares: the part of the total variability in y that is explained by x using the least squares line.

9.6 What happens when the assumptions of a linear model are violated?

Often the assumption is made that a linear association exists. It's important to check the data in a scatterplot first to see whether a linear model makes sense. If the data is U-shaped, then a straight line doesn't make sense. Making this error could cause the result of an independence test of the slope to be wrong.

Other assumptions are that the conditional distribution is normal and that σ is identical for every x-value. Even when the distribution isn't normal, the least squares line, the correlation and the coefficient of determination remain useful. But if the standard deviation isn't equal for every x-value, other methods are more efficient than the least squares line.

Some outliers have big effects on the regression lines and the correlations. Sometimes outliers need to be taken out. Even one point can have a big influence, particularly for a small sample.

The assumption of randomization, both for x and y, is important for the correlation. If the sample is not random and its variability is small, the sample correlation will be small and will underestimate the population correlation. For other aspects of regression, like the slope, the assumption of randomization is less important.

The prediction equation shouldn't be extrapolated and used for (non-existent) data points outside of the range of the observed data. This could have absurd results, like things that are physically impossible.

The theoretical risk exists that the mean of y for a certain value of x doesn't estimate the actual individual observation properly. The Greek letter epsilon (ε) denotes the error term; how much y differs from the mean. The population model is y = α + β x + ε and the sample prediction equation is y = a + bx + e. The ε is also called the population residual.

A model is only an approximation of reality. It shouldn't be too simple. If a model is too simple, it should be adjusted.

 

Follow the author: Annemarie JoHo