Discovering statistics using IBM SPSS statistics by Andy Field, fifth edition – Summary chapter 9

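Any straight line can be defined by two things: (1) its slope and (2) the point at which it crosses the vertical axis of the graph (the intercept). The general formula for the linear model is the following:

$$Y_i = (b_0 + b_1 X_i) + \varepsilon_i$$

Here $b_0$ is the intercept, $b_1$ is the slope and $\varepsilon_i$ is the error for case $i$.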

Regression analysis refers to fitting a linear model to data and using it to predict values of an outcome variable (dependent variable) from one or more predictor variables (independent variables). The residuals are the differences between what the model predicts and the actual outcome. The residual sum of squares is used to assess the ‘goodness-of-fit’ of the model on the data. The smaller the residual sum of squares, the better the fit.

Ordinary least squares (OLS) regression estimates the model for which the sum of squared errors is the minimum it can be given the data. The total sum of squares is the sum of squared differences between the observed scores and their mean; it represents how good the mean is as a model of the observed outcome scores. The model sum of squares is the sum of squared differences between the model's predictions and the mean; it represents how well the model can predict the data. The larger the model sum of squares, the better the model can predict the data. The residual sum of squares uses the differences between the observed data and the model and shows how much of the data the model cannot predict.
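The three sums of squares can be made concrete with a small sketch. It assumes a single predictor, uses only the Python standard library, and the function names and data are made up for illustration:

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimise the sum of squared residuals for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                # slope
    b0 = mean_y - b1 * mean_x     # intercept
    return b0, b1


def sums_of_squares(x, y):
    """Return (total, model, residual) sums of squares for the fitted line."""
    b0, b1 = fit_simple_ols(x, y)
    mean_y = sum(y) / len(y)
    predicted = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((pi - mean_y) ** 2 for pi in predicted)
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    return ss_total, ss_model, ss_residual


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_t, ss_m, ss_r = sums_of_squares(x, y)
print(ss_t, ss_m, ss_r)
```

For a least-squares fit with an intercept, the total sum of squares equals the model sum of squares plus the residual sum of squares, which is why a larger model sum of squares goes together with a smaller residual sum of squares.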

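The proportion of improvement due to the model compared to using the mean as a predictor can be calculated using the following formula:

$$R^2 = \frac{SS_M}{SS_T}$$

Here $SS_M$ is the model sum of squares and $SS_T$ is the total sum of squares.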

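This value represents the amount of variance in the outcome explained by the model relative to how much variation there was to explain. The F-statistic can be calculated using the following formulas:

$$F = \frac{MS_M}{MS_R}, \qquad MS_M = \frac{SS_M}{k}, \qquad MS_R = \frac{SS_R}{N - k - 1}$$

‘k’ represents the degrees of freedom for the model and denotes the number of predictors, $N$ is the sample size and $SS_R$ is the residual sum of squares.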

The F-statistic can also be used to test the significance of $R^2$, with the null hypothesis being that $R^2$ is zero. It uses the following formula:

$$F = \frac{(N - k - 1) R^2}{k (1 - R^2)}$$
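As a rough check, the two routes to F (via mean squares and via R-squared) can be compared numerically. The sketch below assumes the standard definitions, with N cases and k predictors; the function names and numbers are hypothetical:

```python
def r_squared(ss_model, ss_total):
    """Proportion of outcome variance explained by the model."""
    return ss_model / ss_total


def f_from_sums_of_squares(ss_model, ss_residual, n, k):
    """F as the ratio of the model mean square to the residual mean square."""
    ms_model = ss_model / k
    ms_residual = ss_residual / (n - k - 1)
    return ms_model / ms_residual


def f_from_r_squared(r2, n, k):
    """F computed directly from R-squared."""
    return ((n - k - 1) * r2) / (k * (1 - r2))


# Illustrative numbers only: 12 cases, 1 predictor.
ss_model, ss_residual, n, k = 80.0, 20.0, 12, 1
r2 = r_squared(ss_model, ss_model + ss_residual)
f_a = f_from_sums_of_squares(ss_model, ss_residual, n, k)
f_b = f_from_r_squared(r2, n, k)
print(r2, f_a, f_b)
```

Both routes give the same F up to floating-point rounding, because R-squared is itself built from the same sums of squares.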

Individual predictors can be tested using the t-statistic.

BIAS IN LINEAR MODELS
An outlier is a case that differs substantially from the main trend in the data. Standardized residuals, which are residuals converted to z-scores, can be used to check which residuals are unusually large. Three rules of thumb apply: (1) a standardized residual with an absolute value greater than 3.29 is considered an outlier; (2) if more than 1% of the sample cases have standardized residuals with an absolute value greater than 2.58, the level of error in the model may be unacceptable; and (3) if more than 5% of the cases have standardized residuals with an absolute value greater than 1.96, the model may be a poor representation of the data.
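These rules of thumb can be sketched as a simple screening function. It assumes the standardized residuals have already been computed; the function name and the example values are made up:

```python
def screen_standardized_residuals(z_residuals):
    """Apply the 3.29 / 2.58 / 1.96 rules of thumb to standardized residuals."""
    n = len(z_residuals)
    abs_z = [abs(z) for z in z_residuals]
    share_above_2_58 = sum(z > 2.58 for z in abs_z) / n
    share_above_1_96 = sum(z > 1.96 for z in abs_z) / n
    return {
        # Indices of cases whose |z| exceeds 3.29.
        "outliers": [i for i, z in enumerate(abs_z) if z > 3.29],
        "share_above_2_58": share_above_2_58,
        "share_above_1_96": share_above_1_96,
        # True when more than 1% of cases exceed 2.58.
        "error_may_be_unacceptable": share_above_2_58 > 0.01,
        # True when more than 5% of cases exceed 1.96.
        "model_may_fit_poorly": share_above_1_96 > 0.05,
    }


# Illustrative values only.
z = [0.1, -0.5, 2.0, -3.5, 1.0, 0.3, -0.2, 0.8, -1.1, 0.4]
report = screen_standardized_residuals(z)
print(report)
```

With only ten cases a single large residual already exceeds the percentage thresholds, so the percentages are more informative in larger samples.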

The studentized residual is the unstandardized residual divided by an estimate of its standard deviation. These residuals have the same properties as the standardized residuals but provide a more precise estimation of the error variance of a specific case.

Influential cases are cases that exert undue influence over the parameters of the model. To test whether a case is influential, the analysis can be rerun without that case to see how much the regression coefficients change.

The adjusted predicted value for a case is the predicted value of the outcome for that case from a model in which the case is excluded. The deleted residual is the difference between the adjusted predicted value and the original observed value. This can be divided by the standard error to give the studentized deleted residual. This residual can be compared across different regression analyses. Cook’s distance is a measure of the overall influence of a case on the model. The leverage assesses the influence of the observed value of the outcome variable over the predicted values.

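The average leverage can be calculated in the following way:

$$\text{average leverage} = \frac{k + 1}{n}$$

Here $k$ is the number of predictors and $n$ is the number of cases.

Leverage values lie between 0 (the case has no influence) and 1 (the case has complete influence over prediction). The centred leverage values that SPSS saves have a maximum of $(N - 1)/N$.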

If no cases exert undue influence over the model, then all leverage values should be close to the average. Values greater than twice or three times the average should be investigated.
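The leverage check can be sketched for the simplest case. The code below assumes a single predictor, for which the (uncentred) leverage of case i is 1/n + (x_i - mean)² / Σ(x_j - mean)², and takes the average leverage to be (k + 1)/n with k = 1. The function names and the data are hypothetical:

```python
def leverages_simple(x):
    """Uncentred leverage (hat value) of each case in a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def flag_high_leverage(x, multiple=2):
    """Return indices of cases whose leverage exceeds `multiple` times the average."""
    k = 1                                  # one predictor in this sketch
    average = (k + 1) / len(x)
    hat = leverages_simple(x)
    return [i for i, h in enumerate(hat) if h > multiple * average]


# Illustrative data only: the last case lies far from the other predictor values.
x = [1, 2, 3, 4, 20]
hat_values = leverages_simple(x)
flagged = flag_high_leverage(x, multiple=2)
print(hat_values, flagged)
```

Passing `multiple=3` gives the more lenient cut-off. With several predictors the leverages come from the diagonal of the hat matrix instead, which this sketch does not cover.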

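Mahalanobis distances measure the distance of cases from the mean(s) of the predictor variable(s). These values have a chi-square distribution with degrees of freedom equal to the number of predictors, so cases whose distance exceeds the critical chi-square value at the chosen alpha can be flagged as potentially influential.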
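A minimal sketch of this check, assuming a single predictor: in that case the squared Mahalanobis distance reduces to (x - mean)² divided by the sample variance, and the chi-square critical value for 1 degree of freedom at alpha = .05 is about 3.84. The function name and the data are made up:

```python
CHI_SQUARE_CRITICAL_DF1_ALPHA_05 = 3.84  # chi-square critical value for df = 1, alpha = .05


def mahalanobis_squared_1d(x):
    """Squared Mahalanobis distance of each case from the mean of one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    variance = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)  # sample variance
    return [(xi - mean_x) ** 2 / variance for xi in x]


# Illustrative data only: the last case is far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
d2 = mahalanobis_squared_1d(x)
flagged = [i for i, d in enumerate(d2) if d > CHI_SQUARE_CRITICAL_DF1_ALPHA_05]
print(flagged)
```

With more than one predictor the distance uses the inverse covariance matrix of the predictors, and the critical value changes with the degrees of freedom.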

There are several assumptions of the general linear model:

  1. Additivity and linearity
    The outcome variable should be linearly related to any predictors and if there are several predictors, the effects should be added together.
  2. Independent errors
    The residual terms should be uncorrelated for any two observations. This can be tested using the Durbin-Watson test. The statistic ranges from 0 to 4, and a value of 2 means the residuals are uncorrelated.
  3. Homoscedasticity
    At each level of the independent variable, the residual terms should have the same variance. A violation can be overcome by using a weighted least squares regression.
  4. Normally distributed errors
    The residuals in the model are random, normally distributed variables with a mean of 0.
  5. Predictors are uncorrelated with external variables
    Independent variables should not be correlated with a third variable as this weakens the conclusions you can draw.
  6. Variable types
    All predictor variables must be quantitative or categorical. All outcome variables must be quantitative, continuous and unbounded (take the whole range of values instead of a restricted range).
  7. No perfect multicollinearity
    There should be no perfect relationship between two or more of the independent variables.
  8. Non-zero variance
    The independent variable should have some variation in value.
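The Durbin-Watson statistic mentioned under independent errors can be sketched directly from its usual definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. The function name and the residuals below are hypothetical, and the residuals are assumed to be in a meaningful order (for example, order of collection):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in their observed order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Illustrative residuals only.
alternating = [1, -1, 1, -1]   # adjacent residuals have opposite signs
clustered = [1, 1, -1, -1]     # adjacent residuals tend to share a sign
print(durbin_watson(alternating), durbin_watson(clustered))
```

Values above 2 point towards negative correlation between adjacent residuals, and values below 2 towards positive correlation.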

Violating most of these assumptions mainly has consequences for significance tests and confidence intervals. This in turn limits how far the findings can be generalized beyond the sample.

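Assessing the accuracy of a model across different samples is known as cross-validation. There are two methods of cross-validation. The first is the adjusted R². It estimates the amount of variance that would be accounted for if the model had been derived from the population from which the sample was taken, so the difference between R² and the adjusted R² indicates the loss of predictive power. A formula often used for this is Stein's:

$$\text{adjusted } R^2 = 1 - \left[ \frac{n - 1}{n - k - 1} \cdot \frac{n - 2}{n - k - 2} \cdot \frac{n + 1}{n} \right] (1 - R^2)$$

Here $n$ is the number of cases and $k$ is the number of predictors.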

Another method is data splitting. This involves randomly splitting the sample data, estimating the model in both halves and comparing the resulting models.
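The adjusted R² can be sketched in code. Two common versions are shown, Wherry's and Stein's; which of the two the summary originally displayed is not recoverable from the text, so both are included as assumptions. The function names and numbers are made up:

```python
def adjusted_r2_wherry(r2, n, k):
    """Wherry's adjusted R-squared for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def adjusted_r2_stein(r2, n, k):
    """Stein's adjusted R-squared for n cases and k predictors."""
    factor = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - factor * (1 - r2)


# Illustrative numbers only: R-squared of .50 from 22 cases and 1 predictor.
r2, n, k = 0.5, 22, 1
wherry = adjusted_r2_wherry(r2, n, k)
stein = adjusted_r2_stein(r2, n, k)
print(wherry, stein)
```

Both values are below the unadjusted R², and the gap shrinks as the sample size grows relative to the number of predictors.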

SAMPLE SIZE AND THE LINEAR MODEL
The estimate of R depends on the number of independent variables and the sample size, and both influence the power of the model. The sample size needed depends in turn on the size of the effect you want to detect and on the precision you want.

MULTIPLE REGRESSION
The estimates of the regression coefficients depend upon the variables in the model and the order in which they are entered. Predictors should be chosen because including them makes sense, for example on the basis of past research; predictors that have not been studied before should be chosen based on theoretical importance. Adding predictors that are not relevant adds noise to the model.

The order of predictors does not matter if the predictors are completely uncorrelated. In hierarchical regression, predictors are selected based on past work and the researcher decides the order of entry, with predictors entered in order of importance. In forced entry, all predictors are forced into the model simultaneously.

Stepwise regression bases decisions about the order of the predictors purely on a mathematical criterion. In the forward method, the computer first searches for the best predictor, that is, the predictor with the highest simple correlation with the outcome, and then looks for the next predictor with the largest semi-partial correlation with the outcome. Predictors are added one at a time in this way. In the backward method, the model initially contains all the predictors, and the contribution of each is evaluated with the p-value of its t-test; predictors that do not contribute significantly are removed. One danger of stepwise regression is overfitting: if the sample size is sufficiently large, even trivial predictors will be significant.

A suppressor effect occurs when a predictor has a significant effect only when another variable is held constant. The backward method is less likely than the forward method to exclude predictors involved in suppressor effects.

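The improvements to the model at each stage can be assessed using R-squared. The significance of the change in R-squared (the new model versus the old model) can be calculated using the following formula:

$$F_{\text{change}} = \frac{(N - k_{\text{new}} - 1) R^2_{\text{change}}}{k_{\text{change}} (1 - R^2_{\text{new}})}$$

Here $N$ is the sample size, $k_{\text{new}}$ is the number of predictors in the new model, $k_{\text{change}}$ is the number of predictors added, $R^2_{\text{change}}$ is the increase in R-squared and $R^2_{\text{new}}$ is the R-squared of the new model.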
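The F-test for the change in R-squared can be sketched as follows, assuming the usual form in which F equals (N - k_new - 1) times the change in R-squared, divided by the number of added predictors times (1 - R-squared of the new model). The function name and the numbers are hypothetical:

```python
def f_change(r2_old, r2_new, k_old, k_new, n):
    """F-statistic for the improvement of the new model over the old model."""
    r2_change = r2_new - r2_old
    k_change = k_new - k_old          # number of predictors added
    return ((n - k_new - 1) * r2_change) / (k_change * (1 - r2_new))


# Illustrative numbers only: 103 cases, going from 1 to 3 predictors.
f = f_change(r2_old=0.30, r2_new=0.40, k_old=1, k_new=3, n=103)
print(f)
```

The resulting F would be compared against an F-distribution with k_change and N - k_new - 1 degrees of freedom.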

Perfect collinearity exists when at least one predictor is a perfect linear combination of the others (e.g. two predictors are perfectly correlated). Three problems arise as collinearity increases:

  1. Untrustworthy bs
    The standard errors of the b coefficients increase as collinearity increases. This means more variability and therefore (1) a greater chance of unstable predictor equations across samples and (2) coefficients that are unrepresentative of the population.
  2. It limits the size of R
    When predictors account for much of the same variance in the outcome, adding another of them increases R very little. Predictors should therefore account for unique variance.
  3. Importance of predictors
    It is difficult to assess the importance of a predictor when there is multicollinearity.

The variance inflation factor (VIF) indicates whether a predictor has a strong linear relationship with the other predictors. The tolerance statistic (1/VIF) does the same. There are some guidelines:

  1. If the largest VIF is > 10, then there is a strong relationship.
  2. If the average VIF is > 1, then the regression may be biased.
  3. Tolerance below 0.2 indicates a potential problem.

The standardized beta values are relevant for assessing the importance of each predictor. The bigger the absolute value, the more important the predictor is.

It is useful to calculate the average VIF:

$$\overline{VIF} = \frac{\sum_{i=1}^{k} VIF_i}{k}$$

‘k’ denotes the number of predictors.
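A minimal sketch of the VIF, the tolerance and the average VIF, assuming exactly two predictors. In that case the R-squared from regressing one predictor on the other is simply their squared Pearson correlation, so both predictors share the same VIF. The function names and the data are made up:

```python
def pearson_r(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Illustrative data only.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
tolerance = 1 / vif
vifs = [vif, vif]                      # one VIF per predictor
average_vif = sum(vifs) / len(vifs)    # sum of the VIFs divided by k
print(vif, tolerance, average_vif)
```

With more than two predictors each VIF needs the R-squared from regressing that predictor on all the others, which this sketch does not cover.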

Follow the author: JesperN