Summary of Discovering statistics using IBM SPSS statistics by Andy Field - 5th edition
Any straight line can be defined by its slope (1) and the point at which it crosses the vertical axis of the graph, the intercept (2). The general formula for the linear model is the following:

Y_i = b0 + b1*X_i + e_i

where b0 is the intercept, b1 the slope, and e_i the error (residual) for case i.
Regression analysis refers to fitting a linear model to data and using it to predict values of an outcome variable (dependent variable) from one or more predictor variables (independent variables). The residuals are the differences between what the model predicts and the actual outcomes. The residual sum of squares is used to assess the goodness-of-fit of the model to the data: the smaller the residual sum of squares, the better the fit.
Ordinary least squares regression refers to estimating the regression model for which the sum of squared errors is the minimum it can be given the data. The total sum of squares (SS_T) is the sum of squared differences between the observed scores and the mean; it represents how good the mean is as a model of the observed outcome scores. The model sum of squares (SS_M) represents how well the model can predict the data: the larger SS_M, the bigger the improvement over the mean. The residual sum of squares (SS_R) uses the differences between the observed data and the model's predictions and shows how much of the data the model cannot explain.
The proportion of improvement due to the model, compared to using the mean as a predictor, can be calculated using the following formula:

R² = SS_M / SS_T
This value represents the amount of variance in the outcome explained by the model (SS_M) relative to how much variation there was to explain (SS_T). The F-statistic can be calculated using the following formulas:

MS_M = SS_M / k,  MS_R = SS_R / (N - k - 1),  F = MS_M / MS_R
‘k’ represents the model degrees of freedom and denotes the number of predictors; N is the sample size, so the residual degrees of freedom are N - k - 1.
The F-statistic can also be used to test the significance of R², with the null hypothesis being that R² is zero. It uses the following formula:

F = ((N - k - 1) × R²) / (k × (1 - R²))
Individual predictors can be tested using the t-statistic, t = b / SE_b, which tests the null hypothesis that the predictor's b is zero.
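To make these quantities concrete, here is a minimal Python sketch (the book itself uses SPSS; the data and variable names are made up for illustration) that fits a one-predictor model by ordinary least squares and computes the sums of squares, R², F, and the t-statistic for the slope:

```python
import numpy as np

# Hypothetical data: one predictor, 30 cases (illustration only)
rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = 2.0 + 0.8 * x + rng.normal(size=30)

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b[0] = intercept, b[1] = slope
y_hat = X @ b

n, k = len(y), 1                            # sample size, number of predictors

ss_t = np.sum((y - y.mean()) ** 2)          # total SS: the mean as a model
ss_r = np.sum((y - y_hat) ** 2)             # residual SS: what the model misses
ss_m = ss_t - ss_r                          # model SS: improvement over the mean

r2 = ss_m / ss_t                            # proportion of variance explained
f = (ss_m / k) / (ss_r / (n - k - 1))       # F = MS_M / MS_R

# t-test for the slope: t = b / SE(b), null hypothesis b = 0
mse = ss_r / (n - k - 1)
se_b1 = np.sqrt(mse * np.linalg.inv(X.T @ X)[1, 1])
print(f"R^2 = {r2:.3f}, F = {f:.2f}, t = {b[1] / se_b1:.2f}")
```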
BIAS IN LINEAR MODELS
An outlier is a case that differs substantially from the main trend in the data. Standardized residuals (residuals converted to z-scores) can be used to check which residuals are unusually large and may be outliers. Rules of thumb:

1. Standardized residuals with an absolute value greater than 3.29 are considered outliers.
2. If more than 1% of cases have standardized residuals with an absolute value greater than 2.58, the level of error in the model may be unacceptable.
3. If more than 5% of cases have standardized residuals with an absolute value greater than 1.96, the model may be a poor representation of the data.
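These rules of thumb can be checked with a few lines of Python (a simple z-score version of standardized residuals; function and variable names are illustrative):

```python
import numpy as np

def residual_checks(y, y_hat):
    """Apply the three standardized-residual rules of thumb above."""
    resid = y - y_hat
    z = resid / resid.std(ddof=1)           # residuals converted to z-scores
    print("outliers (|z| > 3.29):", np.where(np.abs(z) > 3.29)[0])
    print("% of cases with |z| > 2.58:", 100 * np.mean(np.abs(z) > 2.58))
    print("% of cases with |z| > 1.96:", 100 * np.mean(np.abs(z) > 1.96))
```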
The studentized residual is the unstandardized residual divided by an estimate of its standard deviation. These residuals have the same properties as standardized residuals but provide a more precise estimate of the error variance of a specific case.
Influential cases are cases that exert undue influence over the parameters of the model. To test for influential cases, the analysis can be rerun with a case excluded, to see how different the regression coefficients become.
The adjusted predicted value for a case is the predicted value of the outcome for that case from a model in which the case is excluded. The deleted residual is the difference between the adjusted predicted value and the original observed value. This can be divided by the standard error to give the studentized deleted residual. This residual can be compared across different regression analyses. Cook’s distance is a measure of the overall influence of a case on the model. The leverage assesses the influence of the observed value of the outcome variable over the predicted values.
The average leverage can be calculated as (k + 1)/N, where k is the number of predictors and N is the sample size. Leverage values range from 0 (the case has no influence on the prediction) to a maximum of 1 (the case has complete influence over the prediction).
If no cases exert undue influence over the model, all leverage values should be close to the average. Values greater than two or three times the average leverage should be investigated.
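A minimal sketch of computing leverage and Cook's distance directly from the design matrix (assuming X already contains a column of ones for the intercept; a common rule of thumb is that Cook's distances above 1 deserve attention):

```python
import numpy as np

def influence_measures(X, y):
    """Leverage (hat values) and Cook's distance for each case."""
    n, p = X.shape                          # p = k + 1 parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # leverage of each case
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    cooks = (resid ** 2 / (p * mse)) * h / (1 - h) ** 2
    avg_leverage = p / n                    # (k + 1) / N
    flagged = np.where(h > 2 * avg_leverage)[0]  # twice-the-average rule
    return h, cooks, flagged
```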
The Mahalanobis distance measures the distance of a case from the means of the predictor variables. These values follow a chi-square distribution (with degrees of freedom equal to the number of predictors), so cases whose distance exceeds the critical value at a chosen alpha can be flagged as potentially influential.
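A sketch of this check in Python, using SciPy for the chi-square critical value (X_pred holds the predictor columns only; alpha = .001 is a conventional choice and the names are illustrative):

```python
import numpy as np
from scipy import stats

def mahalanobis_flags(X_pred, alpha=0.001):
    """Distance of each case from the predictor means, with chi-square cutoff."""
    mu = X_pred.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_pred, rowvar=False))
    diff = X_pred - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared distances
    cutoff = stats.chi2.ppf(1 - alpha, df=X_pred.shape[1])
    return d2, np.where(d2 > cutoff)[0]                  # flagged case indices
```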
There are several assumptions of the general linear model: additivity and linearity, independent errors, homoscedasticity, normally distributed errors, predictors uncorrelated with external variables, correct variable types, no perfect multicollinearity, and non-zero variance in the predictors.
Violating most assumptions has consequences only for significance tests and confidence intervals, which in turn limits the generalizability of the findings.
Assessing the accuracy of a model across different samples is known as cross-validation. There are two methods of cross-validation. The adjusted R² is the amount of variance that would be accounted for if the model had been derived from the population from which the sample was taken; it indicates the loss of predictive power (shrinkage). Stein's formula can be used:

adjusted R² = 1 - [((N - 1)/(N - k - 1)) × ((N - 2)/(N - k - 2)) × ((N + 1)/N)] × (1 - R²)
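A small sketch of this calculation (assuming Stein's formula as given above; the input numbers are made up):

```python
def stein_adjusted_r2(r2, n, k):
    """Stein's formula: estimated R^2 if the model came from the population."""
    factor = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - factor * (1 - r2)

# e.g. R^2 = .50 from N = 100 cases and k = 3 predictors
print(round(stein_adjusted_r2(0.50, 100, 3), 3))  # below .50: shrinkage
```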
Another method is data splitting. This involves randomly splitting the sample data, estimating the model in both halves and comparing the resulting models.
SAMPLE SIZE AND THE LINEAR MODEL
The estimate of R depends on the number of predictors and the sample size, which influences the power of the model. The desired effect size and the required precision determine the sample size needed.
MULTIPLE REGRESSION
The estimates of the regression coefficients depend on the variables in the model and the order in which they are entered. Predictors should be chosen based on whether they are sensible; predictors that have not been studied before should be chosen based on theoretical importance. Adding predictors that are not relevant will add noise to the model.
The order of predictors does not matter if the predictors are completely uncorrelated. Hierarchical regression is a regression analysis in which predictors are selected based on past work and entered in order of importance. Forced entry means forcing all predictors into the model simultaneously.
Stepwise regression bases decisions about the order of predictors purely on a mathematical criterion. In the forward method, the computer first selects the best predictor (the one with the highest simple correlation with the outcome) and then adds the predictor with the largest semi-partial correlation with the outcome, and so on. In the backward method, the model initially contains all predictors and the contribution of each is evaluated with the p-value of its t-test; non-contributing predictors are removed. One danger of stepwise regression is overfitting: with a sufficiently large sample, even trivial predictors become significant.
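A toy sketch of the forward method (greedy selection by R² improvement, which selects the same predictor as the largest semi-partial correlation; this is an illustration, not the book's SPSS procedure):

```python
import numpy as np

def r2_of(X, y):
    """R^2 of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, n_steps):
    """Greedily add the predictor that most improves R^2 at each step."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_steps):
        best_r2, best_j = max((r2_of(X[:, chosen + [j]], y), j) for j in remaining)
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen
```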
A suppressor effect occurs when a predictor has a significant effect only when another variable is held constant. The risk of overlooking such effects can be minimized by using the backward method.
The improvement of the model at each stage can be assessed using R². The significance of the change in R² (the new model versus the old model) can be calculated using the following formula:

F_change = ((N - k_new - 1) × R²_change) / (k_change × (1 - R²_new))

where k_new is the number of predictors in the new model and k_change is the number of predictors added.
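As a sketch, this test is a one-liner once the two nested models have been fitted (parameter names are illustrative):

```python
def f_change(r2_old, r2_new, k_old, k_new, n):
    """F-test for the change in R^2 between two nested models."""
    df_change = k_new - k_old               # number of predictors added
    df_resid = n - k_new - 1                # residual df of the new model
    return (df_resid * (r2_new - r2_old)) / (df_change * (1 - r2_new))
```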
Perfect collinearity exists when at least one predictor is a perfect linear combination of the others (e.g. predictors one and two are perfectly correlated). There are three problems as collinearity increases: (1) the estimates of the b coefficients become untrustworthy because their standard errors increase, (2) the size of R is limited, and (3) it becomes difficult to assess the individual importance of predictors.
The variance inflation factor (VIF) indicates whether a predictor has a strong linear relationship with the other predictors; the tolerance statistic (1/VIF) conveys the same information. There are some guidelines: a VIF greater than 10 is cause for concern, an average VIF substantially greater than 1 suggests the regression may be biased, a tolerance below 0.2 indicates a potential problem, and a tolerance below 0.1 indicates a serious problem.
The standardized beta values are relevant for assessing the importance of each predictor. The bigger the absolute value, the more important the predictor is.
It is useful to calculate the average VIF value:

average VIF = (Σ VIF_i) / k
‘k’ denotes the number of predictors.
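A sketch of computing VIF (and hence tolerance) by regressing each predictor on the others (X_pred holds the predictor columns only; names are illustrative):

```python
import numpy as np

def vif_values(X_pred):
    """VIF_j = 1 / (1 - R^2) from regressing predictor j on the rest."""
    vifs = []
    for j in range(X_pred.shape[1]):
        target = X_pred[:, j]
        others = np.delete(X_pred, j, axis=1)
        X1 = np.column_stack([np.ones(len(target)), others])
        b, *_ = np.linalg.lstsq(X1, target, rcond=None)
        resid = target - X1 @ b
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)                   # tolerance = 1 / VIF

# average VIF = sum of the VIFs divided by k, e.g. vif_values(X_pred).mean()
```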