How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

14.1 What strategies are available for selecting a model?

Three basic rules for selecting variables to add to a model are:

  1. Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables

  2. Add enough variables for good predictive power

  3. Keep the model simple

The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all candidate variables in the model and removes, one at a time, the variable with the largest P-value, refitting after each removal until all remaining variables are significant. Forward selection starts with no variables and adds, one at a time, the variable with the smallest P-value until no remaining candidate is significant. Stepwise regression is a variant of forward selection that also removes variables that become redundant once new variables are added.

Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

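Several criteria indicate a good model. To find a model with high predictive power but without an overabundance of variables, the adjusted R² is used. For n observations and p explanatory variables:

adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)

Unlike R², the adjusted R² decreases when an unnecessary variable is added.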
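
The penalty for extra variables can be illustrated with a minimal sketch. It assumes the common form adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1); the function name and the numbers are made up for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for a model with p explanatory variables and n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters (n > p + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers: a fifth variable lifts R² only from 0.500 to 0.502.
before = adjusted_r2(0.500, n=50, p=4)
after = adjusted_r2(0.502, n=50, p=5)
print(before, after)
```

In this made-up case the ordinary R² rises slightly while the adjusted R² falls, which is the signal that the extra variable is not worth its cost.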

Cross-validation checks how well the model predicts observations that were not used to fit it. In leave-one-out cross-validation, each observation i is predicted from the model fitted to the other n - 1 observations, giving the prediction ŷ(i). The result is the predicted residual sum of squares (PRESS):

PRESS = Σ (yi - ŷ(i))²

The smaller PRESS is, the better the predictions. However, this criterion is based on squared errors and therefore suits models that assume a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷi tends to be as close as possible to E(yi). The smaller the AIC, the better the model.
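
As a rough sketch of how PRESS can be computed, the code below refits a straight line n times, each time leaving one observation out and predicting it from the rest. It assumes a single explanatory variable; the helper names `fit_line` and `press` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(xs, ys):
    """Sum of squared errors when each point is predicted without itself."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
print(press(xs, ys))
```

PRESS is never smaller than the ordinary residual sum of squares, because each point is predicted by a model that has not seen it.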

14.2 How can you tell when a statistical model doesn't fit?

Inference about the parameters of a regression model rests on the following assumptions:

  • The model fits the shape of the data

  • The conditional distribution of y is normal

  • The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)

  • The data come from a random sample

Large violations of these assumptions can make the inference unreliable.

When y has a normal distribution, the residuals do too. A studentized residual is a standardized version: the residual divided by its standard error. It indicates whether a residual is larger than would be expected from sampling variability alone. A studentized residual with an absolute value above about 3 may indicate an outlier.

The randomization in longitudinal research may be limited when the observations within a certain time frame are strongly correlated. A plot of the residuals against time can check this. This kind of correlation distorts most statistics. In longitudinal research, often conducted within social science and in a relatively short time frame, a linear mixed model is used. However, when research involves time series and a longer time frame, econometric methods are more appropriate.

Many statistics measure the effect of outliers. The residual measures how far y is from the trend. The leverage (h) measures how far an observation's explanatory variables are from their means. Observations with both a large residual and a high leverage have a big influence on the fit.

DFBETA describes the effect of an observation on the parameter estimates. DFFIT and Cook's distance describe how much the fitted values change when a certain observation is omitted.
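
Leverage and studentized residuals can be sketched for the simplest case of one explanatory variable. The code assumes the textbook formulas h = 1/n + (x - x̄)² / Σ(x - x̄)² for the leverage and e / (s·√(1 - h)) for the studentized residual; the function name and data are invented for illustration.

```python
import math


def leverage_and_studentized(xs, ys):
    """Leverage and studentized residual of each observation (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    leverages = [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]
    studentized = [e / (s * math.sqrt(1.0 - h))
                   for e, h in zip(residuals, leverages)]
    return leverages, studentized


# Made-up data: the last observation lies far from the other x values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 25.0]
lev, stud = leverage_and_studentized(xs, ys)
print(lev)
print(stud)
```

An observation is a candidate for closer inspection when it combines a high leverage with a large absolute studentized residual.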

14.3 How do you detect multicollinearity and what are its consequences?

When many explanatory variables are strongly correlated with each other, R² increases only slightly when more variables are added. This is called multicollinearity. It inflates the variance of the parameter estimates, so the standard errors increase and the confidence intervals become wider. This is measured by the variance inflation factor (VIF), the multiplicative increase in the variance of an estimate that is caused by the correlation among the explanatory variables:

VIF = 1 / (1 - Rj²)

in which Rj² is the R² from regressing explanatory variable xj on the other explanatory variables.
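
A minimal sketch of the VIF for the special case of two explanatory variables follows. It assumes the standard definition VIF = 1 / (1 - Rj²); with only two explanatory variables, Rj² equals the squared correlation between them. The function names and data are hypothetical.

```python
import math


def pearson_r(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up data: x2 is almost a copy of x1, so the VIF is large.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]
print(vif_two_predictors(x1, x2))
```

A VIF of 1 means the explanatory variables are uncorrelated; the larger the value, the more the correlation inflates the variance of the estimates.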

Even without the VIF, indications of multicollinearity are visible in the fitted equation, for example large standard errors or coefficients that change a lot when a variable is added or removed. Remedies are using only a subset of the variables, combining variables or centering them. With factor analysis, new artificial variables are created from the existing ones so that they are uncorrelated. But usually factor analysis isn't necessary.

14.4 What are the characteristics of generalized linear models?

Generalized linear models (GLMs) is a broad term that includes regression models with a normal distribution, alternative models for continuous variables without a normal distribution, and models for discrete variables.

The response variable of a GLM is often binary and sometimes a count. For binary data the GLM uses the binomial distribution, and for counts the Poisson or negative binomial distribution. For continuous data that can't be negative, the GLM can use the gamma distribution.

A GLM has a link function: an equation that connects the mean of the response variable to the explanatory variables: g(μ) = α + β1x1 + β2x2 + … + βpxp. When the mean can't be negative, the log link is used, which gives loglinear models: log(μ) = α + β1x1 + β2x2 + … + βpxp. A logistic regression model uses the logit link: g(μ) = log[μ / (1 - μ)]. This is useful when μ is between 0 and 1, as for a probability. The simplest is the identity link: g(μ) = μ.
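
The three link functions can be written out directly. The sketch below also shows the inverse links, which turn a value of the linear predictor α + β1x1 back into a mean; the coefficients and function names are made up for illustration.

```python
import math


def identity_link(mu):
    return mu


def log_link(mu):
    return math.log(mu)            # defined only for mu > 0


def logit_link(mu):
    return math.log(mu / (1.0 - mu))  # defined only for 0 < mu < 1


def inverse_log(eta):
    return math.exp(eta)           # always positive


def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))  # always between 0 and 1


# Hypothetical coefficients and one value of a single explanatory variable.
alpha, beta1, x1 = -1.0, 0.5, 3.0
eta = alpha + beta1 * x1           # linear predictor
print(identity_link(eta), inverse_log(eta), inverse_logit(eta))
```

Whatever value the linear predictor takes, the inverse log link returns a positive mean and the inverse logit link returns a mean between 0 and 1, which is why these links suit counts and probabilities.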

Because a GLM is fitted with the maximum likelihood method for the chosen distribution, the data don't need to have a normal distribution. The maximum likelihood estimates are computed with a weighted version of least squares, which gives more weight to observations with smaller variability.

A gamma distribution allows the standard deviation to vary with the mean: the standard deviation increases when the mean increases. A standard deviation that isn't constant is called heteroscedasticity. For the gamma distribution the variance is φμ² and the standard deviation is:

σ = √φ · μ

φ is the scale parameter; it determines the shape of the distribution (for instance a bell shape).

14.5 What is polynomial regression?

When the relationship is not linear but, for instance, curvilinear, a polynomial regression function is used, in which the highest power of x is called the degree. With degree 2 this is the quadratic regression model E(y) = α + β1x + β2x², whose graph is a parabola.

A cubic function is a polynomial function of degree three, but usually a function of degree two suffices. For a straight line the slope stays the same, but in a polynomial function it changes. When the coefficient of x² is positive, the curve is U-shaped. When the coefficient is negative, the curve is shaped like an inverted U. The highest or lowest point of the parabola is at x = -β1 / (2β2).
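
The turning point of a quadratic regression function follows from setting the slope β1 + 2β2x to zero. A small sketch, with a hypothetical function name and made-up coefficients:

```python
def turning_point(beta1: float, beta2: float):
    """Location and type of the turning point of alpha + beta1*x + beta2*x**2."""
    if beta2 == 0:
        raise ValueError("beta2 = 0 gives a straight line without a turning point")
    x = -beta1 / (2.0 * beta2)
    kind = "minimum (U-shaped)" if beta2 > 0 else "maximum (inverted U)"
    return x, kind


# Hypothetical fit: E(y) = 10 + 4x - 0.5x²
x_top, kind = turning_point(beta1=4.0, beta2=-0.5)
print(x_top, kind)
```

The intercept α does not affect where the turning point lies, only how high the curve is there.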

In this kind of model, R² is the proportional reduction in prediction error from using the quadratic function instead of the sample mean to predict y. Comparing this R² with r² for the straight-line model indicates how much better the quadratic function fits. The null hypothesis that the quadratic term doesn't add to the model can be tested: H0: β2 = 0.

Conclusions should be drawn carefully; sometimes other shapes are possible too. Parsimony should be the goal: a model shouldn't have more parameters than necessary.

14.6 What do exponential regression and log transforms look like?

An exponential regression function is E(y) = αβ^x. It only takes positive values and either keeps increasing (β > 1) or keeps decreasing (β < 1). The logarithm of the mean is linear in x: log(μ) = log α + (log β)x. In this model, β is the multiplicative change in the mean of y for an increase of 1 in x. When a nonlinear relationship needs to be turned into a linear one, log transforms can be used; they linearize the relationship.
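
One way to see the linearization at work is to fit a straight line to log(y) and transform the coefficients back. This is only a sketch: least squares on log(y) models the mean of log(y), which is not exactly the same as a GLM with a log link that models log(μ). The function name and data are made up.

```python
import math


def fit_exponential(xs, ys):
    """Estimate alpha and beta in y ≈ alpha * beta**x via a line through log(y)."""
    log_ys = [math.log(y) for y in ys]   # requires all y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(log_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, log_ys))
    slope = sxl / sxx                    # estimates log(beta)
    intercept = mean_l - slope * mean_x  # estimates log(alpha)
    return math.exp(intercept), math.exp(slope)


# Made-up data that doubles with every unit increase in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 12.0, 24.0, 48.0]
alpha_hat, beta_hat = fit_exponential(xs, ys)
print(alpha_hat, beta_hat)
```

Here the estimated β is the factor by which y is multiplied for each increase of 1 in x.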

14.7 What are robust variance and nonparametric regression?

Robust variance estimation makes it possible to repair regression models so that they can handle violations of assumptions. This method uses the least squares line but doesn't assume constant variance when finding the standard errors. Instead, the standard errors are adjusted to the variability observed in the sample data. This is called the sandwich estimate or robust standard error estimate. Software can often calculate these standard errors, which can then be compared to the regular standard errors. If they differ a lot, the assumptions are badly violated. Robust variance estimates can also be applied to strongly correlated data such as clusters. Then generalized estimating equations (GEE) are used: the estimates solve equations that resemble maximum likelihood equations, but without a full parametric probability distribution that incorporates the correlations.
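
For a straight line with one explanatory variable, the comparison between regular and robust standard errors can be sketched by hand. The code assumes the simplest sandwich form (often labelled HC0), in which the variance of the slope is Σ(x - x̄)²e² / [Σ(x - x̄)²]²; statistical software may apply small-sample corrections that give slightly different numbers. The function name and data are hypothetical.

```python
import math


def slope_standard_errors(xs, ys):
    """Slope with its regular and its sandwich (HC0) standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Regular standard error: one common variance for all observations.
    s2 = sum(e * e for e in residuals) / (n - 2)
    regular = math.sqrt(s2 / sxx)
    # Sandwich standard error: each squared residual acts as its own variance.
    meat = sum((x - mean_x) ** 2 * e * e for x, e in zip(xs, residuals))
    robust = math.sqrt(meat) / sxx
    return slope, regular, robust


# Made-up data for illustration.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.0, -1.0, -2.0, 1.0, 3.0]
slope, regular, robust = slope_standard_errors(xs, ys)
print(slope, regular, robust)
```

The slope estimate itself is the ordinary least squares slope in both cases; only the standard error attached to it changes.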

A more recently developed nonparametric method is generalized additive modelling, a generalization of the generalized linear model. It smooths the data, and the smoothed curve exposes larger trends. Popular smoothers are LOESS and kernel smoothing.
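
Kernel smoothing can be sketched as a weighted average in which observations close to the point of interest count most. The example assumes a Gaussian kernel and a hand-picked bandwidth; the function name, bandwidth and data are made up, and real software offers more refined smoothers such as LOESS.

```python
import math


def kernel_smooth(xs, ys, x0, bandwidth):
    """Weighted average of ys, with Gaussian weights centred at x0."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Made-up noisy data with an upward trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.2, 0.7, 2.9, 2.1, 4.4, 3.8, 6.1, 5.2, 7.9, 7.0]
smoothed = [kernel_smooth(xs, ys, x0, bandwidth=1.5) for x0 in xs]
print(smoothed)
```

A larger bandwidth averages over more neighbours and gives a smoother curve; a smaller bandwidth follows the individual observations more closely.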

Author: Annemarie JoHo