How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

14.1 What strategies are available for selecting a model?
14.2 How can you tell when a statistical model doesn't fit?
14.3 How do you detect multicollinearity and what are its consequences?
14.4 What are the characteristics of generalized linear models?
14.5 What is polynomial regression?
14.6 What do exponential regression and log transforms look like?
14.7 What are robust variance and nonparametric regression?

14.1 What strategies are available for selecting a model?

Three basic rules for selecting variables to add to a model are:

Select variables that serve the theoretical purpose of the study (testing the null hypothesis), with sensible control variables and mediating variables
Include enough variables for good predictive power
Keep the model simple

The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all candidate variables in the model and removes, one at a time, the variable with the largest P-value, until every remaining variable is significant. Forward selection starts from scratch and adds, one at a time, the variable with the smallest P-value, until no remaining candidate is significant. Stepwise regression is a variation on forward selection that also removes variables that have become redundant once new variables are added.

Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

Several criteria indicate a good model. To find a model with good predictive power but without an overabundance of variables, the adjusted R² is used:

$R_{adj}^2 = \frac{s_y^2-s^2}{s_y^2}$

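Here s_y² is the sample variance of y and s² is the estimated conditional variance of y given the explanatory variables. Unlike the ordinary R², the adjusted R² can decrease when an unnecessary variable is added.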
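As a rough illustration of the formula above, the sketch below computes the adjusted R² for a straight-line model. The data and the helper names `fit_line` and `adjusted_r2` are made up for this example; s² is taken as SSE / (n - (p + 1)), with p the number of explanatory variables.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def adjusted_r2(y, fitted, p):
    """(s_y^2 - s^2) / s_y^2 for a model with p explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    s_y2 = tss / (n - 1)          # sample variance of y
    s2 = sse / (n - (p + 1))      # estimated conditional variance
    return (s_y2 - s2) / s_y2


# Hypothetical data, invented for this illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
r2_adj = adjusted_r2(y, fitted, p=1)
print(round(r2_adj, 4))
```

Because s² divides by n - (p + 1), every extra variable costs one degree of freedom, which is why the adjusted R² penalizes variables that barely reduce the error.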

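Cross-validation checks how well the model predicts observations that were not used to fit it. Each observation is left out in turn, the model is fitted to the remaining observations, and the omitted value is predicted. The result is the predicted residual sum of squares (PRESS):

$PRESS = \sum (y_i - \hat{y}_{(i)})^2$

Here $\hat{y}_{(i)}$ is the prediction for observation i from the model fitted without that observation. The smaller the PRESS, the better the predictions. However, this criterion is most sensible when y is approximately normally distributed. A method that can handle other distributions is the Akaike information criterion (AIC), which favors the model whose fitted values ŷ_i tend to be closest to the true means E(y_i). The smaller the AIC, the better the model.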
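A minimal sketch of PRESS as a leave-one-out calculation, assuming a straight-line model with one explanatory variable. The data and the function names are hypothetical; each observation is predicted from a model fitted without it.

```python
def fit_line(x, y):
    """Least squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def press(x, y):
    """Sum of squared prediction errors, each from a fit that omits the point."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        a, b = fit_line(x_rest, y_rest)
        total += (y[i] - (a + b * x[i])) ** 2
    return total


def sse(x, y):
    """Ordinary residual sum of squares from the fit to all observations."""
    a, b = fit_line(x, y)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, invented for this illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
press_value = press(x, y)
sse_value = sse(x, y)
print(press_value, sse_value)
```

PRESS is never smaller than the ordinary residual sum of squares, because a model predicts the points it was fitted to better than points it has not seen.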

14.2 How can you tell when a statistical model doesn't fit?

Inference of parameters in a regression model has the following assumptions:

The regression function has the form specified by the model (for instance, linear)
The conditional distribution of y is normal
The standard deviation is constant across the range of values of the explanatory variables (this is called homoscedasticity)
The sample is selected randomly

Large violations of these assumptions can invalidate the inference.

When y has a normal distribution, the residuals do too. A studentized residual is a standardized version: the residual divided by its standard error. It expresses how large the residual is relative to what sampling variability alone would produce. A studentized residual exceeding about 3 in absolute value may indicate an outlier.

In longitudinal research the randomness assumption may be violated, because observations that are close together in time tend to be strongly correlated. A plot of the residuals against time can check this. This kind of correlation distorts most statistics, in particular the standard errors. For longitudinal research, which in the social sciences often covers a relatively short time frame, a linear mixed model is used. When the research involves time series over a longer time frame, econometric methods are more appropriate.

Many statistics measure the influence of individual observations such as outliers. The residual measures how far y is from the trend. The leverage (h) measures how far the values of the explanatory variables are from their means. An observation with both a large residual and a high leverage has a big influence on the fit.

DFBETA describes the effect of an observation on the estimates of the parameters. DFFIT and Cook's distance describe how much the fit of the model changes when a certain observation is omitted.
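The leverage and the studentized residual can be sketched for a straight-line model as follows. This is an illustration under stated assumptions: the data are invented, the function name `diagnostics` is hypothetical, and the residual is divided by s·sqrt(1 - h) (the internally studentized version; software may use a slightly different definition).

```python
import math


def diagnostics(x, y):
    """Residuals, leverages and studentized residuals for a straight-line fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage grows with the squared distance of x from its mean.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    studentized = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]
    return residuals, leverage, studentized


# Hypothetical data: the last observation lies far from the other x values.
x = [1, 2, 3, 4, 5, 6, 15]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 20.0]
residuals, leverage, studentized = diagnostics(x, y)
for xi, h, t in zip(x, leverage, studentized):
    print(xi, round(h, 3), round(t, 2))
```

The observation with x = 15 gets by far the highest leverage. Whether it is also influential depends on how large its residual is.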

14.3 How do you detect multicollinearity and what are its consequences?

When many explanatory variables are strongly correlated with each other, R² increases only slightly when more variables are added. This is called multicollinearity. It inflates the variances of the estimated coefficients, so the standard errors increase and the confidence intervals become wider. This is measured by the variance inflation factor (VIF), the multiplicative increase in the variance of the estimate for x_j that is caused by its correlation with the other explanatory variables:

$VIF = 1/(1-R_j^2)$

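Here R_j² is the R² from the regression of x_j on the other explanatory variables. Even without the VIF, indications of multicollinearity are visible in the results, for instance standard errors that increase sharply or coefficients that change a lot when a variable is added. Remedies are choosing only some of the variables, combining variables or centering them. Factor analysis creates new, artificial variables from the existing variables in such a way that they are uncorrelated. But usually factor analysis isn't necessary.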
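A small sketch of the VIF for the special case of two explanatory variables, where the R² from regressing one on the other equals their squared correlation. The data and the function names are hypothetical.

```python
import math


def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R_j^2); with two predictors R_j^2 is the squared correlation."""
    r = correlation(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical data, invented for this illustration.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]   # almost a copy of x1
x3 = [3, 1, 4, 1, 5, 2]               # only weakly related to x1

vif_high = vif_two_predictors(x1, x2)
vif_low = vif_two_predictors(x1, x3)
print(round(vif_high, 1), round(vif_low, 3))
```

A VIF close to 1 means the correlation hardly inflates the variance. With more than two explanatory variables, R_j² has to come from a multiple regression of x_j on all the others.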

14.4 What are the characteristics of generalized linear models?

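Generalized linear model (GLM) is a broad term that covers regression models for a normally distributed response, alternative models for continuous response variables that are not normally distributed, and models for discrete response variables.

A discrete response in a GLM is often binary and sometimes a count. A binary response uses the binomial distribution and a count uses the Poisson or negative binomial distribution. When the response is continuous but positive and skewed, the GLM can use the gamma distribution.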

A GLM has a link function, an equation that connects the mean of the response variable to the explanatory variables: g(μ) = α + β₁x₁ + β₂x₂ + … + β_px_p. When the mean can't be negative, the log link is used, which gives loglinear models: log(μ) = α + β₁x₁ + β₂x₂ + … + β_px_p. A logistic regression model uses the logit link: g(μ) = log[μ /(1-μ)]. This is useful when μ is between 0 and 1, as for a probability. The simplest is the identity link: g(μ) = μ.
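The three link functions can be written out as a short sketch. The coefficient values and the function names below are hypothetical; the point is only that the same linear predictor α + βx leads to a different mean under each link.

```python
import math


def identity_link(mu):
    return mu


def log_link(mu):
    return math.log(mu)


def logit_link(mu):
    return math.log(mu / (1 - mu))


def inverse_log(eta):
    """Mean implied by the log link: always positive."""
    return math.exp(eta)


def inverse_logit(eta):
    """Mean implied by the logit link: always between 0 and 1."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients and x value, invented for this illustration.
alpha, beta = -1.0, 0.5
x = 3.0
eta = alpha + beta * x            # linear predictor

mean_identity = eta               # identity link: mean equals the predictor
mean_log = inverse_log(eta)       # log link
mean_logit = inverse_logit(eta)   # logit link
print(mean_identity, mean_log, mean_logit)
```

Applying each link to the mean it produced returns the linear predictor again, which is what "connecting the mean to the explanatory variables" amounts to.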

Because a GLM is fitted with the maximum likelihood method for the chosen distribution, the data don't need to have a normal distribution. In practice the maximum likelihood estimates are computed with a weighted version of least squares, which gives more weight to observations with smaller variability.

A gamma distribution allows the standard deviation to change with the mean. This is a form of heteroscedasticity: the standard deviation increases when the mean increases. The variance is φμ² and the standard deviation is:

$\sqrt{\phi}\mu$

φ is the scale parameter. It determines the shape of the distribution (for instance close to a bell shape or strongly skewed).

14.5 What is polynomial regression?

When a relationship is not linear but curvilinear, a polynomial regression function can be used: E(y) = α + β₁x + β₂x², in which the highest power is called the degree. A polynomial regression function of degree two is a quadratic regression model, whose graph is a parabola.

A cubic function is a polynomial function of degree three, but usually a function of degree two suffices. For a straight line the slope stays the same, but for a polynomial function it changes with x. When the coefficient of x² is positive, the curve is U-shaped. When the coefficient is negative, the curve is shaped like an inverted U. The highest or lowest point of the parabola is at x = -β₁ / (2β₂).
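A brief sketch of the turning point of a quadratic regression function. The coefficients are hypothetical and the function name is made up for this example; a positive coefficient of x² gives a minimum (U shape) and a negative one gives a maximum (inverted U).

```python
def parabola_turning_point(alpha, beta1, beta2):
    """Location, height and type of the turning point of alpha + beta1*x + beta2*x^2."""
    if beta2 == 0:
        raise ValueError("beta2 = 0 gives a straight line without a turning point")
    x_star = -beta1 / (2 * beta2)
    y_star = alpha + beta1 * x_star + beta2 * x_star ** 2
    shape = "U-shaped (minimum)" if beta2 > 0 else "inverted U (maximum)"
    return x_star, y_star, shape


# Hypothetical coefficients, invented for this illustration.
x_star, y_star, shape = parabola_turning_point(alpha=10.0, beta1=4.0, beta2=-0.5)
print(x_star, y_star, shape)   # 4.0 18.0 inverted U (maximum)
```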

In this kind of model R² is the proportional reduction in prediction error from using the quadratic function instead of the mean ȳ to predict y. A comparison of R² with r² for the straight-line model indicates how much better the quadratic function fits. The null hypothesis that the quadratic term doesn't add to the model can be tested: H₀: β₂ = 0.

Conclusions should be drawn carefully, because sometimes other shapes are possible too. Parsimony should be the goal: models shouldn't have more parameters than necessary.

14.6 What do exponential regression and log transforms look like?

An exponential regression function is E(y) = α β^x. It only takes positive values and either keeps increasing (β > 1) or keeps decreasing (β < 1). The logarithm of the mean is linear in x: log(μ) = log α + (log β)x. In this model, β is the multiplicative change in the mean of y for an increase of 1 in x. When a curved relationship needs to be made linear, log transforms can be used: they linearize the relationship.
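The linearizing effect of the log transform can be sketched as follows. This is one simple approach, assumed for illustration: least squares on log(y), which models the mean of log(y) and is not identical to a GLM with a log link. The data are generated exactly from y = 2 · 1.5^x, and the function name is hypothetical.

```python
import math


def fit_exponential(x, y):
    """Fit y = alpha * beta**x by a straight-line fit to log(y)."""
    log_y = [math.log(yi) for yi in y]
    n = len(x)
    x_bar, ly_bar = sum(x) / n, sum(log_y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))
    slope = sxy / sxx                       # estimates log(beta)
    intercept = ly_bar - slope * x_bar      # estimates log(alpha)
    return math.exp(intercept), math.exp(slope)


# Hypothetical data lying exactly on an exponential curve.
x = [0, 1, 2, 3, 4]
y = [2.0 * 1.5 ** xi for xi in x]
alpha_hat, beta_hat = fit_exponential(x, y)
print(round(alpha_hat, 6), round(beta_hat, 6))
```

The estimated β is the factor by which the fitted value is multiplied each time x goes up by 1.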

14.7 What are robust variance and nonparametric regression?

Robust variance estimates make regression inference more reliable when assumptions are violated. This method still uses the least squares line but doesn't assume constant variance when finding standard errors. Instead, the standard errors are adjusted to the variability observed in the sample data. This is called the sandwich estimate or robust standard error estimate. Some software can calculate these standard errors, so that they can be compared to the regular standard errors. If they differ a lot, the assumptions are badly violated. Robust variance estimates can also be applied to strongly correlated data such as clusters. Then generalized estimating equations (GEE) are used: the estimating equations resemble those of maximum likelihood, but they don't require a full parametric probability distribution that incorporates the correlations.
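To make the comparison of regular and robust standard errors concrete, here is a minimal sketch for the slope of a straight-line model. It uses the simplest sandwich form without a small-sample correction (often labelled HC0), which is an assumption of this example and not necessarily what a given software package reports. The data and function name are hypothetical.

```python
import math


def slope_standard_errors(x, y):
    """Slope with its regular and its sandwich (HC0) standard error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Regular standard error: assumes one constant variance s^2.
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_regular = math.sqrt(s2 / sxx)
    # Sandwich standard error: each squared residual stands in for its own variance.
    meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, residuals))
    se_robust = math.sqrt(meat / sxx ** 2)
    return slope, se_regular, se_robust


# Hypothetical data in which the spread of y grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.8, 3.3, 3.6, 5.8, 5.1, 8.4, 6.2]
slope, se_regular, se_robust = slope_standard_errors(x, y)
print(round(slope, 3), round(se_regular, 3), round(se_robust, 3))
```

Both standard errors belong to the same least squares slope. Only the assumption about the variance differs.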

A more recently developed nonparametric method is generalized additive modelling. This is a generalization of the generalized linear model in which the relationship is a smooth curve of unspecified shape. Smoothing a curve exposes larger trends. Popular smoothers are LOESS and kernel smoothing.
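As an illustration of kernel smoothing, the sketch below uses a Gaussian-weighted local average (the Nadaraya-Watson form). This particular smoother, the bandwidth value and the data are choices made for this example, not taken from the text.

```python
import math


def kernel_smooth(x, y, grid, bandwidth):
    """Weighted average of y around each grid point, with Gaussian weights."""
    smoothed = []
    for g in grid:
        weights = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
        smoothed.append(sum(w * yi for w, yi in zip(weights, y)) / sum(weights))
    return smoothed


# Hypothetical data: an upward trend with a zigzag on top.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.0, 1.8, 1.1, 2.9, 2.2, 3.8, 3.1, 4.9, 4.2, 5.6]
smooth = kernel_smooth(x, y, grid=x, bandwidth=1.5)
print([round(s, 2) for s in smooth])
```

A larger bandwidth averages over more neighbours and gives a smoother curve. A smaller bandwidth follows the individual observations more closely.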

Summary of Statistical methods for the social sciences by Agresti, 5th edition, 2018. Summary in English.

Author: Annemarie JoHo
