How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14

14.1 What strategies are available for selecting a model?
14.2 How can you tell when a statistical model doesn't fit?
14.3 How do you detect multicollinearity and what are its consequences?
14.4 What are the characteristics of generalized linear models?
14.5 What is polynomial regression?
14.6 What do exponential regression and log transforms look like?
14.7 What are robust variance and nonparametric regression?

14.1 What strategies are available for selecting a model?

Three basic rules for selecting variables to add to a model are:

Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables
Add enough variables for good predictive power
Keep the model simple

The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. Backward elimination starts with all candidate variables in the model and removes the variable with the largest P-value, one at a time, until only significant variables remain. Forward selection starts from scratch and adds, one at a time, the variable with the lowest P-value, until no remaining variable is significant. Stepwise regression is a variation on forward selection: each time a new variable is added, it also removes variables that have become redundant.
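The backward elimination strategy described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular software package: the function names, the simulated data and the 0.05 cut-off are assumptions, and the P-values use a normal approximation to the t distribution, which is only reasonable for larger samples.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided P-values for the slopes of an OLS fit."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    s2 = resid @ resid / (n - design.shape[1])       # residual variance
    cov = s2 * np.linalg.inv(design.T @ design)
    t = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation instead of the exact t distribution.
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return p[1:]                                     # drop the intercept

def backward_elimination(X, y, names, alpha=0.05):
    """Drop the least significant variable until all P-values are <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[j] for j in keep]

# Hypothetical data: only x1 and x2 truly affect y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
selected = backward_elimination(X, y, ["x1", "x2", "x3", "x4"])
print(selected)
```

Because every test has a 5% false positive rate, an irrelevant variable can occasionally survive the procedure, which is one reason the choice should not be left to software alone.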

Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

Several criteria indicate a good model. To find a model with strong explanatory power but without an overabundance of variables, the adjusted R² is used:

$R_{adj}^2 = \frac{s_y^2-s^2}{s_y^2}$

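Here s_y² is the sample variance of y and s² is the estimated conditional variance of y around the regression equation. Unlike the ordinary R², the adjusted R² decreases when an added variable contributes too little to justify the extra parameter.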
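As a small illustration of the adjusted R² formula, the sketch below computes both R² and adjusted R² for a model with one useful variable and for the same model with an extra noise variable. The data are simulated and all names are hypothetical.

```python
import numpy as np

def r_squared_and_adjusted(X, y):
    """Return (R², adjusted R²) for an OLS fit of y on the columns of X."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sse = resid @ resid                     # sum of squared errors
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    s2 = sse / (n - p - 1)                  # estimated conditional variance
    s2_y = tss / (n - 1)                    # sample variance of y
    return 1 - sse / tss, (s2_y - s2) / s2_y

# Hypothetical data: x1 matters, noise_var does not.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
noise_var = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

r2_small, adj_small = r_squared_and_adjusted(x1[:, None], y)
r2_big, adj_big = r_squared_and_adjusted(np.column_stack([x1, noise_var]), y)
print(f"one variable:  R2={r2_small:.4f}  adjusted={adj_small:.4f}")
print(f"two variables: R2={r2_big:.4f}  adjusted={adj_big:.4f}")
```

The ordinary R² can never go down when a variable is added, whereas the adjusted version is penalized for the extra parameter.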

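Cross-validation checks how close the predicted values are to the observed values when each observation is predicted from a model that was fitted without that observation. The result is the predicted residual sum of squares (PRESS):

$PRESS = \sum (y_i - \hat{y}_{(i)})^2$

in which $\hat{y}_{(i)}$ is the prediction for observation i from the model fitted to the other n - 1 observations.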

The smaller the PRESS, the better the predictions. However, this criterion assumes a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷ_i tends to be as close as possible to E(y_i). The smaller the AIC, the better the predictions.
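A leave-one-out version of PRESS, in which every observation is predicted from a model fitted without it, can be sketched as follows. The second function uses the well-known shortcut based on leverages, which avoids refitting the model n times. The data and function names are hypothetical.

```python
import numpy as np

def press_leave_one_out(X, y):
    """PRESS by refitting the model n times, each time omitting one observation."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(design[mask], y[mask], rcond=None)
        total += (y[i] - design[i] @ beta) ** 2
    return total

def press_shortcut(X, y):
    """Same quantity from a single fit: residual / (1 - leverage)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    resid = y - hat @ y
    return float(np.sum((resid / (1 - np.diag(hat))) ** 2))

# Hypothetical data
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([0.8, -0.4]) + rng.normal(size=40)

press_a = press_leave_one_out(X, y)
press_b = press_shortcut(X, y)
print(press_a, press_b)
```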

14.2 How can you tell when a statistical model doesn't fit?

Inference of parameters in a regression model has the following assumptions:

The model fits the shape of the data
The conditional distribution of y is normal
The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)
It's a random sample

Large violations of these assumptions can make the inference unreliable.

When y has the normal distribution, the residuals do too. A studentized residual is a standardized version: the residual divided by its standard error. It indicates how large the residual is relative to what sampling variability alone would produce. A studentized residual exceeding 3 in absolute value may indicate an outlier.

In longitudinal research the assumption of a random sample with independent observations may be violated, because observations that are close together in time are often strongly correlated. A plot of the residuals against time can check this. This kind of correlation adversely affects most inferential statistics. In longitudinal research, often conducted within social science and in a relatively short time frame, a linear mixed model is used. However, when research involves time series and a longer time frame, econometric methods are more appropriate.

Many statistics measure the effects of outliers. The residual measures how far y is from the trend. The leverage (h) measures how far the explanatory variables are from their means. Observations with both a large residual and a high leverage have a big influence on the fit.

DFBETA describes the effect of an observation on the estimates of the parameters: how much an estimate changes when that observation is omitted. DFFIT and Cook's distance describe the effect on the fitted values when a certain observation is omitted.
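The diagnostics above can be computed directly from the hat matrix. The sketch below is an illustration under assumptions: it uses internally studentized residuals (software packages may report a slightly different, externally studentized version), and the data, including the deliberately added extreme point, are made up.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, studentized residuals and Cook's distance for an OLS fit."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                          # parameters incl. intercept
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    h = np.diag(hat)                             # leverage of each observation
    resid = y - hat @ y
    s2 = resid @ resid / (n - k)
    student = resid / np.sqrt(s2 * (1 - h))      # residual / its standard error
    cooks = student ** 2 * h / (k * (1 - h))     # Cook's distance
    return h, student, cooks

# Hypothetical data: a linear trend plus one point far from the other x values
# (high leverage) that also lies far below the trend (large residual).
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=30)
x = np.append(x, 25.0)
y = np.append(y, 0.0)

h, student, cooks = influence_measures(x[:, None], y)
most_influential = int(np.argmax(cooks))
print(most_influential, h[most_influential], cooks[most_influential])
```

The added point combines high leverage with a large residual, so it is flagged as the most influential observation.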

14.3 How do you detect multicollinearity and what are its consequences?

When many explanatory variables are strongly correlated with each other, R² increases only slightly when more variables are added. This is called multicollinearity. It inflates the variance of the estimated coefficients, so the standard errors increase and the confidence intervals become wider. This is measured by the variance inflation factor (VIF), the multiplicative increase in the variance of the estimate of β_j that is caused by the correlation of x_j with the other explanatory variables. In the formula, R_j² is the R² from regressing x_j on the other explanatory variables:

$VIF = 1/(1-R_j^2)$

Even without the VIF, indications of multicollinearity are visible in the fitted model, for example when an estimated coefficient changes substantially after another variable is added. Remedies are choosing only some of the variables, combining variables or centering them. With factor analysis new, artificial variables are created from the existing variables to avoid correlation. But usually factor analysis isn't necessary.
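The VIF can be computed by regressing each explanatory variable on all the others. A minimal sketch, with simulated data in which two of the three variables are nearly identical (all names are hypothetical):

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j²), with R_j² from regressing column j on the rest."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2_j = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two nearly identical variables get very large VIF values, while the unrelated variable stays close to 1, the value that indicates no inflation.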

14.4 What are the characteristics of generalized linear models?

Generalized linear models (GLM) is a broad term that includes regression models with a normal distribution, alternative models for continuous variables without a normal distribution, and models for discrete variables.

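The outcome of a GLM is often binary, sometimes a count. For binary data the GLM uses the binomial distribution, for counts the Poisson or negative binomial distribution. For continuous data that can only be positive, the GLM can use the gamma distribution.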

A GLM has a link function: an equation that connects the mean of the response variable to the explanatory variables, g(μ) = α + β₁x₁ + β₂x₂ + … + β_px_p. When the mean can't be negative, as with counts, the log link is used, which gives a loglinear model: log(μ) = α + β₁x₁ + β₂x₂ + … + β_px_p. A logistic regression model uses the logit link: g(μ) = log[μ /(1-μ)]. This is useful when μ is a probability between 0 and 1. The simplest is the identity link: g(μ) = μ.

Because a GLM is fitted with the maximum likelihood method under the chosen distribution, the data don't need to have a normal distribution. For a GLM, maximum likelihood estimation amounts to a weighted version of least squares, which gives more weight to observations with a smaller variability.
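To illustrate how maximum likelihood for a GLM reduces to repeated weighted least squares, here is a sketch of iteratively reweighted least squares for a Poisson model with a log link. It is a bare-bones illustration with a fixed number of iterations and no convergence checks, not how a statistics package implements it. The data and names are hypothetical.

```python
import numpy as np

def fit_poisson_log_link(X, y, n_iter=25):
    """Fit log(mu) = alpha + beta*x by iteratively reweighted least squares."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta = np.zeros(design.shape[1])
    beta[0] = np.log(y.mean())                 # start from the intercept-only fit
    for _ in range(n_iter):
        eta = design @ beta                    # linear predictor
        mu = np.exp(eta)                       # inverse of the log link
        z = eta + (y - mu) / mu                # working response
        w = mu                                 # inverse variance of the working response
        weighted = design * w[:, None]
        # Weighted least squares step: solve (X'WX) beta = X'Wz
        beta = np.linalg.solve(design.T @ weighted, weighted.T @ z)
    return beta

# Hypothetical count data with true alpha = 0.5 and beta = 0.8
rng = np.random.default_rng(6)
x = rng.uniform(0, 2, size=500)
y = rng.poisson(np.exp(0.5 + 0.8 * x))
beta_hat = fit_poisson_log_link(x[:, None], y)
print(beta_hat)
```

At convergence the estimates solve the likelihood equations, which for this model state that the residuals y - μ are uncorrelated with every column of the design matrix.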

A gamma distribution allows the standard deviation to vary with the mean. This is a form of heteroscedasticity: the standard deviation increases when the mean increases. The variance is φμ² and the standard deviation is:

$\sqrt{\phi}\mu$

φ is the scale parameter. It determines the shape of the distribution: for small φ the distribution is close to a bell shape, for large φ it is strongly skewed to the right.
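A quick simulation shows the relation between the mean and the standard deviation of a gamma distribution. The sketch assumes NumPy's parameterization (shape and scale), in which a distribution with mean μ and variance φμ² has shape 1/φ and scale φμ. The chosen values of φ and μ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
phi = 0.25                        # arbitrary value of the scale parameter

ratios = []
for mu in (1.0, 5.0, 20.0):
    # NumPy's gamma has mean = shape * scale and variance = shape * scale**2,
    # so shape = 1/phi and scale = phi*mu give mean mu and variance phi*mu**2.
    sample = rng.gamma(shape=1 / phi, scale=phi * mu, size=200_000)
    ratios.append(sample.std() / sample.mean())
    print(f"mu={mu:5.1f}  mean={sample.mean():7.3f}  sd={sample.std():7.3f}")

# The standard deviation grows in proportion to the mean: sd / mean is about sqrt(phi).
print(ratios, np.sqrt(phi))
```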

14.5 What is polynomial regression?

When a relationship is not linear but curvilinear, a polynomial regression function can be used, for instance E(y) = α + β₁x + β₂x². The highest power is called the degree. A polynomial regression function of degree two is a quadratic regression model, which describes a parabola.

A cubic function is a polynomial function of degree three, but usually a function of degree two suffices. For a straight line the slope is constant, but for a polynomial function the slope changes with x. When the coefficient of x² is positive, the curve is U-shaped. When the coefficient is negative, the curve is shaped like an inverted U. The lowest or highest point of the parabola is at x = −β₁/(2β₂).

In this kind of model, R² is the proportional reduction in prediction error from using the quadratic function instead of the mean of y to predict y. Comparing this R² with r² for the straight-line model indicates how much better the quadratic function fits. The null hypothesis that the quadratic term doesn't add to the model can be tested: H₀: β₂ = 0.

Conclusions should be drawn carefully, because sometimes other shapes are possible too. Parsimony should be the goal: models shouldn't have more parameters than necessary.
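A quadratic regression can be fitted with `numpy.polyfit`. The sketch below uses simulated data with a negative coefficient for x², locates the turning point of the parabola and compares the fit with that of a straight line. The data and variable names are hypothetical.

```python
import numpy as np

def r_squared(y, fitted):
    """Proportional reduction in error relative to predicting with the mean of y."""
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Hypothetical data: E(y) = 1 + 4x - 0.5x², which peaks at x = 4.
rng = np.random.default_rng(3)
x = np.linspace(0, 8, 100)
y = 1.0 + 4.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.5, size=x.size)

# polyfit returns coefficients from the highest power down.
b2, b1, a = np.polyfit(x, y, deg=2)
slope, intercept = np.polyfit(x, y, deg=1)

turning_point = -b1 / (2 * b2)          # highest point, because b2 < 0
r2_quadratic = r_squared(y, a + b1 * x + b2 * x ** 2)
r2_linear = r_squared(y, intercept + slope * x)

print(f"b2={b2:.3f}, turning point at x={turning_point:.2f}")
print(f"R2 quadratic={r2_quadratic:.3f}, r2 linear={r2_linear:.3f}")
```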

14.6 What do exponential regression and log transforms look like?

An exponential regression function is E(y) = α β^x. It only has positive values and either keeps increasing (β > 1) or keeps decreasing toward 0 (β < 1). The logarithm of the mean is linear in x: log(μ) = log α + (log β)x. In this model, β is the multiplicative change in the mean of y for an increase of 1 in x. When a curved relationship needs to be turned into a linear one, log transforms can be used because they linearize the relationship.
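The linearizing effect of the log transform can be shown by fitting a straight line to log(y) and transforming the coefficients back. Note the assumption involved: least squares on log(y) models the mean of log(y), which is not exactly the same as the log of the mean that a GLM with a log link would model. The data are simulated and the names are hypothetical.

```python
import numpy as np

# Hypothetical data: y = alpha * beta**x with alpha = 3, beta = 1.2,
# disturbed by a small multiplicative error.
rng = np.random.default_rng(8)
x = np.linspace(0, 10, 50)
y = 3.0 * 1.2 ** x * np.exp(rng.normal(scale=0.05, size=x.size))

# Straight-line fit on the log scale: log(y) = log(alpha) + log(beta) * x
log_beta, log_alpha = np.polyfit(x, np.log(y), deg=1)
alpha_hat = np.exp(log_alpha)
beta_hat = np.exp(log_beta)

print(f"alpha={alpha_hat:.3f}, beta={beta_hat:.3f}")
# beta_hat is the estimated multiplicative change in y per unit increase in x.
```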

14.7 What are robust variance and nonparametric regression?

Robust variance estimation makes regression inference less dependent on the model assumptions. The method still uses the least squares line but doesn't assume that the variance is constant when finding standard errors. Instead, the standard errors are adjusted to the variability observed in the sample data. This is called the sandwich estimate or robust standard error estimate. Software can sometimes calculate these standard errors, so they can be compared to the regular standard errors. If they differ a lot, the assumptions are badly violated. Robust variance can also be applied to strongly correlated data such as clusters. Then generalized estimating equations (GEE) are used: estimating equations that resemble those of maximum likelihood but that don't require a full parametric probability distribution describing the correlations.
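The comparison between regular and robust standard errors can be sketched as follows. The robust version here is the simplest sandwich form (often labelled HC0) for independent observations. It is one of several variants and does not cover the clustered case that GEE addresses. The heteroscedastic data are simulated and all names are hypothetical.

```python
import numpy as np

def ols_standard_errors(X, y):
    """OLS estimates with regular and sandwich (HC0) standard errors."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    bread = np.linalg.inv(design.T @ design)
    beta = bread @ design.T @ y
    resid = y - design @ beta
    # Regular standard errors: assume one constant error variance.
    s2 = resid @ resid / (n - design.shape[1])
    se_regular = np.sqrt(np.diag(s2 * bread))
    # Sandwich: each observation contributes its own squared residual.
    meat = design.T @ (design * (resid ** 2)[:, None])
    se_robust = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se_regular, se_robust

# Hypothetical data in which the error spread grows strongly with x.
rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=2000)
y = 1.0 + 2.0 * x + rng.normal(size=x.size) * 0.1 * x ** 2

beta, se_regular, se_robust = ols_standard_errors(x[:, None], y)
print("estimates:", beta)
print("regular SE:", se_regular)
print("robust SE: ", se_robust)
```

In this simulation the constant-variance assumption is violated, so the two sets of standard errors for the slope differ noticeably.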

A recently developed nonparametric method is generalized additive modelling. This is a generalization of the generalized linear model in which the explanatory variables enter through smooth functions instead of linear terms. Smoothing a curve exposes larger trends. Popular smoothers are LOESS and kernel smoothing.
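As an illustration of kernel smoothing, the sketch below implements a basic kernel-weighted average with a Gaussian kernel (the Nadaraya-Watson form). It is a swapped-in simple smoother, not a generalized additive model and not LOESS. The bandwidth, data and names are assumptions.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Weighted average of y around each grid point, with Gaussian weights."""
    scaled = (grid[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * scaled ** 2)        # nearby points weigh more
    return (weights @ y) / weights.sum(axis=1)

# Hypothetical data: a sine-shaped trend hidden in noise.
rng = np.random.default_rng(10)
x = np.linspace(0, 2 * np.pi, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

grid = np.linspace(1.0, 5.28, 50)               # interior points only
smooth = kernel_smooth(x, y, grid, bandwidth=0.3)
max_error = np.max(np.abs(smooth - np.sin(grid)))
print(max_error)
```

A larger bandwidth gives a smoother but more biased curve, while a smaller one follows the noise more closely.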
