13. Several Aspects of Regression Analysis

This chapter focuses on topics that deepen the understanding of regression analysis, including alternative specifications for regression models and what happens when basic regression assumptions are violated.

13.1. Developing models

The goal when developing a model is to approximate complex reality as closely as possible with a relatively simple model, which can then be used to provide insight into that reality. It is impossible to represent all of the influences of the real situation in a model; instead, only the most influential variables are selected.

Building a statistical model has 4 stages:

  1. Model Specification: This step involves the selection of the variables (dependent and independent), the algebraic form of the model, and the required data. To do this correctly it is important to understand the underlying theory and context for the model, which may require serious study and analysis. This step is crucial to the integrity of the model.
  2. Coefficient Estimation: This step involves using the available data to estimate the coefficients and/or parameters in the model. The desired values are dependent on the objective of the model. Roughly there are two goals:
    1. Predicting the mean of the dependent variable: In this case it is desirable to have a small standard error of the estimate, se. The correlations between the independent variables should be low, and there should be a wide spread in these independent variables, as this keeps the prediction variance small.
    2. Estimating one or more coefficients: In this case a number of problems arise, because there is always a trade-off between estimator bias and variance, within which a proper balance must be found. Including an independent variable that is highly correlated with other independent variables decreases bias but increases variance; excluding the variable decreases variance but increases bias. This is because both these correlations and the spread of the independent variables influence the standard deviation of the slope coefficients, sb.
  3. Model Verification: This step involves checking whether the model is an accurate portrayal of reality. This is important because the simplifications and assumptions made while constructing the model can render it (too) inaccurate. It is important to examine the regression assumptions, the model specification, and the selected data. If something is wrong here, we return to step 1.
  4. Interpretation and Inference: This step involves drawing conclusions from the outcomes of the model. Here it is important to remain critical: inferences drawn from these outcomes can only be accurate if the previous three steps have been completed properly. If the outcomes differ from expectations or previous findings, you must ask whether this is due to the model or whether you really have found something new.
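As an illustration of stages 2 and 3, a simple linear model can be fitted by least squares. This is a minimal sketch using hypothetical data and NumPy, not a procedure prescribed by the text:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# 1. Specification: a simple linear model  y = b0 + b1*x + error.
X = np.column_stack([np.ones_like(x), x])

# 2. Coefficient estimation by least squares.
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. Verification: inspect the residuals for obvious structure.
residuals = y - (b0 + b1 * x)

# 4. Interpretation: b1 estimates the change in y per unit change in x.
```

The same four-stage logic applies to models with many independent variables; only the design matrix X grows.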

13.2. Further Application of Dummy Variables

Dummy variables were introduced in chapter 12 as a way to include categorical variables in regression analysis. Further uses for these variables will be discussed here.

Dummy variables take the value 1 or 0 to represent two categories. More than two categories can be represented by a combination of multiple dummy variables; the rule is: number of dummy variables = number of categories − 1. So for three categories, two dummy variables are used. For example:

Yes:    x1 = 1    x2 = 1

Maybe:  x1 = 1    x2 = 0

No:     x1 = 0    x2 = 0
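The coding in the table above can be constructed programmatically. A minimal sketch with hypothetical response data:

```python
import numpy as np

# Hypothetical survey responses with three categories.
responses = ["Yes", "Maybe", "No", "Yes", "No"]

# Two dummy variables encode the three categories, following the table:
# Yes -> (1, 1), Maybe -> (1, 0), No -> (0, 0).
x1 = np.array([1 if r in ("Yes", "Maybe") else 0 for r in responses])
x2 = np.array([1 if r == "Yes" else 0 for r in responses])
```

Each category corresponds to a distinct (x1, x2) combination, so the two dummies together identify all three groups in a regression.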

Time series data can be portrayed by dummy variables in the same manner. In this case time periods are the categories.

Dummy variables are also increasingly popular as a tool in experimental designs. Here again a similar specification is used to represent several levels of the treatment. Experimental designs also contain so-called blocking variables, which are part of the environment and cannot be randomized or preselected. By using dummy variables these can be included in the model in such a manner that their variability is removed from the experimental error.

13.3. Values in Time-Series Data

When measurements are taken over successive time periods, the observations form time-series data and are specified in formulas with the subscript "t". Values of a variable from earlier periods are referred to as lagged values.

It is important to be aware of lagged values because the value of the dependent variable in one time period is often related to the value of a previous time period. This previous value of the dependent variable is called a lagged dependent variable.

With lagged dependent variables, coefficient estimation, confidence intervals, and hypothesis tests proceed as for any other independent variable. Caution is nevertheless advised, because the equation errors, εt, may no longer be independent. In that case the coefficient estimates are no longer efficient (though still unbiased), and confidence intervals and hypothesis tests are no longer valid. If the equation errors remain independent, the quality of the approximation improves as the number of sample observations increases.
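A lagged dependent variable is simply the same series shifted back one period. A minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical quarterly series; values are illustrative only.
y = np.array([10.0, 10.8, 11.5, 12.6, 13.1, 14.0, 14.9, 15.7])

# Build the lagged dependent variable y_{t-1}; the first observation
# is dropped because it has no predecessor.
y_t = y[1:]
y_lag = y[:-1]

# Estimate  y_t = b0 + b1 * y_{t-1}  by ordinary least squares.
X = np.column_stack([np.ones_like(y_lag), y_lag])
(b0, b1), *_ = np.linalg.lstsq(X, y_t, rcond=None)
```

Note that the usable sample shrinks by one observation per lag included.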

13.4. Inclusion of proper independent variables

Models can never be as complete as reality: a model cannot contain all variables that are likely to affect the dependent variable in the real world, so a selection must be made. The joint influence of the variables that are not selected is then absorbed into the error term. If an important variable is omitted, however, the estimated coefficients of the other independent variables will be different, and conclusions drawn from the model may be faulty.

13.5. Multicollinearity of independent variables

It is possible for two independent variables to be highly correlated, in which case the estimated coefficients of the model can be very misleading. This phenomenon is referred to as multicollinearity. The problem arises from the data itself, and there is often little that can be done about it, but it remains important to watch for. Indications of multicollinearity are:

  • The regression coefficients are very different from previous research or expectations.
  • Coefficients of variables that are believed to have a strong influence have a small Student's t-statistic.
  • The Student's t-statistics are small for all coefficients, even though the F-statistic is large (indicating no detectable individual effects but a strong effect for the model as a whole).
  • There are high correlations between individual independent variables and/or there is a strong linear regression relationship between independent variables.

There are several possibilities to correct multicollinearity:

  • Removing one or more of the highly correlated independent variables (this may also have side effects, see 13.4).
  • Changing the specification of the model by including a new variable that is a function of the correlated variables.
  • Obtaining additional data where the correlation between the independent variables is weaker.
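The indications above can be checked numerically. The sketch below uses constructed data in which one independent variable is nearly a linear function of the other, and computes their correlation together with a variance inflation factor (VIF, a common diagnostic not named in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Constructed data: x2 is almost a linear function of x1, so the two
# independent variables are highly collinear.
n = 100
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)

# Pairwise correlation between the independent variables.
corr = np.corrcoef(x1, x2)[0, 1]

# Variance inflation factor for x1: regress x1 on x2, then 1 / (1 - R^2).
X = np.column_stack([np.ones(n), x2])
coef, *_ = np.linalg.lstsq(X, x1, rcond=None)
resid = x1 - X @ coef
r2 = 1.0 - resid.var() / x1.var()
vif = 1.0 / (1.0 - r2)
```

A large VIF signals that the variable's slope coefficient will have a large standard deviation sb, matching the symptoms listed above.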

13.6. Variance of error terms

When one or more of the regression assumptions are violated, the least squares method can lead to inefficient coefficient estimates and misleading conclusions. One of these assumptions is homoscedasticity: the assumption that the error terms all have the same variance and are uncorrelated with the independent variables. This assumption is violated if a model exhibits heteroscedasticity (non-constant error variance). There are various ways to check for this:

  • Relating the error variance to an alternative explanation.
  • Making a scatterplot of the residuals versus the independent variables and the predicted values from the regression. A visible relationship (like errors increasing with increasing X-values) is a sign of heteroscedasticity.
  • Testing the null hypothesis that the error terms have the same variance, against the alternative hypothesis that their variances depend on the expected values. This procedure can be used when the predicted value of the dependent variable has a linear relationship with the variance of the error term.
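The residual-based check can be sketched as follows, using constructed data in which the error spread grows with X (an informal version of the scatterplot diagnostic, not a formal test):

```python
import numpy as np

rng = np.random.default_rng(1)

# Constructed data whose error standard deviation grows with x.
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)  # error SD proportional to x

# Fit by least squares and compute the residuals.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Informal check: compare residual spread for small vs large x values.
order = np.argsort(x)
low_spread = resid[order[: n // 2]].std()
high_spread = resid[order[n // 2:]].std()
```

A clearly larger spread in the high-X half is the numerical counterpart of the fanning-out pattern seen in a residual scatterplot.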

It is also possible that there is merely an appearance of heteroscedasticity, for example when a logarithmic model is more appropriate but a linear regression model was estimated instead.

13.7. Correlated error terms

The error term represents all variables that influence the dependent variable other than the independent variables. In time-series data this term behaves differently, as many of these omitted variables may behave similarly over time. This can result in correlation between error terms, referred to as autocorrelated errors. The consequences are that the estimated standard errors of the coefficients are biased, null hypotheses may be falsely rejected, and confidence intervals are too narrow.

Assuming all errors have the same variance, the structure for autocorrelation, known as the first-order autoregressive model of autocorrelated behaviour, is:

εt = ρ·εt−1 + ut

where ρ is a correlation coefficient and ut is a random variable that is not autocorrelated. The coefficient ρ ranges from −1 to +1: ρ = 0 signifies no autocorrelation, while values near −1 or +1 signify strong autocorrelation.

Autocorrelation can also be detected by plotting the residuals over time: long runs of residuals with the same sign suggest positive autocorrelation, a jagged pattern that alternates rapidly in sign suggests negative autocorrelation, and a patternless scatter suggests no autocorrelation.

A more formal test for autocorrelation is the Durbin-Watson test, based on the model residuals et. It is calculated as follows:

d = Σt=2…n (et − et−1)² / Σt=1…n et²

The Durbin-Watson statistic can also be written as d = 2(1 − r), where r is the sample estimate of the correlation between adjacent errors. If the errors are not autocorrelated, r ≈ 0 and d ≈ 2.

Positive autocorrelation is indicated by 0 ≤ d < 2.

Negative autocorrelation is indicated by 2 < d ≤ 4.
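The d-statistic can be computed directly from the residuals. In this sketch, `durbin_watson` is a hypothetical helper name and the residual series is made up:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive residual
    differences divided by the sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Smooth residuals with long runs of the same sign give d well below 2,
# suggesting positive autocorrelation.
smooth = np.array([1.0, 1.2, 0.9, 1.1, -0.8, -1.0, -1.1, -0.9])
d = durbin_watson(smooth)
r = 1 - d / 2  # sample estimate of the correlation between adjacent errors
```

A constant residual series gives d = 0 (extreme positive autocorrelation), while a perfectly alternating series pushes d toward 4.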

When the errors are autocorrelated, the regression procedure needs to be modified to remove the effect. Estimating the coefficients of such a model takes the following steps:

  1. Estimate the model by least squares and compute the Durbin-Watson d-statistic from the residuals. The autocorrelation parameter is then estimated as r = 1 − d/2.
  2. Use least squares to estimate a second regression on the transformed variables:
    Dependent variable: yt → (yt − r·yt−1)
    Independent variables: xjt → (xjt − r·xj,t−1)
  3. Divide the estimated intercept from this second model by (1 − r) to obtain the correct estimated intercept for the original model.
  4. Use the output from the second model to carry out hypothesis tests and construct confidence intervals.
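The steps above can be sketched as follows, with constructed data whose errors follow the first-order autoregressive scheme (ρ = 0.7 is an assumed value for the simulation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Constructed data with first-order autocorrelated errors (rho = 0.7).
n = 120
x = rng.uniform(0.0, 10.0, size=n)
u = rng.normal(scale=0.5, size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
y = 4.0 + 1.5 * x + eps  # true intercept 4.0, true slope 1.5

# Step 1: ordinary least squares, then r = 1 - d/2 from the residuals.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
r = 1.0 - d / 2.0

# Step 2: second regression on the quasi-differenced variables.
y_star = y[1:] - r * y[:-1]
x_star = x[1:] - r * x[:-1]
X_star = np.column_stack([np.ones(n - 1), x_star])
(a_star, b1), *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Step 3: recover the intercept of the original model.
b0 = a_star / (1.0 - r)
```

With the transformation applied, the errors of the second regression are approximately independent, so its standard errors and test statistics can be used as in step 4.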

An even more severe problem arises in a model with both lagged dependent variables and autocorrelated errors. Such a model also needs to be modified, using a variation of the procedure explained above.

Follow the author: Dara Yapp