13. Several Aspects of Regression Analysis
This chapter focusses on topics that add to the understanding of regression analysis. This includes alternative specifications for these models, and what happens in the situations where basic regression assumptions are violated.
13.1. Developing models
The goal when developing a model is to approximate a complex reality as closely as possible with a relatively simple model, which can then be used to provide insight into that reality. It is impossible to represent every influence of the real situation in a model, so only the most influential variables are selected.
Building a statistical model has 4 stages:
- Model Specification: This step involves the selection of the variables (dependent and independent), the algebraic form of the model, and the required data. To do this correctly it is important to understand the underlying theory and context for the model. This stage may require serious study and analysis, and it is crucial to the integrity of the model.
- Coefficient Estimation: This step involves using the available data to estimate the coefficients and/or parameters in the model. The desired values are dependent on the objective of the model. Roughly there are two goals:
- Predicting the mean of the dependent variable: In this case it is desirable to have a small standard error of the estimate, $s_e$. The correlations between the independent variables need to remain stable, and the independent variables need a wide spread, as this keeps the prediction variance small.
- Estimating one or more coefficients: Here there is always a trade-off between estimator bias and variance, and a proper balance must be found. Including an independent variable that is highly correlated with other independent variables decreases bias but increases variance; excluding it decreases variance but increases bias. This is because both these correlations and the spread of the independent variables influence the standard deviation of the slope coefficients, $s_b$.
- Model Verification: This step involves checking whether the model is still accurate in its portrayal of reality. This is important because simplifications and assumptions are often made while constructing the model, which can make the model too inaccurate. It is important to examine the regression assumptions, the model specification, and the selected data. If something is wrong here, return to step 1.
- Interpretation and Inference: This step involves drawing conclusions from the outcomes of the model. Here it is important to remain critical. Inferences drawn from these outcomes can only be accurate if the previous three steps have been completed properly. If the outcomes differ from expectations or previous findings, you must judge critically whether this is due to the model or whether you really have found something new.
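The trade-off described under "Estimating one or more coefficients" can be made concrete with a standard result for a regression with two independent variables, x1 and x2. The notation is an assumption introduced here for illustration and does not appear in the text above: n is the number of observations, $s_{x_1}^2$ is the sample variance of x1, and $r_{x_1 x_2}$ is the sample correlation between x1 and x2.

```latex
s_{b_1}^2 = \frac{s_e^2}{(n-1)\, s_{x_1}^2 \,\bigl(1 - r_{x_1 x_2}^2\bigr)}
```

A wider spread in x1 (larger $s_{x_1}^2$) makes the variance of the slope estimate smaller, while a correlation between x1 and x2 close to -1 or +1 makes it larger.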
13.2. Further Application of Dummy Variables
Dummy variables were introduced in chapter 12 as a way to include categorical variables in regression analysis. Further uses for these variables will be discussed here.
Dummy variables have values of either 1 or 0, to represent two categories. It is also possible to represent more than two categories by using a combination of multiple dummy variables. The rule is: number of categories -1 = number of dummy variables. So for three categories, two dummy variables are used. For example:
Yes: x1 = 1, x2 = 1
Maybe: x1 = 1, x2 = 0
No: x1 = 0, x2 = 0
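As a minimal sketch of how this coding could be used in a regression, the Python snippet below encodes the three categories with the scheme above and fits a least squares model with NumPy. The responses and outcome values are invented for illustration, and the names `DUMMY_CODES` and `encode` are hypothetical.

```python
import numpy as np

# Coding scheme taken from the example above: category -> (x1, x2).
DUMMY_CODES = {"Yes": (1, 1), "Maybe": (1, 0), "No": (0, 0)}

def encode(categories):
    """Return an (n, 2) array holding the dummy variables x1 and x2."""
    return np.array([DUMMY_CODES[c] for c in categories], dtype=float)

# Hypothetical responses and outcomes, invented for illustration only.
answers = ["Yes", "Maybe", "No", "Yes", "No", "Maybe"]
y = np.array([10.0, 7.0, 4.0, 12.0, 6.0, 9.0])

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones(len(answers)), encode(answers)])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# With this coding the fitted group means are:
#   No = b0, Maybe = b0 + b1, Yes = b0 + b1 + b2
group_means = {"No": b0, "Maybe": b0 + b1, "Yes": b0 + b1 + b2}
```

With this particular coding, b1 measures the difference between "Maybe" and "No", and b2 the difference between "Yes" and "Maybe".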
Time series data can be portrayed by dummy variables in the same manner. In this case time periods are the categories.
Dummy variables are becoming more popular as a tool in experimental designs. Here a similar specification is used to represent several levels of the treatment. Experimental designs also have so-called blocking variables, which are part of the environment and cannot be randomized or preselected. By using dummy variables, these can be included in the model in such a manner that their variability can be separated from the effects of the independent variables.
13.3. Values in Time-Series Data
When measurements are taken over time, the data form a time series. Such time-series observations are written in formulas with the subscript "t", and the value of a variable from an earlier period, such as t-1, is referred to as a lagged value.
It is important to be aware of lagged values because the value of the dependent variable in one time period is often related to its value in a previous time period. This previous value of the dependent variable is called a lagged dependent variable.
With lagged dependent variables, coefficient estimation, confidence intervals and hypothesis tests are carried out in the same way as for an ordinary regression model. Caution is nevertheless advised with confidence intervals and hypothesis tests, because the equation errors, εt, may no longer be independent. In that case the coefficient estimates are no longer efficient (though unbiased), and confidence intervals and hypothesis tests are no longer valid. If the equation errors remain independent, the quality of the approximation improves as the number of sample observations increases.
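A minimal Python sketch of a regression with a lagged dependent variable is given below. The series is synthetic and generated with independent errors, and the model form y_t = b0 + b1 x_t + gamma y_{t-1} and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series, generated only for illustration:
#   y_t = 2.0 + 0.5 * x_t + 0.6 * y_{t-1} + e_t, with independent errors e_t.
n = 500
x = rng.normal(size=n)
e = rng.normal(scale=0.5, size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 2.0 + 0.5 * x[t] + 0.6 * y[t - 1] + e[t]

# The lagged dependent variable y_{t-1} is y shifted back by one period,
# so the first observation is lost.
y_now = y[1:]
y_lag = y[:-1]
x_now = x[1:]

# Least squares with an intercept, x_t, and the lagged dependent variable.
X = np.column_stack([np.ones(n - 1), x_now, y_lag])
b0, b1, gamma = np.linalg.lstsq(X, y_now, rcond=None)[0]
```

Because the errors in this sketch are independent, the estimates should approach the values used to generate the data as n grows.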
13.4. Inclusion of proper independent variables
Models can never be as complete as reality, because a model cannot contain all variables that are likely to affect the dependent variable in the real world, so a selection must be made. The joint influence of the variables that are not selected is then absorbed in the error term. If, however, an important variable is omitted and it is correlated with the included independent variables, the estimated coefficients of those variables will be biased, and any conclusions drawn from the model may be faulty.
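The effect of omitting an important variable can be illustrated with a small simulation. The sketch below uses synthetic data in which x2 is correlated with x1; the coefficients, sample size, and the helper name `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data, for illustration only. x2 is correlated with x1, and
# both influence y: y = 1.0 + 2.0 * x1 + 3.0 * x2 + error.
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(columns, y):
    """Least squares with an intercept; returns [b0, b1, ...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols([x1, x2], y)   # both variables included
short = ols([x1], y)      # x2 omitted, its effect moves into the error term

print("b1 with x2 included:", round(full[1], 2))   # close to the true 2.0
print("b1 with x2 omitted: ", round(short[1], 2))  # close to 2.0 + 3.0 * 0.8
```

When x2 is left out, the coefficient on x1 also picks up the part of the effect of x2 that moves together with x1, which is why it drifts away from the value used to generate the data.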
13.5. Multicollinearity of independent variables
It is possible for two or more independent variables to be highly correlated. In this case the estimated coefficients of the model can be very misleading. This phenomenon is referred to as multicollinearity. The problem arises out of the data itself, and often there is little that can be done about it. It is nevertheless important to watch for it. Indications of multicollinearity are:
- The regression coefficients are very different from previous research or expectations.
- Coefficients of variables that are believed to have a strong influence have a small Student's t-statistic.
- The Student's t-statistics are small for all coefficients, even though the F-statistic is large (indicating no significant individual effects but a strong effect for the model as a whole).
- There are high correlations between individual independent variables and/or there is a strong linear regression relationship between independent variables.
There are several possibilities to correct multicollinearity:
- Removing one or more of the highly correlated independent variables (this may also have side effects, see 13.4).
- Changing the specification of the model by including a new variable that is a function of the correlated variables.
- Obtaining additional data where the correlation between the independent variables is weaker.
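The last indication listed above (high correlations or a strong linear regression relationship between independent variables) can be checked numerically. The sketch below computes the pairwise correlations and, for each independent variable, the R² from regressing it on the others. The variance inflation factor 1/(1 - R²) is a commonly used summary that is not mentioned in the text and is included as an assumption. The data are synthetic and the function name `auxiliary_r2` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic independent variables, for illustration only:
# x2 is almost a linear function of x1, x3 is unrelated to both.
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the independent variables.
corr = np.corrcoef(X, rowvar=False)

def auxiliary_r2(X, j):
    """R^2 from regressing column j on the other columns plus an intercept."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2 = [auxiliary_r2(X, j) for j in range(X.shape[1])]
vif = [1.0 / (1.0 - value) for value in r2]  # variance inflation factors
```

In this sketch x1 and x2 should show a high correlation and large inflation factors, while x3 should stay close to 1.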
13.6. Variance of error terms
When one or more of the regression assumptions are violated, the least squares method leads to inefficient coefficient estimates and misleading conclusions. One of these assumptions is homoscedasticity: the error terms have uniform variance and are not correlated. This assumption is violated if a model exhibits heteroscedasticity, meaning that the variance of the error terms is not constant. There are various ways to check for this:
- Relating the error variance to a plausible underlying explanation, for example that the variance grows with the size of one of the independent variables.
- Making a scatterplot of the residuals versus the independent variables and the predicted values from the regression. A visible relationship (like errors increasing with increasing X-values) is a sign of heteroscedasticity.
- Testing the null hypothesis that the error terms have the same variance, against the alternative hypothesis that their variances depend on the expected values. This procedure can be used when the predicted value of the dependent variable has a linear relationship with the variance of the error term.
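The hypothesis test in the last item can be sketched as an auxiliary regression of the squared residuals on the predicted values, where the number of observations times the auxiliary R² is compared against a chi-square critical value with one degree of freedom. This specific form of the test is an assumption here, since the text does not spell out the procedure, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data, for illustration only: the error spread grows with x.
n = 300
x = rng.uniform(1.0, 10.0, size=n)
y = 5.0 + 2.0 * x + rng.normal(scale=0.5 * x)

def fit(X, y):
    """Least squares fit; returns fitted values and R^2."""
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return fitted, r2

# Original regression of y on x.
X = np.column_stack([np.ones(n), x])
y_hat, _ = fit(X, y)
residuals = y - y_hat

# Auxiliary regression: squared residuals on the predicted values.
aux_X = np.column_stack([np.ones(n), y_hat])
_, aux_r2 = fit(aux_X, residuals ** 2)

test_statistic = n * aux_r2
CHI2_1DF_5PCT = 3.84  # 5% critical value of chi-square with 1 degree of freedom
reject_equal_variance = test_statistic > CHI2_1DF_5PCT
```

A test statistic above the critical value leads to rejecting the null hypothesis of equal error variances at the 5% level.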
It is also possible that there is only an appearance of heteroscedasticity, for example if a logarithmic model is more appropriate but a linear regression model was estimated instead.
13.7. Correlated error terms
The error term represents all variables that influence the dependent variable, outside of the independent variables. In time-series data this term behaves differently, as many of these variables may behave similarly over time. This can result in a correlation between error terms, also referred to as autocorrelated errors. As a consequence, the estimated standard errors for the coefficients are biased, null hypotheses might be falsely rejected, and confidence intervals would be too narrow.
Assuming all errors have the same variance, the structure for autocorrelation, or the first-order autoregressive model of autocorrelated behaviour, is:
$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$$
where ρ is a correlation coefficient and $u_t$ is a random variable that is not autocorrelated. The coefficient ρ varies from -1 to +1: a ρ of 0 signifies no autocorrelation, while a value close to -1 or +1 signifies strong autocorrelation.
Autocorrelation can also be detected by plotting the residuals against time. A jagged plot signifies no autocorrelation, whereas long runs of residuals with the same sign point to positive autocorrelation.
A more formal test of autocorrelation is the Durbin-Watson test, based on the model residuals $e_t$. It is calculated as follows:
$$d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
The Durbin-Watson statistic can be written approximately as d = 2(1 - r), where r is the sample estimate of the population correlation between adjacent errors.
If the errors are not autocorrelated, r ≈ 0 and d ≈ 2.
A positive correlation is shown as: 0 ≤ d < 2
A negative correlation is shown as: 2 < d ≤ 4
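The Durbin-Watson statistic is commonly computed as the sum of squared differences between adjacent residuals divided by the sum of squared residuals. The Python sketch below applies that definition to two synthetic residual series, one independent and one with positive autocorrelation. The value 0.7 and the function name `durbin_watson` are assumptions made for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)

# Synthetic residual series, for illustration only.
n = 1000
u = rng.normal(size=n)

independent = u                      # no autocorrelation
positive = np.zeros(n)               # e_t = 0.7 * e_{t-1} + u_t
for t in range(1, n):
    positive[t] = 0.7 * positive[t - 1] + u[t]

d_independent = durbin_watson(independent)   # expected to be near 2
d_positive = durbin_watson(positive)         # expected to be well below 2
r_positive = 1.0 - d_positive / 2.0          # rearranged from d = 2(1 - r)
```

Rearranging d = 2(1 - r) gives r = 1 - d/2, which is how an estimate of the correlation between adjacent errors can be recovered from d.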
When there are autocorrelated errors, the regression procedure needs to be modified to remove their effects. Estimating the coefficients of such a model follows these steps:
- Estimate the model using least squares and obtain the Durbin-Watson d-statistic. The autocorrelation parameter can then be estimated as r = 1 - d/2.
- Use least squares to estimate a second regression with:
Dependent variable: $y_t \rightarrow (y_t - r\,y_{t-1})$
Independent variable: $x_{1t} \rightarrow (x_{1t} - r\,x_{1,t-1})$
- Divide the estimated intercept from this second model by (1 - r) to get the correct estimated intercept for the original model.
- Use the output from the second model to carry out hypothesis tests and construct confidence intervals.
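The steps above can be sketched in Python for a model with one independent variable. The data are synthetic, with first-order autocorrelated errors, and the coefficients, sample size, and the helper name `ols` are assumptions made for illustration rather than part of the original procedure.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data, for illustration only:
#   y_t = 3.0 + 1.5 * x_t + eps_t,  eps_t = 0.6 * eps_{t-1} + u_t
n = 2000
x = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.6 * eps[t - 1] + u[t]
y = 3.0 + 1.5 * x + eps

def ols(x, y):
    """Simple regression with an intercept; returns (b0, b1, residuals)."""
    X = np.column_stack([np.ones(len(y)), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b[0], b[1], y - X @ b

# Step 1: ordinary least squares, Durbin-Watson d, then r = 1 - d / 2.
_, _, e = ols(x, y)
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
r = 1.0 - d / 2.0

# Step 2: regress (y_t - r * y_{t-1}) on (x_t - r * x_{t-1}).
y_star = y[1:] - r * y[:-1]
x_star = x[1:] - r * x[:-1]
b0_star, b1, e_star = ols(x_star, y_star)

# Step 3: rescale the intercept back to the original model.
b0 = b0_star / (1.0 - r)

# The residuals of the transformed model should give d close to 2.
d_star = np.sum(np.diff(e_star) ** 2) / np.sum(e_star ** 2)
```

The transformation costs one observation, because the first period has no lagged value.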
An even more severe problem arises when a model has both lagged dependent variables and autocorrelated errors. Here the model also needs to be modified, using a variation on the procedure explained above.
Summary of Statistics for Business and Economics
Summary for the course Statistics for Business and Economics at the Rijksuniversiteit Groningen. Chapters 12, 13, 15, 16, and 17.