Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
- 2894 reads
Three basic rules for selecting variables to add to a model are:
Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables
Add enough variables for a good predictive power
Keep the model simple
The explanatory variables should be highly correlated to the response variable but not to each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. In backward elimination all possible variables are added, tested for their P-value and then only the significant variables are selected. Forward selection starts from scratch, adding variables with the lowest P-value. Another version of this is stepwise regression, this method removes redundant variables when new variables are added.
Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.
Several criteria are indications of a good model. To find a model with big power but without an overabundance of variables, the adjusted R2 is used:
The adjusted R2 decreases when an unnecessary variable is added.
Cross-validation continuously checks whether the predicted values are as close as possible to the observed values. The result is the predicted residual sum of squares (PRESS):
If PRESS decreases, the predictions get better. However, this test assumes a normal distribution. A method that can handle other distributions, is the Akaike information criterion (AIC), which selects the model in which ŷi is as close as possible to E(yi). If AIC decreases, the predictions get better.
Inference of parameters in a regression model has the following assumptions:
The model fits the shape of the data
The conditional distribution of y is normal
The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)
It's a random sample
Big violations of these assumptions have consequences.
When y has the normal distribution, the residuals do too. A studentized residual is a standardized version: the residual divided by the standard error. This indicates how much of the variabilities in the residuals is explained by the variability of the sampling. A studentized residual exceeding 3 may indicate an outlier.
The randomization in longitudinal research may be limited when the observations within a certain time frame are strongly correlated. A scatterplot of the residuals for the entire time frame can check this. This kind of correlation has a bad influence on most statistics. In longitudinal research, often conducted within social science and in a relatively short time frame, a linear mixed model is used. However, when research involves time series and a longer time frame, then econometric methods are more appropriate.
Lots of statistics measure the effects of outliers. The residuals measures how far y is from the trend. The leverage (h) measures how far the explanatory variables are from their means. When observations have a high residual and high leverage, they have a big influence.
DFBETA describes the effect of an observation on the estimates of the parameters. DFFIT and Cook's distance describe the effect on how a graph fits the data when a certain observation is omitted.
In case of lots of strongly correlated explanatory variables, R² increases only slightly when more variables are added. This is called multicollinearity. It causes the standard errors to increase. Due to the bigger confidence interval, the variance increases. This is measured by the variance inflation factor (VIF), the multiplied increase in variance that is caused by the correlation of the explanatory variables:
Also without the VIF indications of multicollinearity are visible in the equation. What helps against it, is choosing only some variables, converging variables or centering them. With factor analysis new, artificial variables are created from the existing variables, to avoid correlation. But usually factor analysis isn't necessary.
Generalized linear models (GLM) is a broad term that includes both regression models with a normal distribution, alternative models for continuous variables without a normal distribution and models with discrete variables.
The outcome of a GLM is often binary, sometimes counts. When the data is very discrete, the GLM uses the gamma distribution.
A GLM has a link function; an equation that connects the mean of the response variable to the explanatory variables: g(μ) = α + β1x1 + β2x2 + … + βpxp. When the data can't be negative, the log link is used for loglinear models: log(μ) = α + β1x1 + β2x2 + … + βpxp. A logistic regression model uses the logit link: g(μ) = log[μ /(1-μ)]. This is useful when μ is between 0 and 1. Most simple is the identity link: g(μ) = μ.
Because a GLM uses the maximum likelihood method, the data doesn't need to have a normal distribution. The maximum likelihood method uses weighted least squares, this method gives more weight to observations with a smaller variability.
A gamma distribution allows different shapes of the standard deviation. This is called heteroscedasticity; the standard deviation increases when the mean increases. Then the variance is ø μ2 and the standard deviation is:
ø is the scale parameter, it indicates the scale that creates the shape of a distribution (for instance a bell).
When a graph is very nonlinear, but for instance curvilinear, the polynomial regression function is used: E(y) = α + β1x + β2x2 in which the highest power is called the degree. A polynomial regression function can express a quadratic regression model, a parabola.
A cubic function is a polynomial function with three degrees, but usually a function with two degrees suffices. For a straight line the slope maintains the same shape, but in a polynomial function it changes. When the coefficient of x² is positive, the data will be shaped like an inverted U. When the coefficient is negative, the data will be shaped like an U. The highest or lowest point of the parabola is: x = – β1 / 2(β2).
In these kind of models R² is the proportional decrease of the error estimates by using a quadratic function instead of a linear function. A comparison of R² and r² indicates how much better of a fit the quadratic function is. The null hypothesis can be tested that a quadratic function doesn't add to the model: H0: β2 = 0.
Conclusions should be made carefully, sometimes other shapes are possible too. Parsimony should be the goal, models shouldn't have more parameters than necessary.
An exponential regression function is E(y) = α βx. It only has positive values and either increases or decreases endlessly. The logarithm of the mean is: log(μ) = log α + (log β)x. In this model, β is the multiplied change of y for an increase of 1 in x. When a graph needs to be transformed into a linear function, then log transforms can be used, they linearize the relationship.
Robust variance allows mending regression models so they can handle violations of assumptions. This method uses the least squares line but doesn't assume that the variance in finding standard errors is constant. Instead, the standard errors are adjusted to the variability of the sample data. This is called the sandwich estimate or robust standard error estimate. Software can sometimes calculate these standard errors, then they can be compared to the regular standard errors. If they differ a lot, the assumptions are badly violated. The robust variance can be applied to strongly correlated data like clusters. Then generalized estimating equations (GEE) are used; estimates of equations with the maximum likelihood but without the parametric probability distribution that usually goes along with correlations.
A recently developed nonparametric method is generalized additive modelling. This is a generalization of the generalized linear model. Smoothing a curve exposes larger trends. Popular smoothers are LOESS and kernel.
Join with a free account for more service, or become a member for full access to exclusives and extra support of WorldSupporter >>
Summary of Statistical methods for the social sciences by Agresti, 5th edition, 2018. Summary in English.
There are several ways to navigate the large amount of summaries, study notes en practice exams on JoHo WorldSupporter.
Do you want to share your summaries with JoHo WorldSupporter and its visitors?
Main summaries home pages:
Main study fields:
Business organization and economics, Communication & Marketing, Education & Pedagogic Sciences, International Relations and Politics, IT and Technology, Law & Administration, Medicine & Health Care, Nature & Environmental Sciences, Psychology and behavioral sciences, Science and academic Research, Society & Culture, Tourisme & Sports
Main study fields NL:
JoHo can really use your help! Check out the various student jobs here that match your studies, improve your competencies, strengthen your CV and contribute to a more tolerant world
1786 |
Add new contribution