Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). Logistic regression models can also have ordinal or nominal response variables. When the two outcomes are coded 1 and 0, the mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model is often too simple, because a straight line can give probabilities below 0 or above 1. A more extended version is: log[P(y=1)/(1-P(y=1))] = α + βx.
The logarithm can be calculated using software. The odds are: P(y=1)/[1-P(y=1)]. The log of the odds is called the logistic transformation (abbreviated as logit), so the logistic regression model can be written as: logit[P(y=1)] = α + βx.
To find the probability for a certain value of a predictor, the following formula is used: P(y=1) = e^(α + βx) / [1 + e^(α + βx)].
Raising e to a certain power gives the antilog of that number.
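As a minimal sketch, the conversion between the logit scale and the probability scale can be written in a few lines of Python. The function names and the parameter values α = -3.0 and β = 0.5 are hypothetical and chosen only for illustration.

```python
import math

def logistic_probability(alpha, beta, x):
    """P(y=1) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    linear = alpha + beta * x
    return math.exp(linear) / (1 + math.exp(linear))

def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# Hypothetical parameter values, not taken from the book.
alpha, beta = -3.0, 0.5

for x in (0, 6, 12):
    p = logistic_probability(alpha, beta, x)
    # Applying the logit to p gives back alpha + beta*x.
    print(x, round(p, 3), round(logit(p), 3))
```

Taking the logit of the computed probability returns the linear predictor α + βx, which shows that the two formulas describe the same model.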
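To analyze the S-shaped logistic curve, a straight line is drawn tangent to it. The slope of that line is βP(y=1)[1 - P(y=1)], so the curve is steepest where P(y=1) = ½, where the slope is β/4. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is: P(y=1)/[1 - P(y=1)] = e^(α + βx) = e^α (e^β)^x.
The estimated odds follow by replacing α and β with their maximum likelihood estimates.
With this the odds ratio can be calculated: each one-unit increase in x multiplies the odds by e^β.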
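The sketch below illustrates two standard properties of the logistic model. The odds at x + 1 divided by the odds at x equal e^β, and the tangent line to the curve has slope βP(1 - P), which is largest at P = ½. The parameter values are hypothetical.

```python
import math

def odds(alpha, beta, x):
    """Odds P(y=1) / (1 - P(y=1)) = e^(alpha + beta*x)."""
    return math.exp(alpha + beta * x)

def tangent_slope(alpha, beta, x):
    """Slope of the logistic curve at x: beta * P * (1 - P)."""
    o = odds(alpha, beta, x)
    p = o / (1 + o)
    return beta * p * (1 - p)

# Hypothetical parameter values for illustration only.
alpha, beta = -3.0, 0.5

# Odds ratio for a one-unit increase in x (here from 4 to 5).
odds_ratio = odds(alpha, beta, 5) / odds(alpha, beta, 4)
print(odds_ratio, math.exp(beta))

# At x = 6 the linear predictor is 0, so P = 1/2 and the slope is beta / 4.
print(tangent_slope(alpha, beta, 6))
```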
There are two ways to present the data. Ungrouped data have one row per subject, containing that subject's predictor values and binary response. Grouped data have one row per combination of predictor values, containing the count for that cell, for instance just one row with the number of subjects that agreed, followed by the total number of subjects.
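To make the two data layouts concrete, here is a small sketch that collapses ungrouped data (one row per subject) into grouped data (one row per predictor value, holding the number of 1-responses and the total). The data and the helper name `group_binary_data` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical ungrouped data: (predictor value, binary response) per subject.
ungrouped = [
    ("low", 1), ("low", 0), ("low", 0),
    ("high", 1), ("high", 1), ("high", 0),
]

def group_binary_data(rows):
    """Return {predictor value: (number of 1-responses, total subjects)}."""
    counts = defaultdict(lambda: [0, 0])
    for x, y in rows:
        counts[x][0] += y   # add 1 when the subject responded 1
        counts[x][1] += 1   # every subject counts toward the total
    return {x: tuple(c) for x, c in counts.items()}

grouped = group_binary_data(ungrouped)
for x, (successes, total) in grouped.items():
    print(x, successes, total)
```

Both layouts carry the same information for a model with categorical predictors, so the maximum likelihood estimates are the same.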
An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y*, with y = 1 when y* is above a certain threshold T and y = 0 when y* is below T. Because y* is hidden, it's called a latent variable. When y* is assumed to be normally distributed, this leads to the probit model: probit[P(y=1)] = α + βx, in which the probit is the inverse of the standard normal cumulative distribution function.
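A minimal sketch of the probit link, assuming the usual definition of the probit as the inverse of the standard normal cumulative distribution function. It uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The parameter values are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def probit(p):
    """Inverse standard normal cdf: the z-value with probability p below it."""
    return std_normal.inv_cdf(p)

def probit_probability(alpha, beta, x):
    """P(y=1) under the probit model probit[P(y=1)] = alpha + beta*x."""
    return std_normal.cdf(alpha + beta * x)

# Hypothetical parameter values for illustration only.
alpha, beta = -1.0, 0.5

for x in (0, 2, 4):
    p = probit_probability(alpha, beta, x)
    # Applying the probit to p gives back alpha + beta*x.
    print(x, round(p, 3), round(probit(p), 3))
```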
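Logistic regression with repeated measures and random effects is analyzed with a generalized linear mixed model: logit[P(yij = 1)] = α + βxij + si, in which si is the random effect for subject i and j indexes the repeated observations on that subject.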
The multiple logistic regression model is: logit[P(y = 1)] = α + β1x1 + … + βpxp. The further βi is from 0, the stronger the effect of xi is and the further the odds ratio is from 1. If needed, cross-product terms and dummy variables can be added.
Research results are often expressed in terms of odds instead of the log odds scale, because the odds are easier to interpret. The odds are a multiplicative function of the antilogs: odds = e^α × e^(β1x1) × … × e^(βpxp). To present the results even more clearly, they're expressed in probabilities, for instance the probability of a certain outcome at a given value of one variable while controlling for the other variables. The estimated probability is: P(y = 1) = e^(α + β1x1 + … + βpxp) / [1 + e^(α + β1x1 + … + βpxp)], with the parameters replaced by their estimates.
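For the multiple logistic regression model, the estimated probability and the multiplicative form of the odds can be sketched as follows. The intercept, slopes and predictor values are hypothetical numbers, not estimates from any real data set.

```python
import math

def estimated_odds(alpha, betas, xs):
    """Odds as a product of antilogs: e^alpha * e^(b1*x1) * ... * e^(bp*xp)."""
    result = math.exp(alpha)
    for b, x in zip(betas, xs):
        result *= math.exp(b * x)
    return result

def estimated_probability(alpha, betas, xs):
    """P(y=1) = e^(linear predictor) / (1 + e^(linear predictor))."""
    linear = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(linear) / (1 + math.exp(linear))

# Hypothetical estimates: intercept and two slopes.
alpha, betas = -1.0, [0.5, 1.0]

# Probability at x1 = 2 for both values of a dummy variable x2.
for x2 in (0, 1):
    xs = [2, x2]
    o = estimated_odds(alpha, betas, xs)
    print(x2, round(o, 3), round(estimated_probability(alpha, betas, xs), 3))
```

The probability equals odds / (1 + odds), so both functions describe the same fitted model on different scales.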
The standardized estimate makes it possible to compare the effects of explanatory variables that use different units of measurement: standardized estimate of βj = (estimate of βj) × sxj.
Here sxj is the standard deviation of the variable xj.
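A short sketch of the standardized estimate, assuming it is the estimated slope multiplied by the sample standard deviation of its explanatory variable. The slopes and data values below are hypothetical.

```python
from statistics import stdev

def standardized_estimate(beta_j, x_values):
    """Estimated slope times the sample standard deviation of x_j."""
    return beta_j * stdev(x_values)

# Hypothetical predictors measured in very different units.
income_thousands = [20, 35, 50, 65, 80]
years_education = [10, 12, 14, 16, 18]

# Hypothetical estimated slopes on the logit scale.
b_income, b_education = 0.02, 0.15

print(standardized_estimate(b_income, income_thousands))
print(standardized_estimate(b_education, years_education))
```

Each standardized estimate is the change in the logit for a one standard deviation increase in that variable, which puts the two effects on a comparable scale.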
To help prevent selection bias in observational studies, the propensity is used: the probability that a subject ends up in a certain group. By adjusting for the propensity, researchers can compare groups of similar kinds of people who find themselves in different situations. However, this only controls for observed confounding variables. Variables unknown to the researchers remain hidden.
A logistic regression model assumes the binomial distribution for the response and has the form: logit[P(y = 1)] = α + β1x1 + … + βpxp. The general null hypothesis is H0 : β1 = … = βp = 0 and is tested by the likelihood-ratio test. This inferential test compares a complete model to a reduced model. The likelihood function (ℓ) is the probability of the observed data, viewed as a function of the parameter values. Here ℓ0 is the maximized likelihood when the null hypothesis is true and ℓ1 is the maximized likelihood for the complete model. The test statistic is: -2 log (ℓ0 /ℓ1 ) = (-2 log ℓ0 ) – (-2 log ℓ1 ). For large samples it has approximately a chi-squared distribution, with degrees of freedom equal to the number of parameters in the null hypothesis.
Alternative test statistics for a single parameter are z = (estimate of β) / se and z squared (called the Wald statistic), in which se is the standard error of the estimate.
But for small samples or extreme effects the likelihood ratio test works better.
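The following sketch computes both test statistics for the simplest case: one binary explanatory variable and grouped data. It relies on two standard facts for this case. The fitted probabilities of the complete model equal the sample proportions, and the estimated slope is the log of the sample odds ratio. The counts are hypothetical.

```python
import math

def binomial_log_likelihood(successes, totals, probs):
    """Log-likelihood without binomial coefficients (they cancel in the ratio)."""
    return sum(s * math.log(p) + (n - s) * math.log(1 - p)
               for s, n, p in zip(successes, totals, probs))

# Hypothetical grouped data: 1-responses and totals for x = 0 and x = 1.
successes = [30, 50]
totals = [100, 100]

# Reduced model (H0: beta = 0): one common probability for both groups.
p0 = sum(successes) / sum(totals)
log_l0 = binomial_log_likelihood(successes, totals, [p0, p0])

# Complete model: fitted probabilities are the group sample proportions.
p1 = [s / n for s, n in zip(successes, totals)]
log_l1 = binomial_log_likelihood(successes, totals, p1)

# Likelihood-ratio statistic: (-2 log l0) - (-2 log l1).
lr_stat = (-2 * log_l0) - (-2 * log_l1)

# Wald statistic: estimated slope = log odds ratio, with its standard error.
failures = [n - s for s, n in zip(successes, totals)]
b = math.log((successes[1] / failures[1]) / (successes[0] / failures[0]))
se = math.sqrt(sum(1 / c for c in successes + failures))
z = b / se
wald_stat = z ** 2

# Upper-tail chi-squared probability with df = 1: P(chi2 > t) = erfc(sqrt(t / 2)).
p_lr = math.erfc(math.sqrt(lr_stat / 2))
p_wald = math.erfc(math.sqrt(wald_stat / 2))

print(lr_stat, p_lr)
print(wald_stat, p_wald)
```

With counts this large the two statistics are close to each other. They diverge more for small samples or very strong effects.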
Ordinal variables have a natural order in the categories. The cumulative probability is the probability that a response falls in a certain category j or below: P(y ≤ j). Each cumulative probability can be transformed to odds, for instance the odds that a response falls in category j or below: P(y ≤ j) / P(y > j).
Cumulative logits are popular. Each one divides the responses into a binary scale (category j or below versus above j): logit[P(y ≤ j)] = αj – βx, in which j = 1, 2, …, c – 1 and c is the number of categories. Beware, some software puts + instead of – in front of the slope.
A proportional odds model is a cumulative logit model in which the slope is the same for every cumulative probability, so β doesn't vary. The slope indicates the steepness of the graph, so in a proportional odds model the lines of the different categories are equally steep.
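As a sketch of how a proportional odds model yields probabilities for the separate categories, the code below computes each cumulative probability from logit[P(y ≤ j)] = αj – βx and takes differences between neighbors. The cutpoints and slope are hypothetical, and the cutpoints are assumed to be increasing.

```python
import math

def expit(t):
    """Inverse of the logit: e^t / (1 + e^t)."""
    return 1 / (1 + math.exp(-t))

def category_probabilities(alphas, beta, x):
    """Probabilities of categories 1..c from c-1 increasing cutpoints alphas."""
    # Cumulative probabilities P(y <= j); the last category has P(y <= c) = 1.
    cumulative = [expit(a - beta * x) for a in alphas] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

# Hypothetical parameters for a response with c = 3 categories.
alphas, beta = [-1.0, 1.0], 0.5

for x in (0.0, 2.0, 4.0):
    print(x, [round(p, 3) for p in category_probabilities(alphas, beta, x)])
```

With the minus sign in front of the slope, a positive β means that larger x values shift the response toward the higher categories.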
Cumulative logit models can have multiple explanatory variables. The test of H0 : β = 0 examines whether the response is independent of an explanatory variable. Because it uses the order in the data, this independence test is usually more powerful, and so tends to give a smaller P-value when there is an effect, than tests that ignore the order, like the chi-squared test. A confidence interval for β is also an option.
An advantage of the cumulative logit model is invariance to the choice of response categories. If a researcher uses a different number of categories, he/she will still reach the same conclusions.
For nominal variables (without order) a model exists that specifies the probabilities that a certain outcome happens instead of another outcome. This model calculates these probabilities simultaneously and it presumes independent observations. This is the baseline-category logit model: log[P(y = j) / P(y = c)] = αj + βjx, for j = 1, …, c – 1, in which category c is the baseline.
It doesn't matter which category is used as the baseline. Inference works similarly to logistic regression, but to test the effect of an explanatory variable, all parameters of the comparisons are involved. The likelihood ratio test examines if the model fits the data better with or without a certain variable.
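A minimal sketch of the baseline-category logit model, assuming the usual form in which the log odds of category j versus the baseline category c equal αj + βjx. Solving those equations for the probabilities gives the expression used below. All parameter values are hypothetical.

```python
import math

def baseline_category_probabilities(alphas, betas, x):
    """Probabilities of categories 1..c; the last category is the baseline."""
    # The baseline has linear predictor 0, so its antilog is e^0 = 1.
    antilogs = [math.exp(a + b * x) for a, b in zip(alphas, betas)] + [1.0]
    total = sum(antilogs)
    return [value / total for value in antilogs]

# Hypothetical parameters for c = 3 categories (two non-baseline equations).
alphas = [0.5, -0.2]
betas = [0.3, -0.4]

probs = baseline_category_probabilities(alphas, betas, x=1.0)
print([round(p, 3) for p in probs])

# The log odds of category 1 versus the baseline recover alpha_1 + beta_1 * x.
print(math.log(probs[0] / probs[-1]))
```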
Most models study the effect of an explanatory variable on a response variable. Loglinear models are different: they study the associations between (categorical) variables, for instance in a contingency table. In that sense these models are more like correlation analysis than regression.
A loglinear model assumes a Poisson distribution for the cell counts, a distribution for non-negative discrete variables such as counts. Given the total sample size, the cell counts then have a multinomial distribution.
A contingency table can show multiple categorical response variables. A conditional association is an association between two variables, while a third variable is controlled for. When two variables are conditionally independent, they are independent within each category of the third variable. A hierarchy of dependence is the following (accompanied by symbols for the response variables x, y and z):
All three are conditionally independent (x, y, z)
Two pairs are conditionally independent (xy, z)
One pair is conditionally independent (xy, yz)
There is no conditional independence, but there is a homogeneous association, meaning the association for each possible pair is the same for each category of the third variable (xy, yz, xz)
All pairs are associated and there is interaction, this is a saturated model (xyz)
Loglinear models can also be interpreted using the odds ratio.
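To illustrate the odds ratio interpretation, the sketch below computes the conditional odds ratio between x and y separately for each category of z. The counts are hypothetical and were chosen so that both partial tables have the same odds ratio, which is the pattern a homogeneous association model describes.

```python
def odds_ratio(table):
    """Odds ratio (a*d) / (b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Hypothetical partial tables: x by y counts within each category of z.
partial_tables = {
    "z = 1": [[20, 10], [10, 20]],
    "z = 2": [[40, 10], [20, 20]],
}

conditional_odds_ratios = {z: odds_ratio(t) for z, t in partial_tables.items()}
for z, value in conditional_odds_ratios.items():
    print(z, value)
```

An odds ratio of 1 in every partial table would correspond to conditional independence of x and y, while clearly different odds ratios across the categories of z would point to interaction.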
A goodness-of-fit test investigates the null hypothesis that a model really fits a certain population. It measures whether the estimated frequencies fe are close to the observed frequencies fo. Bigger test statistics are stronger evidence that the model is incorrect. This is measured by the Pearson chi-squared statistic: X² = Σ (fo – fe)² / fe.
Another version is the likelihood ratio chi-squared statistic: G² = 2 Σ fo log(fo / fe).
When the model fits the data perfectly, both X² and G² are 0. The likelihood ratio test is better in case of large samples. The Pearson test is better for frequencies that average between 1 and 10. Both tests only work well for contingency tables with categorical predictors and relatively big counts.
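Both goodness-of-fit statistics can be computed directly from the observed and estimated frequencies. This sketch uses the standard formulas, the sum of (fo - fe)² / fe for the Pearson statistic and 2 times the sum of fo log(fo / fe) for the likelihood ratio statistic. The frequencies are hypothetical.

```python
import math

def pearson_chi_squared(observed, expected):
    """X^2: sum of (fo - fe)^2 / fe over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

def likelihood_ratio_chi_squared(observed, expected):
    """G^2: 2 * sum of fo * log(fo / fe); cells with fo = 0 contribute 0."""
    return 2 * sum(fo * math.log(fo / fe)
                   for fo, fe in zip(observed, expected) if fo > 0)

# Hypothetical cell counts of a 2x2 table, listed row by row, with the
# estimated frequencies of the independence model.
observed = [30, 70, 50, 50]
expected = [40, 60, 40, 60]

x2 = pearson_chi_squared(observed, expected)
g2 = likelihood_ratio_chi_squared(observed, expected)
print(x2, g2)

# A perfect fit gives 0 for both statistics.
print(pearson_chi_squared(observed, observed),
      likelihood_ratio_chi_squared(observed, observed))
```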
To see what exactly doesn't fit, the standardized residuals can be calculated per cell: (fo – fe) / (standard error of (fo – fe)). When a standardized residual exceeds about 3 in absolute value, the model doesn't fit the data for that cell.
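The standard error in the residual depends on the model. As an illustration, the sketch below assumes the independence model for a two-way table, for which the standard error of (fo - fe) is the square root of fe × (1 - row proportion) × (1 - column proportion). The counts are hypothetical.

```python
import math

def standardized_residuals(table):
    """Standardized residuals per cell under independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, fo in enumerate(row):
            # Estimated frequency under independence.
            fe = row_totals[i] * col_totals[j] / n
            # Standard error of (fo - fe) for the independence model.
            se = math.sqrt(fe * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            residual_row.append((fo - fe) / se)
        residuals.append(residual_row)
    return residuals

# Hypothetical 2x2 table of observed counts.
table = [[30, 70], [50, 50]]

for row in standardized_residuals(table):
    print([round(r, 2) for r in row])
```

Cells with a large positive residual have more observations than the model predicts, and cells with a large negative residual have fewer.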
Goodness-of-fit tests and standardized residuals can also be applied to loglinear models.
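To see if a complete or a reduced model fits better, their likelihood ratio statistics can be compared: the difference between the G² values of the two models is itself a likelihood ratio test statistic.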