What is logistic regression? – Chapter 15

15.1 What are the basics of logistic regression?
15.2 What does multiple logistic regression look like?
15.3 How does inference with logistic regression models work?
15.4 How is logistic regression performed for ordinal variables?
15.5 What do logistic models with nominal responses look like?
15.6 How do loglinear models describe the associations between categorical variables?
15.7 How do goodness-of-fit tests work for contingency tables?

15.1 What are the basics of logistic regression?

A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). Logistic regression models can also have ordinal or nominal response variables. When the two outcomes are coded as 1 and 0, the mean of the response is the proportion of responses that are 1, so the model describes the probability P(y=1). The linear probability model is P(y=1) = α + βx. This model is often too simple, because a straight line can produce probabilities below 0 or above 1. A more suitable version is:

$\log \left[\frac{P(y=1)}{1-P(y=1)} \right]=\alpha + \beta x$

The logarithm here is the natural logarithm, which can be calculated using software. The odds are P(y=1)/[1-P(y=1)]. The log of the odds is called the logistic transformation (abbreviated as logit), so the logistic regression model can be written as logit[P(y=1)] = α + βx.

To find the outcome for a certain value of a predictor, the following formula is used:

$P(y=1) =\frac{e^{\alpha+\beta{x}}}{1+e^{\alpha+\beta{x}}}$

Here e raised to a certain power is the antilog of that number, that is, the inverse of the natural logarithm.
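As a minimal sketch of this formula, the Python snippet below converts a value of the predictor into P(y=1) and back to the logit. The values of α and β and the function names are made up for illustration; they do not come from any data set.

```python
import math

def logistic_probability(alpha: float, beta: float, x: float) -> float:
    """Return P(y=1) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    antilog = math.exp(alpha + beta * x)
    return antilog / (1.0 + antilog)

def logit(p: float) -> float:
    """Return the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = -3.0, 0.5

for x in (0, 6, 12):
    p = logistic_probability(alpha, beta, x)
    # The logit of p recovers the linear predictor alpha + beta*x.
    print(x, round(p, 3), round(logit(p), 3))
```

With these assumed values the probability equals ½ at x = 6, because α + βx is 0 there.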

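A straight line drawn tangent to the logistic curve describes the rate of change at that point. The curve is steepest where P(y=1) = ½, where the slope of the tangent line is β/4. The sign of β determines whether the curve rises or falls, and a larger absolute value of β gives a steeper curve. Logistic regression uses the maximum likelihood method instead of the least squares method. The model expressed in odds is: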

$\frac{P(y=1)}{1-P(y=1)} = e^{\alpha+\beta{x}}=e^{\alpha}(e^\beta)^x$

The estimate is:

$\frac{\hat{P}(y=1)}{1-\hat{P}(y=1)}$

From the estimated odds at two values of x, the odds ratio can be calculated. For each one-unit increase in x, the odds are multiplied by e^β.
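The odds form of the model can be checked numerically. The sketch below uses hypothetical values of α and β and shows that the ratio of the odds at two values of x that are one unit apart equals e^β.

```python
import math

def odds(alpha: float, beta: float, x: float) -> float:
    """Return the odds P(y=1) / [1 - P(y=1)] = e^alpha * (e^beta)^x."""
    return math.exp(alpha) * math.exp(beta) ** x

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = -3.0, 0.5

odds_at_4 = odds(alpha, beta, 4)
odds_at_5 = odds(alpha, beta, 5)

# Odds ratio for a one-unit increase in x.
odds_ratio = odds_at_5 / odds_at_4
print(round(odds_ratio, 4), round(math.exp(beta), 4))
```

The same ratio results for any pair of x values that differ by one unit, which is why e^β summarizes the effect of x on the odds.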

There are two ways to present the data. Ungrouped data have a separate row for every subject. Grouped data have one row for every count in a cell, for example just one row with the number of subjects that agreed, followed by the total number of subjects.

An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y* such that y = 1 when y* lies above a certain threshold T and y = 0 when y* lies below T. Because y* is hidden, it's called a latent variable. Assuming that y* has a normal distribution leads to the probit model: probit[P(y=1)] = α + βx.
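A short sketch of the probit model, assuming the probit link is the inverse of the standard normal cumulative distribution function, so that P(y=1) is that function evaluated at α + βx. The parameter values and function names are hypothetical.

```python
import math

def standard_normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(alpha: float, beta: float, x: float) -> float:
    """Return P(y=1) when probit[P(y=1)] = alpha + beta*x."""
    return standard_normal_cdf(alpha + beta * x)

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = -1.5, 0.25

for x in (0, 6, 12):
    print(x, round(probit_probability(alpha, beta, x), 3))
```

Like the logistic curve, this curve is S-shaped and passes through ½ where α + βx = 0.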

Logistic regression with repeated measures and random effects is analyzed with a generalized linear mixed model: logit[P(y_ij = 1)] = α + βx_ij + s_i, in which s_i is the random effect for subject i.

15.2 What does multiple logistic regression look like?

The multiple logistic regression model is: logit[P(y = 1)] = α + β₁x₁ + … + β_px_p. The further β_i is from 0, the stronger the effect of x_i is and the further the odds ratio is from 1. If needed, cross-product terms and dummy variables can be added.

Research results are often expressed in terms of odds instead of the log odds scale, because the odds are easier to interpret. Taking the antilog of the model gives the odds, on which the effects are multiplicative: a one-unit increase in x_i multiplies the odds by e^(β_i) when the other variables are held constant. To present the results even more clearly, they can be expressed as probabilities, for instance the probability of a certain outcome at a given value of one predictor while controlling for the other variables. The estimated probability is:

$\hat{P}(y=1)=\frac{Odds}{1+Odds}$
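The step from log odds to odds to probability can be sketched as follows for a model with several predictors. The coefficients and predictor values are invented for illustration.

```python
import math

def estimated_probability(alpha: float, betas: list, xs: list) -> float:
    """Return odds / (1 + odds), with odds = exp(alpha + sum of beta_i * x_i)."""
    log_odds = alpha + sum(b * x for b, x in zip(betas, xs))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# Hypothetical estimates for a model with two predictors.
alpha_hat = -2.0
beta_hats = [0.8, -0.3]

# Vary x1 while holding x2 fixed at 1.0 to show the effect of x1 on the probability scale.
for x1 in (0.0, 1.0, 2.0, 3.0):
    print(x1, round(estimated_probability(alpha_hat, beta_hats, [x1, 1.0]), 3))
```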

The standardized estimate makes it possible to compare the effects of explanatory variables that use different units of measurement:

$\hat{\beta}_j^*=\hat{\beta}_js_{xj}$

The s_xj is the standard deviation of the variable x_j.
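A small sketch of this calculation, using invented estimates and invented predictor values. It assumes the sample standard deviation (with n – 1 in the denominator) is the one intended.

```python
import statistics

def standardized_estimate(beta_hat: float, x_values: list) -> float:
    """Multiply the estimate by the sample standard deviation of its predictor."""
    return beta_hat * statistics.stdev(x_values)

# Hypothetical data: income in thousands and age in years, with made-up estimates.
income = [20, 30, 40, 50, 60]
age = [25, 30, 35, 40, 45]

print(round(standardized_estimate(0.02, income), 3))
print(round(standardized_estimate(0.05, age), 3))
```

After standardizing, both estimates describe the effect of a one standard deviation increase, so their sizes can be compared directly.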

To help prevent selection bias in observational studies, the propensity is used: the probability that a subject ends up in a certain group. By adjusting for the propensity, researchers can control for the kind of people that find themselves in a certain situation. However, this only handles observed confounding variables. Variables unknown to the researchers remain hidden.

15.3 How does inference with logistic regression models work?

A logistic regression model assumes a binomial distribution for the response and has this form: logit[P(y = 1)] = α + β₁x₁ + … + β_px_p. The general null hypothesis is H₀ : β₁ = … = β_p = 0 and is tested by the likelihood-ratio test. This inferential test compares a complete model to a reduced model. The likelihood function (ℓ) is the probability of the observed data, viewed as a function of the parameter values. Here ℓ₀ is the maximized likelihood function when the null hypothesis is true and ℓ₁ is the maximized likelihood function for the complete model. The test statistic is: -2 log (ℓ₀ /ℓ₁ ) = (-2 log ℓ₀ ) – (-2 log ℓ₁ ). Under the null hypothesis it has approximately a chi-squared distribution, with degrees of freedom equal to the number of parameters that the null hypothesis sets to 0.
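The likelihood-ratio statistic can be illustrated with a deliberately simple case: one binary predictor and invented grouped counts. In that special case the maximum likelihood fit of the complete model reproduces the two group proportions, so no iterative fitting is needed. The P-value line assumes one degree of freedom, for which the chi-squared tail probability can be written with the complementary error function.

```python
import math

def binomial_log_likelihood(successes: int, trials: int, p: float) -> float:
    """Log-likelihood at probability p (the constant binomial coefficient is omitted)."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

# Hypothetical grouped data: (number with y = 1, number of subjects) for x = 0 and x = 1.
groups = [(20, 50), (35, 50)]

# Reduced model (H0: beta = 0): one common probability for both groups.
total_successes = sum(s for s, n in groups)
total_trials = sum(n for s, n in groups)
p_null = total_successes / total_trials
log_l0 = sum(binomial_log_likelihood(s, n, p_null) for s, n in groups)

# Complete model: the fitted probabilities equal the observed group proportions.
log_l1 = sum(binomial_log_likelihood(s, n, s / n) for s, n in groups)

# Test statistic: (-2 log l0) - (-2 log l1).
lr_statistic = (-2.0 * log_l0) - (-2.0 * log_l1)

# Right-tail probability of a chi-squared distribution with df = 1.
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))
print(round(lr_statistic, 2), round(p_value, 4))
```

The omitted constant is the same in both models, so it cancels in the difference.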

Alternative test statistics are z and z squared (called the Wald statistic):

$z = \hat{\beta }_i/se$

But for small samples or extreme effects the likelihood-ratio test works better.
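A minimal sketch of the z and Wald statistics, using an invented estimate and standard error. The two-sided P-value assumes the standard normal approximation for z.

```python
import math

def wald_test(beta_hat: float, se: float):
    """Return z, the Wald statistic z squared, and the two-sided P-value."""
    z = beta_hat / se
    wald = z ** 2
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, wald, p_value

# Hypothetical estimate and standard error.
z, wald, p_value = wald_test(beta_hat=0.5, se=0.25)
print(z, wald, round(p_value, 4))
```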

15.4 How is logistic regression performed for ordinal variables?

Ordinal variables have categories with a natural order. The cumulative probability is the probability that a response falls in a certain category j or below: P(y ≤ j). Each cumulative probability can be transformed to odds, for instance the odds that a response falls in category j or below: P(y ≤ j) / P(y > j).

Cumulative logits are popular. Each one splits the responses into a binary scale (category j or below versus above category j): logit[P(y ≤ j)] = α_j – βx, in which j = 1, 2, …, c – 1 and c is the number of categories. Beware, some software puts + instead of – in front of the slope.

A proportional odds model is a cumulative logit model in which the slope is the same for every cumulative probability, so β doesn't vary. The slope indicates the steepness of the graph, so in a proportional odds model the lines of the different categories are equally steep.
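The proportional odds structure can be sketched as follows, using the parameterization logit[P(y ≤ j)] = α_j – βx with invented intercepts and slope. The intercepts must increase with j so that the cumulative probabilities increase as well.

```python
import math

def cumulative_probabilities(alphas: list, beta: float, x: float) -> list:
    """Return P(y <= j) for j = 1, ..., c - 1 under logit[P(y <= j)] = alpha_j - beta*x."""
    return [1.0 / (1.0 + math.exp(-(a - beta * x))) for a in alphas]

def category_probabilities(alphas: list, beta: float, x: float) -> list:
    """Return P(y = j) for j = 1, ..., c as differences of cumulative probabilities."""
    cumulative = cumulative_probabilities(alphas, beta, x) + [1.0]
    first = [cumulative[0]]
    rest = [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return first + rest

# Hypothetical parameters for a response with c = 4 ordered categories.
alphas = [-1.0, 0.5, 2.0]
beta = 0.7

for x in (0.0, 1.0, 2.0):
    print(x, [round(p, 3) for p in category_probabilities(alphas, beta, x)])
```

Because β is shared, a one-unit increase in x changes every cumulative logit by the same amount, –β, which is the sense in which the odds are proportional.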

Cumulative logit models can have multiple explanatory variables. The null hypothesis H₀ : β = 0 states that the response is independent of the explanatory variable. An independence test for logistic regression with ordinal variables tends to give clearer evidence (a smaller P-value) than tests that ignore the order in the data, like the chi-squared test. A confidence interval is also an option.

An advantage of the cumulative logit model is invariance towards the scale of responses. If a researcher uses a different number of categories, he/she will still reach the same conclusions.

15.5 What do logistic models with nominal responses look like?

For nominal variables (without order) a model exists that specifies the probabilities that a certain outcome happens instead of another outcome. This model calculates these probabilities simultaneously and it assumes independent observations. This is the baseline-category logit model:

$\log \left[\frac{P(y=j)}{P(y=c)} \right ]=\alpha_j+\beta_{j}x$

It doesn't matter which category is used as the baseline. Inference works similarly to logistic regression, but to test the effect of an explanatory variable, all parameters of the comparisons are involved. The likelihood-ratio test examines whether the model fits the data better with or without a certain explanatory variable.
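As a sketch of how the baseline-category logits determine all category probabilities at once, the function below takes invented parameters for categories 1 to c – 1 and treats category c as the baseline, whose logit against itself is 0.

```python
import math

def baseline_category_probabilities(alphas: list, betas: list, x: float) -> list:
    """Return P(y = 1), ..., P(y = c), with category c as the baseline."""
    # exp(alpha_j + beta_j * x) equals P(y = j) / P(y = c) for j = 1, ..., c - 1.
    ratios = [math.exp(a + b * x) for a, b in zip(alphas, betas)]
    denominator = 1.0 + sum(ratios)
    return [r / denominator for r in ratios] + [1.0 / denominator]

# Hypothetical parameters for a response with c = 3 categories.
alphas = [0.2, -0.5]
betas = [0.4, 1.1]

probs = baseline_category_probabilities(alphas, betas, x=1.0)
print([round(p, 3) for p in probs])

# The log of P(y = 1) / P(y = c) recovers alpha_1 + beta_1 * x.
print(round(math.log(probs[0] / probs[-1]), 3))
```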

15.6 How do loglinear models describe the associations between categorical variables?

Most models study the effect of an explanatory variable on a response variable. Loglinear models are different: they study the associations between (categorical) variables, for instance in a contingency table. In this sense these models are more like correlation analysis than regression.

A loglinear model treats the cell counts as following a Poisson distribution, a distribution for non-negative discrete variables such as counts. Given the total sample size, the cell counts then follow a multinomial distribution.

A contingency table can show multiple categorical response variables. A conditional association is an association between two variables, while a third variable is controlled for. When two variables are conditionally independent, they are independent within each category of the third variable. The hierarchy of dependence is as follows (accompanied by symbols for the response variables x, y and z):

All three pairs are conditionally independent (x, y, z)
Two pairs are conditionally independent (xy, z)
One pair is conditionally independent (xy, yz)
There is no conditional independence, but there is a homogeneous association, meaning the association for each possible pair is the same in each category of the third variable (xy, yz, xz)
All pairs are associated and there is interaction; this is the saturated model (xyz)

Loglinear models can also be interpreted using odds ratios.
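To illustrate the odds ratio interpretation, the sketch below computes the conditional odds ratio between x and y within each category of z, for an invented 2×2×2 table. The counts were chosen so that the two conditional odds ratios are equal, which is the pattern that homogeneous association describes.

```python
def odds_ratio(table: list) -> float:
    """Return the odds ratio (a*d) / (b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Hypothetical partial tables of x (rows) by y (columns), one per category of z.
partial_tables = {
    "z = 1": [[20, 10], [10, 20]],
    "z = 2": [[40, 5], [20, 10]],
}

conditional_odds_ratios = {z: odds_ratio(t) for z, t in partial_tables.items()}
print(conditional_odds_ratios)
```

If the conditional odds ratios differed clearly between the categories of z, that would point toward interaction instead.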

15.7 How do goodness-of-fit tests work for contingency tables?

A goodness-of-fit test investigates the null hypothesis that a model really fits a certain population. It measures whether the expected frequencies f_e, estimated under the model, are close to the observed frequencies f_o. Larger test statistics are stronger evidence that the model is incorrect. This is measured by the Pearson chi-squared test:

$X^2 =\sum \frac{(f_o-f_e)^2}{f_e}$

Another version is the likelihood ratio chi-squared test:

$G^2=2\sum f_o \log\left(\frac{f_o}{f_e} \right )$

When the model fits the data perfectly, both X² and G² are 0. The likelihood-ratio test works well for large samples. The Pearson test is better for frequencies that average between 1 and 10. Both tests only work well for contingency tables with categorical predictors and relatively big counts.
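Both statistics can be computed directly from the observed and expected frequencies. The sketch below uses invented frequencies; cells with an observed count of 0 are skipped in G², on the usual convention that their contribution is 0.

```python
import math

def pearson_chi_squared(observed: list, expected: list) -> float:
    """Return X^2 = sum of (f_o - f_e)^2 / f_e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_chi_squared(observed: list, expected: list) -> float:
    """Return G^2 = 2 * sum of f_o * log(f_o / f_e)."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical observed and expected frequencies for four cells.
f_o = [30, 20, 20, 30]
f_e = [25.0, 25.0, 25.0, 25.0]

x2 = pearson_chi_squared(f_o, f_e)
g2 = likelihood_ratio_chi_squared(f_o, f_e)
print(round(x2, 3), round(g2, 3))
```

The two values are close but not identical, and both are 0 when every observed frequency equals its expected frequency.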

To see what exactly doesn't fit, the standardized residuals can be calculated per cell: (f_o – f_e) / (standard error of (f_o – f_e)). When a standardized residual exceeds about 3 in absolute value, the model doesn't fit the data for that cell.
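The standard error depends on the model being tested, and software normally reports it. As a sketch for one specific case, the code below assumes the model is independence in a two-way table, where the standard error is the square root of f_e × (1 – row proportion) × (1 – column proportion). The table is invented.

```python
import math

def standardized_residuals(table: list) -> list:
    """Standardized residuals for the independence model in a two-way table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, f_o in enumerate(row):
            # Expected frequency under independence.
            f_e = row_totals[i] * column_totals[j] / n
            se = math.sqrt(f_e * (1 - row_totals[i] / n) * (1 - column_totals[j] / n))
            residual_row.append((f_o - f_e) / se)
        residuals.append(residual_row)
    return residuals

# Hypothetical 2x2 table of observed frequencies.
residuals = standardized_residuals([[30, 20], [20, 30]])
print([[round(r, 2) for r in row] for row in residuals])
```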

Goodness-of-fit tests and standardized residuals can also be applied to loglinear models.

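To see if a complete or a reduced model fits better, their likelihood-ratio statistics can be compared: the difference between the G² values of the reduced and the complete model is itself a likelihood-ratio test statistic.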

This content is related to:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Follow the author: Annemarie JoHo
