This page is about logistic regression with a categorical dependent variable and quantitative or dichotomous independent variables.
In logistic regression, there is always a dependent variable (Y) and a set of independent variables (X’s) that can be dichotomous, quantitative, or a combination of both. The dependent variable can be dichotomous (binary logistic regression) or categorical with more than two categories (polytomous or multinomial logistic regression).
Binary logistic regression is a technique for conducting a regression analysis with a dichotomous dependent variable. It models the probability that an event occurs, depending on the values of the independent variables. For example, when predicting the response to a cancer treatment, participants can either ‘survive’ or ‘not survive’. The independent variables can be both categorical and continuous.
Logistic regression rests on the following assumptions:
No multicollinearity (two or more predictor variables should not correlate strongly).
No errors in the specification. All irrelevant predictor variables are excluded.
The quantitative independent variables have to be measured on a numeric scale, either ratio or interval.
The errors are independent of each other, so each observation is independent of other observations.
The dependent variable should be binary.
Large sample size, preferably 30 times the number of estimated parameters.
It is good practice to code the presence of a characteristic with 1 and the absence of that characteristic with 0. The group that is examined is labelled 1 (the response, comparison or target group), the other group 0 (the reference, baseline or control group). The aim of a logistic regression is to predict to which group each individual belongs. This is obtained by calculating the probability that the individual belongs to category 1. An advantage of this coding is that the mean of the dependent variable equals the proportion of ones in the distribution. That mean is also equal to the probability that a randomly selected person from the sample is labelled 1.
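As a small illustration of this coding property, the sketch below uses made-up 0/1 data (the values are purely hypothetical) to show that the mean equals the proportion of ones:

```python
# Minimal sketch with invented 0/1-coded data: the mean of the variable
# equals the proportion of ones, i.e. the probability of drawing a 1.
y = [1, 0, 1, 1, 0, 1, 0, 1]           # hypothetical dependent variable, coded 0/1

mean_y = sum(y) / len(y)                # mean of the variable
proportion_ones = y.count(1) / len(y)   # proportion of cases coded 1

print(mean_y, proportion_ones)          # both print 0.625
```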
With multinomial logistic regression, there are more than two categories for the dependent variable. These are often coded as 1, 2, 3 and so on. The reference group should be identified, and each of the other groups is used as the target group in a separate analysis.
The graphical display of a linear regression is a straight line, which assumes that the rate of change is constant: if X changes by a certain amount, Y changes by a corresponding fixed amount, regardless of the value of X.
With logistic regression, the curve is S-shaped. As a result, we can predict the probability of outcome 1 based on the value of the predictor. At the lowest and highest values of X, changes in X make hardly any difference; the difference is found in the middle of the curve, and the steeper the slope, the larger that difference. A logistic regression is used when the relation is not constant. In those cases, a logistic regression has a high predictive value.
Example: Graph of a logistic regression curve showing probability of passing an exam versus hours studying
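To draw such a curve yourself, a minimal sketch is given below; it assumes hypothetical values for the intercept and slope, chosen only to make the S-shape visible:

```python
# Sketch of a logistic curve with hypothetical coefficients (a = -4.0, b = 1.5);
# these values are invented purely to draw the S-shape, not taken from real data.
import numpy as np
import matplotlib.pyplot as plt

a, b = -4.0, 1.5                               # hypothetical intercept and slope
hours = np.linspace(0, 6, 200)                 # hours of studying
p_pass = 1 / (1 + np.exp(-(a + b * hours)))    # logistic function

plt.plot(hours, p_pass)
plt.xlabel("Hours studying")
plt.ylabel("Probability of passing the exam")
plt.show()
```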
To be able to conduct a logistic regression, you first have to transform the data with the natural log transformation. Below, you’ll find some core definitions:
Odds: for a dichotomous variable, the odds of group membership equal the probability of membership of that group divided by the probability of membership of the other group: odds = P / (1 − P). The odds range from 0 to infinity and indicate how likely it is that an observation belongs to a certain group, compared to the other group.
Odds ratio: another important concept is the odds ratio, which estimates the change in the odds of membership of the target group per one-unit increase of the predictor. The raw coefficient of the predictor is the natural logarithm of the odds ratio, which is more difficult to interpret than the odds ratio itself. This raw coefficient does have a useful function: a positive raw coefficient implies that the predicted odds increase when the predictor value increases, and vice versa. For a raw coefficient of 0, the odds ratio is 1 (the odds are the same for each value of the predictor). A small numerical sketch follows below.
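The following sketch works through these definitions with invented probabilities, to show how the odds, the odds ratio and the raw coefficient relate to each other:

```python
import math

# Hypothetical probabilities of belonging to the target group at two
# neighbouring predictor values (x and x + 1); the numbers are invented.
p_at_x  = 0.40
p_at_x1 = 0.55

odds_at_x  = p_at_x  / (1 - p_at_x)     # odds = P / (1 - P)
odds_at_x1 = p_at_x1 / (1 - p_at_x1)

odds_ratio = odds_at_x1 / odds_at_x     # change in odds per one-unit increase
b = math.log(odds_ratio)                # raw coefficient = ln(odds ratio)

print(odds_at_x, odds_at_x1, odds_ratio, b)
```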
We want to calculate the probability that an individual belongs to a certain group. To do so, the probability of the event is transformed into odds, and the natural log (ln) of the odds is taken. As a result of this transformation, the data fit the S-curve, so that group membership can be predicted as well as possible. The logistic regression equation with v independent variables is: ln[odds] = grouppred = a + b1X1 + b2X2 + … + bvXv, in which grouppred refers to the predicted group membership. The b coefficients give the change in the log odds of membership for a change of one unit in the corresponding independent variable, controlled for the other predictors. The values of b (slopes) and a (constant) are calculated with Maximum Likelihood Estimation (MLE), which is possible after transforming the dependent variable into the logit. This is a method to transform the data so that a linear function is obtained. The scores are first transformed into odds and then into log odds [ln(p / (1 − p))], with p the probability of improvement and 1 − p the probability of no improvement. The log odds are positive for odds larger than one and negative for odds smaller than one.
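A minimal sketch of such a fit in Python is shown below; it assumes a small invented data set and the statsmodels library, and the variable names are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# Invented example data: hours of studying (predictor) and pass/fail (0/1 outcome).
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1  ])

X = sm.add_constant(hours)             # adds the constant a to the model
model = sm.Logit(passed, X).fit()      # coefficients are estimated with MLE

print(model.params)                    # a (constant) and b (slope) on the log-odds scale
print(np.exp(model.params))            # exponentiating b gives the odds ratio
```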
X is the score of the predictor. This can be either 0 or 1 for dichotomous variables, or a range of numeric values for quantitative variables. The likelihood indicates how likely it is that the observed values of the dependent variable can be predicted from the observed values of the independent variables.
The logistic function can be described as P = e^n / (1 + e^n). The logistic function has a range of 0 to 1. If n is large and negative, the probability P is small. If n is large and positive, the probability P is large. If n = 0, then e^0 = 1, and the corresponding probability is 1 / (1 + 1) = 0.5.
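A short sketch of this behaviour (the chosen values of n are arbitrary):

```python
import math

def logistic(n):
    # P = e^n / (1 + e^n), which always lies between 0 and 1
    return math.exp(n) / (1 + math.exp(n))

print(logistic(-5))   # large negative n -> probability close to 0
print(logistic(0))    # n = 0 -> e^0 = 1, so P = 1 / (1 + 1) = 0.5
print(logistic(5))    # large positive n -> probability close to 1
```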
In the logistic function, n is replaced by the linear regression part: P1 = e^(a + b1x1 + b2x2 + …) / (1 + e^(a + b1x1 + b2x2 + …)). P1 is the probability of success (success = 1), a is the constant under B (from the SPSS table), b1 and b2 are the corresponding regression coefficients, and x1 and x2 are the corresponding predictors. The outcome is interpreted with the following rule: if P1 is equal to or larger than 0.5, the predicted code is 1; if P1 is smaller than 0.5, the predicted code is 0. The odds ratio can be calculated from e and the b coefficient: e^b = odds ratio.
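The sketch below applies this rule with invented coefficients (a, b1, b2) and predictor scores; in practice these values would be read from the SPSS output:

```python
import math

# Hypothetical coefficients, as they would appear in the SPSS 'B' column.
a, b1, b2 = -2.0, 0.8, 1.2
x1, x2 = 1.5, 1.0                         # invented predictor scores for one person

linear_part = a + b1 * x1 + b2 * x2
p1 = math.exp(linear_part) / (1 + math.exp(linear_part))   # probability of success

predicted_code = 1 if p1 >= 0.5 else 0    # classify at the 0.5 cut-off
odds_ratio_b1 = math.exp(b1)              # e^b = odds ratio for predictor x1

print(p1, predicted_code, odds_ratio_b1)
```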
The Log Likelihood Test examines whether the set of independent variables predicts the dependent variable better than chance alone. The likelihood values are often very small and therefore the natural log is presented in the output. The log likelihood is multiplied by -2, so that the significance can be tested with the chi-square test. This is the -2LL (-2 log likelihood). It is tested whether at least one predictor has a contribution that differs significantly from zero. The higher the -2LL, the less well the model fits the data. The null model always fits the data least well.
To compare models with each other, the model without predictors is compared to the model with one parameter. The difference between the -2LL values indicates the change in chi-square caused by adding a predictor. This difference can be tested with 1 df. In the Model Summary (SPSS), the -2LL shows you the strength of the relation. The -2LL is included in the formula of Hosmer and Lemeshow: R²L = (-2LL_model0 − (-2LL_modelX)) / (-2LL_model0). You always compare the current model, for example model 1 or model 2, with the null model. R²L gives the proportional reduction in -2LL. For the null model, see the ‘Iteration history’ in SPSS.
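A small numerical sketch of this comparison, with invented -2LL values for the null model and a model with one added predictor; the chi-square test uses scipy, which is an assumption about the available tooling:

```python
from scipy.stats import chi2

# Invented -2LL values for the null model and for a model with one extra predictor.
minus2ll_model0 = 137.9
minus2ll_model1 = 121.4

chi_square = minus2ll_model0 - minus2ll_model1       # improvement in model fit
p_value = chi2.sf(chi_square, df=1)                  # tested with 1 df for one added predictor

r2_l = (minus2ll_model0 - minus2ll_model1) / minus2ll_model0   # Hosmer & Lemeshow R²L

print(chi_square, p_value, r2_l)
```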
The percentage of accurately classified cases (PAC) is the number of correctly classified cases divided by the total number of classified cases. However, other measures of accuracy can also be used. Sensitivity is the percentage of the target group that is classified correctly. Specificity refers to the percentage of the other group that is classified correctly. The negative predictive value is the percentage of cases allocated to the other group by the model that actually belong to that group. If you want a good prediction for both groups, the mean predictive value over the classes is very useful. Finally, it is important to take the generalizability of the results into account, for example by using a cross-validation sample.
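As an illustration, the sketch below computes these measures from an invented set of observed and predicted group codes (0/1); the data are hypothetical, and the mean predictive value is taken here as the mean of sensitivity and specificity, which is an assumption about how that summary is defined:

```python
# Invented observed (true) and predicted group membership codes.
observed  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
tn = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
fp = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
fn = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)

pac = (tp + tn) / len(observed)        # percentage of accurately classified cases
sensitivity = tp / (tp + fn)           # correctly classified target-group cases
specificity = tn / (tn + fp)           # correctly classified other-group cases
npv = tn / (tn + fn)                   # negative predictive value
mean_predictive_value = (sensitivity + specificity) / 2   # assumed: mean over the two classes

print(pac, sensitivity, specificity, npv, mean_predictive_value)
```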