Applying logistic regression


Logistic regression

This page is about logistic regression with a categorical dependent variable and quantitative or dichotomous independent variables.

In a normal logistic regression, there is always a dependent variable (Y) and a set of independent variables (X’s) that can be dichotomous, quantitative or a combination of both. The dependent variable can be dichotomous (binary logistic regression) or categorical with multiple categories, which is called polytomous or multinomial logistic regression.

Binary logistic regression is a technique with which a regression analysis is conducted for a dichotomous dependent variable. It provides a model for the chance that an event occurs, dependent on the values of the independent variables. An example is predicting the response to a treatment for cancer, where the participants can ‘survive’ or ‘not survive’. The independent variables can be both categorical and continuous.

Assumptions of logistic regression

  • No multicollinearity (multicollinearity occurs when two or more predictor variables correlate strongly with each other).

  • No errors in the specification: all relevant predictor variables are included and all irrelevant predictor variables are excluded.

  • The independent variables have to be measured on a numeric scale (interval or ratio) or be dichotomous.

  • The errors are independent of each other, so each observation is independent of other observations.

  • The dependent variable should be binary.

  • Large sample size, preferably 30 times the number of estimated parameters.

Coding binary variables

It is good practice to code the presence of a characteristic with 1 and the absence of that characteristic with 0. The group that is examined is labelled 1 (response group, comparison group, target group), the other group 0 (reference, baseline or control group). The aim of a logistic regression is to predict to which group each individual belongs. This is done by calculating the chance that the individual belongs to category 1. An advantage of this coding is that the mean of the dependent variable equals the proportion of ones in the distribution. The mean also equals the chance that a randomly drawn person from the sample is coded 1.

  • P = proportion of ones

  • Q = proportion of zeros (1 – P)

  • PQ = variance

  • √(PQ) = standard deviation
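
The relations between the 0/1 coding, the mean and the variance can be checked with a minimal Python sketch. The outcome codes in `y` are made-up illustration data, not values from this page.

```python
from statistics import mean, pvariance

# Hypothetical outcome codes: 1 = characteristic present, 0 = absent.
y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

p = mean(y)               # proportion of ones (equals the mean of y)
q = 1 - p                 # proportion of zeros
variance = p * q          # variance of a 0/1 variable
std_dev = variance ** 0.5

print(p, q, variance, std_dev)
# The population variance computed directly from the data equals P * Q.
print(pvariance(y))
```

The population variance is used here because P × Q describes the distribution of the codes themselves, without a sample correction.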

 

With multinomial logistic regression, there are more than two categories for the dependent variable. These are often coded as 1, 2, 3 and so on. The reference group should be identified and the other groups are used as target group in separate analyses.

Graphical display of logistic regression

The graphical display of a linear regression is a straight line, which assumes that the relation is constant: if X changes by a certain amount, Y changes by a fixed amount, whatever the value of X.

With logistic regression, the line is S-shaped. As a result, we can predict the chance of outcome 1 based on the value of the predictor. At the lowest and highest values of X, a change in X hardly changes the predicted chance. The differences appear in the middle: the steeper the slope, the larger the change in the predicted chance. A logistic regression is used when the relation is not constant. In those cases, a logistic regression has a high predictive value.

Example: Graph of a logistic regression curve showing probability of passing an exam versus hours studying

Logistic regression and odds

Logistic regression works with a natural log (ln) transformation of the odds. Below, you’ll find some core definitions:

  • Odds: for a dichotomous variable, the odds of group membership equal the probability of membership of that group divided by the probability of membership of the other group. Odds indicate how likely it is that an observation belongs to a certain group, compared to the other group.

  • Chances: the chance of belonging to one group divided by the chance of not belonging to that group = P / (1 – P), that is, the odds. It ranges from 0 to infinity.

  • Odds ratio: another important concept is the odds ratio, which estimates the factor by which the odds of membership of the target group change per one-unit increase of the predictor. The raw coefficient of the predictor variable is the natural logarithm of the odds ratio (the change in log odds), which is more difficult to interpret than the odds ratio. This raw coefficient does have a useful function: a positive raw coefficient implies that the predicted odds increase when the predictor value increases, and a negative raw coefficient implies that they decrease. For a raw coefficient of 0, the odds ratio is 1 (the odds are the same for each value of the predictor).
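
The definitions above can be illustrated with a short Python sketch. The function names `odds` and `log_odds` and the coefficient value `b = 0.5` are assumptions chosen for illustration only.

```python
import math

def odds(p):
    """Odds of group membership: P / (1 - P)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))

# A probability of 0.5 gives odds of 1 and log odds of 0.
print(odds(0.5), log_odds(0.5))

# Probabilities above 0.5 give odds above 1 and positive log odds.
print(odds(0.8), log_odds(0.8))

# A raw coefficient b corresponds to an odds ratio of e^b.
b = 0.5  # hypothetical raw coefficient
print(math.exp(b))
print(math.exp(0.0))  # a raw coefficient of 0 gives an odds ratio of 1
```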

We want to calculate the chance that an individual belongs to a certain group. To do so, the probability of the event is converted to odds (chances), and the natural log (ln) of the odds is taken. As a result of this transformation, an S-curve can be fitted to the data that predicts group membership as well as possible. The logistic regression equation with v independent variables is: ln[odds] = grouppred = a + b1X1 + b2X2 + … + bvXv, in which grouppred refers to the predicted group membership. The b coefficients give the change in log odds of membership for a one-unit change in an independent variable, controlling for the other predictors. The values of b (slope) and a (constant) are estimated with Maximum Likelihood Estimation (MLE), after expressing the dependent variable as a logit. The logit is a transformation that yields a linear function: the probabilities are converted to odds, and then to log odds, ln(p / (1 – p)), with p the chance of improvement and 1 – p the chance of no improvement. The log odds are positive for odds larger than one and negative for odds smaller than one.

X is the score on the predictor. This can be either 0 or 1 for dichotomous variables or a range of numeric values for quantitative variables. The likelihood that is maximized in MLE expresses how likely the observed values of the dependent variable are, given the observed values of the independent variables.
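
To give an idea of what Maximum Likelihood Estimation does, here is a minimal sketch for one predictor, assuming a Newton-Raphson search (one common way to maximize the likelihood; statistical packages may use a different routine). The function name `fit_logistic` and the hours/passed data are made up for illustration.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Estimate a and b in ln(odds) = a + b*x by Newton-Raphson steps."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log likelihood with respect to a and b.
        g_a = sum(yi - pi for yi, pi in zip(y, p))
        g_b = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian) with weights p(1 - p).
        w = [pi * (1 - pi) for pi in p]
        h_aa = sum(w)
        h_ab = sum(wi * xi for wi, xi in zip(w, x))
        h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system by hand.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(hours, passed)
fitted = [1 / (1 + math.exp(-(a + b * h))) for h in hours]
print(a, b)
```

At the maximum likelihood solution the fitted probabilities add up to the number of observed ones, which is a quick check that the search has converged.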

The logistic function can be described as P = e^n / (1 + e^n). The logistic function has a range of 0 to 1. If n is large and negative, the chance P is small. If n is large and positive, the chance P is large. If n = 0, then e^0 = 1, and the corresponding chance is 1 / (1 + 1) = 0.5.
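
The behaviour of the logistic function can be tried out with a small Python sketch. The function name `logistic` and the chosen values of n are illustrative assumptions.

```python
import math

def logistic(n):
    """P = e^n / (1 + e^n)."""
    return math.exp(n) / (1 + math.exp(n))

for n in (-10, -2, 0, 2, 10):
    print(n, logistic(n))
```

The printed chances stay between 0 and 1, are close to 0 for large negative n and close to 1 for large positive n.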

In the logistic function, n is replaced by a linear regression part: P1 = e^(a + b1x1 + b2x2 + …) / (1 + e^(a + b1x1 + b2x2 + …)). P1 is the chance of succeeding (success = 1), a is the constant (listed under B in the SPSS table), b1 and b2 are the corresponding regression coefficients, and x1 and x2 are the corresponding predictors. The outcome is interpreted with the following rule: if P1 is equal to or larger than 0.5, the code is 1; if P1 is smaller than 0.5, the code is 0. The odds ratio (chance ratio) can be calculated from e and the b coefficient: e^b = odds ratio.
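
As a sketch of how the prediction rule works in practice, the Python code below computes P1 and applies the 0.5 cut-off. The constant, the coefficients and the predictor scores are hypothetical values, not output from SPSS.

```python
import math

def predict_probability(a, coefs, xs):
    """P1 = e^(a + b1x1 + ...) / (1 + e^(a + b1x1 + ...))."""
    n = a + sum(b * x for b, x in zip(coefs, xs))
    return math.exp(n) / (1 + math.exp(n))

def classify(p, cutoff=0.5):
    """Code 1 if P1 is equal to or larger than the cut-off, else 0."""
    return 1 if p >= cutoff else 0

# Hypothetical constant and coefficients for two predictors.
a = -1.5
coefs = [0.8, 0.4]

p1 = predict_probability(a, coefs, [2.0, 1.0])
print(p1, classify(p1))

# Odds ratio of each predictor: e^b.
print([math.exp(b) for b in coefs])
```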

Evaluation of the logistic model

The Log Likelihood Test examines whether the set of independent variables predicts the dependent variable better than chance alone. The likelihood values are often very small, and therefore the natural log of the likelihood is presented in the output. The log likelihood is multiplied by -2, so that the significance can be tested with the χ2 test. This is the -2LL (-2 log likelihood). It is tested whether at least one predictor has a significant contribution, that is, a coefficient different from zero. The higher the -2LL, the less well the model fits the data. The null model (0-model, without predictors) always fits the data least well.

To compare models with each other, the model without predictors is compared to the model with one predictor. The difference between the -2LL values indicates the change in χ2 that is caused by adding a predictor. The difference can be tested with 1 df. In the Model Summary (SPSS), the -2LL shows you the strength of the relation. The -2LL is included in the formula of Hosmer and Lemeshow: RL2 = [(-2LL model 0) – (-2LL model x)] / (-2LL model 0). You always compare the current model, for example model 1 or model 2, with the null model. RL2 gives the proportional reduction in -2LL. For the null model, see the ‘Iteration history’ in SPSS.
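
The -2LL and the RL2 of Hosmer and Lemeshow can be computed by hand, as in the following Python sketch. The observed codes and the predicted chances are made-up numbers, and the function names are assumptions for illustration.

```python
import math

def minus_two_ll(y, p):
    """-2 times the log likelihood of 0/1 outcomes y given chances p."""
    ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
             for yi, pi in zip(y, p))
    return -2 * ll

def r_squared_l(m2ll_null, m2ll_model):
    """Proportional reduction in -2LL compared with the null model."""
    return (m2ll_null - m2ll_model) / m2ll_null

# Hypothetical observed codes and predicted chances from a fitted model.
y = [0, 0, 1, 1]
p_model = [0.2, 0.3, 0.7, 0.8]

# Null model: every case gets the overall proportion of ones.
p_null = [sum(y) / len(y)] * len(y)

m2ll_null = minus_two_ll(y, p_null)
m2ll_model = minus_two_ll(y, p_model)
chi_square_change = m2ll_null - m2ll_model  # tested with 1 df per added predictor

print(m2ll_null, m2ll_model, chi_square_change)
print(r_squared_l(m2ll_null, m2ll_model))
```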

Classification analysis

The percentage of accurately classified cases (PAC) is the number of correctly classified cases divided by the total number of classified cases. However, other measures of accuracy can also be used. Sensitivity is the percentage of the target group that is classified correctly. Specificity is the percentage of the other group that is classified correctly. The negative predictive value is the percentage of the cases allocated to the other group by the model that actually belong to that group. If you want to make a good prediction for both groups, the mean predictive value over classes is very useful. Finally, it is important to take the generalizability of the results into account, for example by using a cross-validation sample.
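
The accuracy measures above can be computed from the observed and predicted codes, as in this minimal Python sketch. The two lists are invented example data and the function name `classification_summary` is an assumption.

```python
def classification_summary(actual, predicted):
    """Accuracy measures for 0/1 codes, with 1 as the target group."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # target, classified as target
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # target, classified as other
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # other, classified as other
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # other, classified as target
    return {
        "pac": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "negative_predictive_value": tn / (tn + fn),
    }

# Hypothetical observed and predicted group codes.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

summary = classification_summary(actual, predicted)
print(summary)
```

The sketch returns proportions; multiply by 100 to report percentages.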

Stats for students: Simple steps for passing your statistics courses

Theory of statistics

  • The first years that you follow statistics, it is often a case of taking knowledge for granted and simply trying to pass the courses. Don't worry if you don't understand everything right away: in later years it will fall into place and you will see the importance of the theory you had to know before.
  • The book you need to study may be difficult to understand at first. Be patient: later in your studies, the effort you put in now will pay off.
  • Be a Gestalt Scientist! In other words, recognize that the whole of statistics is greater than the sum of its parts. It is very easy to get hung up on nit-picking details and fail to see the forest for the trees.
  • Tip: Precise use of language is important in research. Try to reproduce the theory verbatim (i.e. learn it by heart) where possible. That way, you don't have to understand it yet, you show that you've been working on it, you can't go wrong by using the wrong word and you practice for later reporting of research.
  • Tip: Keep study material, handouts, sheets, and other publications from your teacher for future reference.

Formulas of statistics

  • The direct relationship between data and results consists of mathematical formulas. These follow their own logic, are written in their own language and can therefore be complex to comprehend.
  • If you don't understand the math behind statistics, you don't understand statistics. This does not have to be a problem, because statistics is an applied science from which you can also get excellent results without understanding. None of your teachers will understand all the statistical formulas.
  • Please note: you will have to know and understand a number of formulas, so that you can demonstrate that you know the principle of how statistics work. Which formulas you need to know differs from subject to subject and lecturer to lecturer, but in general these are relatively simple formulas that occur frequently and your lecturer will tell you (often several times) that you should know this formula.
  • Tip: if you want to recognize statistical symbols you can use: Recognizing commonly used statistical symbols
  • Tip: have fun with LaTeX! LaTeX code gives us a simple way to write out mathematical formulas and make them look professional. Play with LaTeX. With that, you can include the formulas you use in your own papers and you learn to understand how a formula is built up, which greatly benefits your understanding and remembering of that formula. See also (in Dutch): How to create formulas like a pro on JoHo WorldSupporter?
  • Tip: Are you interested in a career in sciences or programming? Then take your formulas seriously and go through them again after your course.

Practice of statistics

Selecting data

  • Your teacher will regularly use a dataset for lessons during the first years of your studying. It is instructive (and can be a lot of fun) to set up your own research for once with real data that is also used by other researchers.
  • Tip: scientific articles often indicate which datasets have been used for the research. There is a good chance that those datasets are valid. Sometimes there are also studies that determine which datasets are more valid for the topic you want to study than others. Make use of datasets other researchers point out.
  • Tip: Do you want an interesting research result? You can use the same method and question, but use an alternative dataset, and/or alternative variables, and/or alternative location, and/or alternative time span. This allows you to validate or falsify the results of earlier research.
  • Tip: for datasets you can look at Discovering datasets for statistical research

Operationalize

  • For the operationalization, it is usually sufficient to indicate the following three things:
    • What is the concept you want to study?
    • Which variable does that concept represent?
    • Which indicators do you select for those variables?
  • It is smart to argue that a variable is valid, or why you choose that indicator.
  • For example, if you want to know whether someone is currently a father or mother (concept), you can search the variables for how many children the respondent has (variable) and then select on the indicators greater than 0, or is not 0 (indicators). Where possible, use the terms 'concept', 'variable', 'indicator' and 'valid' in your communication. For example, as follows: “The variable [variable name] is a valid measure of the concept [concept name] (if applicable: source). The value [description of the value] is an indicator of [what you want to measure].” (e.g.: The variable "Number of children" is a valid measure of the concept of parenthood. A value greater than 0 is an indicator of whether someone is currently a father or mother.)

Running analyses and drawing conclusions

  • The choice of your analyses depends, among other things, on what your research goal is, which methods are often used in the existing literature, and practical issues and limitations.
  • The more you learn, the more independently you can choose research methods that suit your research goal. In the beginning, follow the lecturer – at the end of your studies you will have a toolbox with which you can vary in your research yourself.
  • Try to link up as much as possible with research methods that are used in the existing literature, because otherwise you could be comparing apples with oranges. Deviating can sometimes lead to interesting results, but discuss this with your teacher first.
  • For as long as you need, keep a step-by-step plan at hand on how you can best run your analysis and achieve results. For every analysis you run, there is a step-by-step explanation of how to perform it; if you do not find it in your study literature, it can often be found quickly on the internet.
  • Tip: Practice a lot with statistics, so that you can show results quickly. You cannot learn statistics by just reading about it.
  • Tip: The measurement level of the variables you use (ratio, interval, ordinal, nominal) largely determines the research method you can use. Show your audience that you recognize this.
  • Tip: conclusions from statistical analyses will never be certain, but at the most likely. There is usually a standard formulation for each research method with which you can express the conclusions from that analysis and at the same time indicate that it is not certain. Use that standard wording when communicating about results from your analysis.
  • Tip: see explanation for various analyses: Introduction to statistics

Selected contributions for Understanding logistic regression

What is logistic regression? – Chapter 15

15.1 What are the basics of logistic regression?

A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). It's also possible for logistic regression models to have ordinal or nominal response variables. With the responses coded 0 and 1, the mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model is often too simple; a more extended version is:

log[P(y=1) / (1 – P(y=1))] = α + βx

The logarithm can be calculated using software. The odds are: P(y=1)/[1-P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit), gives the logistic regression model: logit[P(y=1)] = α + βx.

To find the probability for a certain value of a predictor, the following formula is used:

P(y=1) = e^(α + βx) / (1 + e^(α + βx))

The e to a certain power is the antilog of that number.

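To analyze the curve of a logistic graph, a straight line is drawn tangent to it. The slope of this line is β·P(y=1)·[1 – P(y=1)], which is steepest where P(y=1) = ½. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is:

P(y=1) / [1 – P(y=1)] = e^(α + βx) = e^α (e^β)^x

The estimate is obtained by substituting the estimated values of α and β. With this the odds ratio can be calculated: each one-unit increase in x multiplies the odds by e^β.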

There are two possibilities to present the data. For ungrouped data a normal contingency table suffices. For grouped data, a row contains the data for every count in a cell, like just one row with the number of subjects that agreed, followed by the total number of subjects.

An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y*: y is 1 when y* is above a certain value T (threshold) and 0 when y* is below T. Because y* is hidden, it's called a latent variable. It can be used to make a probit model: probit[P(y=1)] = α + βx.

Logistic regression with repeated measures and random effects is analyzed with a generalized linear mixed model: logit[P(yij = 1)] = α + βxij + si, in which si is the random effect for subject i.

15.2 What does multiple logistic regression look like?

The multiple logistic regression model is: logit[P(y = 1)] = α + β1x1 + … + βpxp. The further βi is from 0, the stronger …
Categorical outcomes: logistic regression - summary of (part of) chapter 20 of Statistics by A. Field

Discovering statistics using IBM SPSS statistics
Chapter 20
Categorical outcomes: logistic regression

This summary contains the information from section 20.8 onward; the rest of the chapter is not necessary for the course.


What is logistic regression?

Logistic regression is a model for predicting categorical outcomes from categorical and continuous predictors.

A binary logistic regression is when we’re trying to predict membership of only two categories.
A multinomial logistic regression is when we want to predict membership of more than two categories.

Theory of logistic regression

The linear model can be expressed as: Y_i = b0 + b1X_i + error_i

b0 is the value of the outcome when the predictors are zero (the intercept).
The bs quantify the relationship between each predictor and outcome.
X is the value of each predictor variable.

One of the assumptions of the linear model is that the relationship between the predictors and outcome is linear.
When the outcome variable is categorical, this assumption is violated.
One way to solve this problem is to transform the data using the logarithmic transformation, where you can express a non-linear relationship in a linear way.

In logistic regression, we predict the probability of Y occurring, P(Y), from known (log-transformed) values of X1 (or Xs).
The logistic regression model with one predictor is:
P(Y) = 1 / (1 + e^(-(b0 + b1X1i)))
The value of the model will lie between 0 and 1.

Testing assumptions

You need to test for

  • Linearity of the logit
    You need to check that each continuous predictor is linearly related to the log odds (logit) of the outcome variable. This is tested with the interaction between each predictor and its natural log.
    If this interaction is significant, it indicates that the main effect has violated the assumption of linearity of the logit.
  • Multicollinearity
    This has a biasing effect.

Predicting several categories: multinomial logistic regression

Multinomial logistic regression predicts membership of more than two categories.
The model breaks the outcome variable into a series of comparisons between two categories.
In practice, you have to set a baseline outcome category.

MVDA - logistic regression analysis

Week 4: Logistic Regression Analysis (LRA)

LRA can be used when the dependent variable (Y) is binary and the predictors (X1, X2) are of interval level (or binary).

The research question is: Can Y be predicted from X1 and/or X2?

  • Example: Can passing (1) or failing (0) the MVDA exam (Y) be predicted from the student’s grade on the psychometrics exam (X)?

Is there a significant association between grade and passing/failing the exam? (report test statistic, df, and p value)

Here, we look at the Wald statistic for grade in the Variables in the Equation table. If it’s significant, then yes, there is a significant association. An example of how this can be reported:

Yes, Wald χ2(1) = 7.090, p = .006

Write down the logistic regression equation

For example, if the constant B is -4.200 and the B for grade is .671, then the equation looks like this:

P(passing) = 1 / (1 + e^(-(-4.200 + 0.671 × Grade)))

For what grade is the probability of passing the MVDA exam equal to the probability of failing the MVDA exam?

Passing = 50%

Failing = 50%

P = 1/2 = 1 / (1 + e^(-(-4.200 + 0.671 × Grade)))

In order for e^(-(-4.200 + 0.671 × Grade)) to be 1, -4.200 + .671(Grade) has to be equal to 0. This is because e to the power of 0 is 1.

So, -4.200 + 0.671(g)=0

0.671(g)=4.200

g=6.259

Therefore, the grade where there is an equal chance for passing and failing is 6.259.
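
The same calculation can be done in a few lines of Python, using the example coefficients given above (constant -4.200, grade coefficient .671). The variable and function names are assumptions for illustration.

```python
import math

constant = -4.200   # constant B from the example
b_grade = 0.671     # B for grade from the example

def probability_of_passing(grade):
    """P = 1 / (1 + e^-(constant + b_grade * grade))."""
    return 1 / (1 + math.exp(-(constant + b_grade * grade)))

# P = 0.5 when constant + b_grade * grade = 0, so grade = -constant / b_grade.
grade_equal_chance = -constant / b_grade

print(round(grade_equal_chance, 3))
print(probability_of_passing(grade_equal_chance))
```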

Calculate the probabilities and odds of passing for X = 0, 5, 10

X       P         Odds (rounded)
0       0.0148    0.015
5       0.3005    0.430
10      0.9248    12.30
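
As a check on these values, the probabilities and odds can be recomputed from the example coefficients (constant -4.200, grade coefficient .671). This is a minimal sketch; the function names are assumptions.

```python
import math

constant = -4.200   # constant B from the example
b_grade = 0.671     # B for grade from the example

def probability(x):
    """Probability of passing for grade x."""
    return 1 / (1 + math.exp(-(constant + b_grade * x)))

def odds(x):
    """Odds of passing for grade x: P / (1 - P)."""
    p = probability(x)
    return p / (1 - p)

for x in (0, 5, 10):
    print(x, round(probability(x), 4), round(odds(x), 3))
```

Computing the odds from the unrounded probability avoids rounding errors, which become noticeable when P is close to 1.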

 

How to calculate the odds ratio?

Example:

X       P        Odds      Odds ratio
1       .0285    .02931
2       .0543    .0574     .0574 / .02931 = 1.958

Therefore, if X increases by 1 unit, the odds are multiplied by 1.958.
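
The odds ratio can also be obtained directly from the coefficient, as e to the power of B. The sketch below assumes the example coefficients used earlier (constant -4.200, grade coefficient .671) and compares both routes.

```python
import math

constant = -4.200   # constant B from the example
b_grade = 0.671     # B for grade from the example

def odds(x):
    """Odds of passing for grade x: e^(constant + b_grade * x)."""
    return math.exp(constant + b_grade * x)

# Route 1: divide the odds at X = 2 by the odds at X = 1.
ratio_from_odds = odds(2) / odds(1)

# Route 2: take e to the power of the coefficient.
ratio_from_b = math.exp(b_grade)

print(ratio_from_odds, ratio_from_b)
```

Both routes give about 1.956. The small difference from the 1.958 in the table comes from rounding the odds before dividing.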

 

What is the odds ratio of X of …