Logistic Regression


Logistic regression is a statistical method used to examine the relationship between a binary outcome variable and one or more explanatory variables. It is a special case of a regression model that predicts the probability of data falling into one of two categories and is often used to calculate odds ratios. This article will cover the basic theory behind this method, the types of logistic regression, when the method is useful and a worked example.



What is logistic regression?

Logistic regression is a powerful statistical method used to model the probability that an outcome (dependent or response) variable falls into one of two categories, based on a set of explanatory (independent or predictor) variables.


Logistic regression can be thought of as an extension to, or a special case of, linear regression. If the outcome variable is a continuous variable, linear regression is more suitable. The key difference between the two is that logistic regression uses a statistical function (the logistic or sigmoid function) to transform the regression line so that it fits the binary outcome (an outcome variable that can take only two categories). In other words, it maps the predicted values to probabilities, which are then used to calculate the model coefficients.


When to use logistic regression

Binomial logistic regression, where the outcome is binary (e.g. death, yes/no), is often simply referred to as logistic regression and will be the focus of this article. For example, a team of medical researchers may want to predict the risk of a heart attack (yes/no) based on a dataset of observed explanatory variables such as age, sex, other medical diagnoses, weight and lifestyle characteristics. Logistic regression is also commonly used in other settings such as economic analysis, market analysis, finance and social sciences.


Multiple logistic regression or multivariable logistic regression

Logistic regression is preferable to a simpler statistical test, such as the chi-squared test or Fisher’s exact test, because it can incorporate more than one explanatory variable and can account for possible interplay between explanatory variables when predicting the outcome variable, adjusting for them in the model. This is done using multiple or multivariable logistic regression. As with multiple linear regression, the interpretation changes: each coefficient represents an estimate of the association between a particular explanatory variable of interest and the binary outcome variable, while holding all other explanatory variables in the model constant.
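
To make this concrete, here is a minimal sketch of a multivariable logistic regression in Python using statsmodels (the article does not specify software, so this tooling is an assumption); the variable names echo the heart-attack example above and the data are simulated purely for illustration.

```python
# A minimal sketch of multivariable logistic regression using statsmodels;
# variable names and simulated data are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(30, 80, n),
    "sex": rng.integers(0, 2, n),       # 0 = female, 1 = male
    "weight": rng.normal(80, 15, n),
})
# Simulate a binary outcome whose log odds depend on the covariates
log_odds = -10 + 0.1 * df["age"] + 0.5 * df["sex"] + 0.02 * df["weight"]
df["heart_attack"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(df[["age", "sex", "weight"]])  # adds the intercept term
result = sm.Logit(df["heart_attack"], X).fit(disp=0)
# Each coefficient estimates the association between one variable and the
# outcome (on the log odds scale), holding the other variables constant.
print(result.summary())
```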


Types of logistic regression

Some key types of logistic regression in the statistical literature are summarized below:

  • Binary logistic regression: also referred to as binomial or simply logistic regression, this is when the outcome variable has two categories (e.g. death, yes/no).
  • Multinomial logistic regression: this extends logistic regression to outcome variables made up of three or more unordered categories (e.g. gender, male/female/nonbinary).
  • Multivariable logistic regression: in contrast to simple logistic regression, where only one explanatory variable is included, this is when more than one explanatory variable is modeled, so the effect of a variable of interest can be estimated while adjusting for the other variables.
  • Ordinal logistic regression: this is when the outcome variable is made up of three or more ordered categories (e.g. difficulty level, easy/medium/hard). This is not to be confused with a count variable (e.g. number of accidents in a day, 0/1/2/etc.); if an outcome takes this scale, logistic regression is unsuitable and Poisson regression should be used.
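
As a rough sketch of how these variants map to common Python tooling (one plausible mapping, not the only one), the multinomial case can be fitted with statsmodels' MNLogit; the toy data here are randomly generated for illustration only.

```python
# Multinomial logistic regression with statsmodels' MNLogit on toy data.
# Binary outcomes use sm.Logit, ordered categories use statsmodels'
# OrderedModel with distr="logit", and counts call for sm.Poisson.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.integers(0, 3, 300)  # three unordered categories

# One set of coefficients is estimated per non-baseline category
result = sm.MNLogit(y, X).fit(disp=0)
print(result.params)
```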


Logistic regression formula

The logistic regression model can be represented with the following formula:

$$P(Y = 1 \mid X) = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}$$


Where the left side of the equation is the probability that the outcome variable Y is 1 given the explanatory variables X. The intercept is represented by α, β₁ and β₂ are the regression coefficients of the model, and x₁ and x₂ are the corresponding explanatory variables. This can be readily extended to include more than two explanatory variables. Finally, e is the base of the natural logarithm.


The formula overall represents the linear combination of the explanatory variables being transformed into a probability using the logistic (or sigmoid) function.
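
A short numerical sketch of this idea: a linear combination of explanatory variables is passed through the logistic (sigmoid) function to yield a probability. All coefficient and variable values below are made up for illustration.

```python
# Numerically checking the formula: the linear predictor alpha + b1*x1 + b2*x2
# passed through the logistic function gives a probability between 0 and 1.
import numpy as np

alpha, beta1, beta2 = -4.0, 0.05, 0.8  # illustrative intercept and coefficients
x1, x2 = 60.0, 1.0                     # illustrative explanatory values

linear_predictor = alpha + beta1 * x1 + beta2 * x2
p = np.exp(linear_predictor) / (1 + np.exp(linear_predictor))
print(p)  # always strictly between 0 and 1
```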


Logistic regression assumptions

Logistic regression relies on the following assumptions about the data:

  • Binary outcome: the dependent variable should take a binary or dichotomous scale, such as death (yes/no) or diagnosis of heart disease (yes/no).
  • Independence: the observations in the dataset should be unrelated to one another and collected as part of a random sample of a population.
  • Linearity of logit: there should be a linear relationship between any continuous explanatory variables and the logit transformation of the outcome variable.
  • No collinearity: explanatory variables in the model should not be highly correlated with each other.  
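
The collinearity assumption can be checked before fitting; one common approach (a sketch on simulated data, with the usual caveat that thresholds are rules of thumb) uses variance inflation factors from statsmodels.

```python
# Collinearity check using variance inflation factors (VIFs) on simulated
# data; a VIF well above ~5-10 is a common rule of thumb for problems.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.uniform(30, 80, 200),
    "weight": rng.normal(80, 15, 200),
})
X = sm.add_constant(df)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```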


What is a logit model?

The term logit function is closely related to logistic and refers to the inverse of the logistic function. The logit function transforms a probability (between 0 and 1) into a number that can range from −∞ to +∞. The practical reason for using the logit function is that it linearizes the relationship between the explanatory and outcome variables, allowing it to be analyzed using linear regression techniques; in this way, it also allows non-linear relationships to be modeled within a regression framework. Importantly, a logit model produces interpretable coefficients: each coefficient is the change in the log odds of the outcome for every unit increase in the explanatory variable, and its exponential is an odds ratio.
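
This inverse relationship is easy to verify numerically; scipy exposes both functions directly (expit is scipy's name for the logistic function).

```python
# The logit and logistic functions are inverses: logit maps a probability
# in (0, 1) to a log odds value on the whole real line, and expit (the
# logistic/sigmoid function) maps it back.
from scipy.special import expit, logit

p = 0.8
log_odds = logit(p)     # log(0.8 / 0.2) ~ 1.386
print(log_odds)
print(expit(log_odds))  # recovers 0.8
```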


Visualizing the logistic regression model

It can be useful to visualize the sigmoid function, the key characteristic of a logistic regression model (Figure 1). The purpose of the function is to transform any real number into a probability between 0 and 1; the output cannot go beyond these limits, which is why the function forms an “S”-curve.



Figure 1: Visualization of the sigmoid function.
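
A minimal matplotlib sketch along the following lines reproduces the S-curve shown in Figure 1 (axis labels are our own choice).

```python
# Plotting the sigmoid function to reproduce the S-curve of Figure 1.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
y = 1 / (1 + np.exp(-x))  # the sigmoid function

plt.plot(x, y)
plt.xlabel("Linear predictor")
plt.ylabel("Probability")
plt.title("Sigmoid function")
plt.show()
```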


Logistic regression vs linear regression

There are some key differences between logistic and linear regression in addition to the type of outcome variable analyzed, summarized in Table 1.


Table 1: Summary of some key differences between logistic and linear regression.

| Element | Logistic regression | Linear regression |
| --- | --- | --- |
| Outcome variable | Models binary outcome variables | Models continuous outcome variables |
| Regression line | Fits a non-linear S-curve using the sigmoid function | Fits a straight line of best fit |
| Linearity assumption | Linear relationship between the outcome and explanatory variables not needed/relevant | Linear relationship between the outcome and explanatory variables needed |
| Estimation | Usually estimated using maximum likelihood estimation (MLE) | Usually estimated using the method of least squares |
| Coefficients | Represent the change in the log odds of the outcome for every unit increase in an explanatory variable | Represent the change in the outcome variable for every unit increase in an explanatory variable |

Logistic regression in machine learning

Logistic regression is a statistical tool that underpins much of machine learning and artificial intelligence, including prediction algorithms and neural networks. In machine learning, it is mainly used for binary classification tasks, where the objective is to predict the probability that an observation belongs to one of two classes. This is closely related to the traditional statistical application of the method; the key difference is that in machine learning, logistic regression is used to develop a model that learns from labeled data (training data) and predicts binary values. Logistic regression is considered a type of supervised machine learning algorithm. Advantages of the method in this setting include that it is interpretable, simple to understand and can be run efficiently on large, complex datasets.
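
A minimal sketch of this supervised workflow in scikit-learn (one common choice of library, not prescribed by the article): the model is trained on labeled data, then predicts class probabilities for held-out observations. The data are synthetic.

```python
# Logistic regression as a supervised binary classifier in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
preds = clf.predict(X_test)              # thresholded at 0.5 by default
print(accuracy_score(y_test, preds))
```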


Interpreting logistic regression analysis

In a logistic regression model, the coefficients (represented by β in the equation) represent the change in the log odds of the outcome variable being 1 for each one-unit increase in a particular explanatory variable, holding the other explanatory variables constant. A more interpretable way to present a coefficient is to exponentiate it, which gives an odds ratio (OR). This is interpreted as how the odds of the outcome variable being 1 are expected to change (either increase or decrease) for each one-unit increase in the explanatory variable. An OR > 1 suggests an increase in odds, whereas an OR < 1 suggests a decrease.
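
A sketch of this conversion with statsmodels, on toy data with one binary explanatory variable and made-up outcome probabilities of 0.2 (x = 0) and 0.4 (x = 1), so the true OR is about 2.7:

```python
# Exponentiating fitted coefficients turns log odds into odds ratios.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.integers(0, 2, 400)
y = rng.binomial(1, np.where(x == 1, 0.4, 0.2))

result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(result.params)              # coefficients on the log odds scale
print(np.exp(result.params))      # exponentiated -> odds ratios
print(np.exp(result.conf_int()))  # 95% CIs on the OR scale
```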


Odds, odds ratios and log odds

Binary outcomes allow interpretable coefficients to be calculated as part of logistic regression. Odds of the outcome can be defined as follows:

$$\text{odds} = \frac{P(\text{outcome occurs})}{1 - P(\text{outcome occurs})}$$


We can produce a ratio to compare the odds of the outcome occurring in each category of an explanatory variable. Assuming we have an explanatory variable consisting of two groups (treated and untreated), the odds ratio is calculated as follows:

$$\text{OR} = \frac{\text{odds in treated}}{\text{odds in untreated}}$$


Logistic regression models everything on the log odds scale (this is done using the logit function), which means taking the logarithm of the odds in each of the two explanatory groups and of the odds ratio itself. It can be useful to rearrange the previous equation and represent this as follows:

$$\log(\text{odds in treated}) = \log(\text{odds in untreated}) + \log(\text{OR})$$


The model coefficients (calculated as log odds) can then be transformed back to the odds scale to obtain odds ratios (ORs) – the output we are usually interested in, because ORs are interpretable.


Logistic regression example

In a study investigating the association between Chlamydia trachomatis (C. trachomatis) bacterial infection and blindness, we have a binary explanatory variable (presence of infection, yes/no) and a binary outcome variable (blindness, yes/no).


The relationship between the variables can be summarized in a 2 x 2 table (Table 2).


Table 2: Occurrence of blindness by C. trachomatis infection.

|            | Infection: No | Infection: Yes | Total |
| ---------- | ------------- | -------------- | ----- |
| Blind: No  | 280           | 32             | 312   |
| Blind: Yes | 16            | 8              | 24    |
| Total      | 296           | 40             | 336   |

We can calculate the odds of blindness in each infection status group as follows:


Odds in infected = 8/32 = 0.25


Odds in uninfected = 16/280 ≈ 0.057


Then calculate the odds ratio by hand:


OR = 0.25/0.057 ≈ 4.38


Alternatively, we could fit a simple logistic regression model (with only one explanatory variable and one outcome variable) to the data, and it will produce the coefficients in log odds form. Normally, statistical software provides the log odds ratio (the β coefficient) and the log odds in the “baseline” group (the intercept α), which in this case is the log odds in the uninfected (assuming the variable was coded as uninfected = 0 and infected = 1). We can then substitute the values of the explanatory variable into a simplified version of the logistic regression model to find the log odds in the infected:

$$\log(\text{odds}) = \alpha + \beta \times \text{infection status}$$

$$\log(\text{odds}) = \log(\text{odds in uninfected}) + \log(\text{OR}) \times \text{infection status}$$


Where the log odds in the uninfected (the intercept α) can be hand-calculated as log(0.057) ≈ -2.86, and the log odds ratio (the coefficient β) as log(4.38) ≈ 1.48.


Hence log odds in the infected (when infection status = 1):

$$\log(\text{odds in infected}) = -2.86 + 1.48 = -1.38$$


The model can then be readily used to obtain the same OR that we hand-calculated above as 4.38. This can be interpreted as infected people having roughly four times higher odds of developing blindness compared with uninfected people. It should be noted that this is for one explanatory variable only; when other variables are included in a multiple logistic regression (such as age, sex and socioeconomic status), the estimated odds of blindness are likely to change.
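
A sketch of this fit in statsmodels (again, one choice of software among several): Table 2 is expanded into 336 individual observations, and the fitted model recovers the hand-calculated intercept, coefficient and OR.

```python
# Reproducing the worked example: expand Table 2 into individual
# observations and fit a simple logistic regression with statsmodels.
import numpy as np
import statsmodels.api as sm

# Infection status (0 = uninfected, 1 = infected) and blindness (0/1),
# laid out to match the counts in Table 2
infection = np.array([0] * 296 + [1] * 40)
blind = np.array([0] * 280 + [1] * 16 + [0] * 32 + [1] * 8)

result = sm.Logit(blind, sm.add_constant(infection)).fit(disp=0)
print(result.params)          # intercept ~ -2.86, coefficient ~ 1.48
print(np.exp(result.params))  # exp(1.48) ~ 4.38, the odds ratio
```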


A next step may be to conduct a statistical test for the association between infection and blindness, with the Wald test being a commonly used approach. This involves calculating a test statistic (a z statistic), a 95% confidence interval around the coefficient (a quantitative measure of uncertainty around the estimate) and a p-value (the probability of obtaining results at least as extreme as those observed if the null hypothesis were true). The z statistic is used to derive the p-value from the z-distribution (a probability distribution), either by hand using a look-up table or, more commonly, using statistical software. In our example, the z statistic yields a small p-value (p ≈ 0.002), indicating strong evidence against the null hypothesis of no association between blindness and infection status.
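
Continuing from the statsmodels sketch above, the Wald quantities can be read off the fitted result directly; the z statistic is simply the coefficient divided by its standard error.

```python
# Wald test quantities from the fitted model above (`result`).
z = result.params / result.bse
print(z)                  # z ~ 3.1 for the infection coefficient
print(result.pvalues)     # p ~ 0.002
print(result.conf_int())  # 95% CIs on the log odds scale
```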


Further reading