Logistic Regression


Logistic regression is a statistical method used to examine the relationship between a binary outcome variable and one or more explanatory variables. It is a special case of a regression model that predicts the probability of data falling into one of two categories and is often used to calculate odds ratios. This article will cover the basic theory behind this method, the types of logistic regression, when the method is useful and a worked example.



What is logistic regression?

Logistic regression is a powerful statistical method used to model the probability that an outcome (dependent or response) variable falls into one of two categories, based on a set of explanatory (independent or predictor) variables.


Logistic regression can be thought of as an extension to, or a special case of, linear regression. If the outcome variable is a continuous variable, linear regression is more suitable. The key difference between the two is that logistic regression uses a statistical function (the logistic or sigmoid function) to transform the regression line so that it fits the binary outcome (an outcome variable that can take only two categories). In other words, it maps the predicted values to probabilities, which are then used to calculate the model coefficients.


When to use logistic regression

Binomial logistic regression, where the outcome is binary (e.g. death, yes/no), is often simply referred to as logistic regression and will be the focus of this article. For example, a team of medical researchers may want to predict the risk of a heart attack (yes/no) based on a dataset of observed explanatory variables such as age, sex, other medical diagnoses, weight and lifestyle characteristics. Logistic regression is also commonly used in other settings such as economic analysis, market analysis, finance and social sciences.


Multiple logistic regression or multivariable logistic regression

Logistic regression is preferable to a simpler statistical test, such as the chi-squared test or Fisher’s exact test, because it can incorporate more than one explanatory variable and can account for possible interplay between explanatory variables when predicting the outcome variable, adjusting for them in the model. This is done using multiple or multivariable logistic regression. As with multiple linear regression, the interpretation changes: each coefficient represents an estimate of the association between a particular explanatory variable of interest and the binary outcome variable, while holding all other explanatory variables in the model constant.
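
To make this concrete, here is a minimal sketch of a multivariable logistic regression in Python using statsmodels (the article does not specify software, so this tooling is an assumption); the variable names echo the heart-attack example above and the data are simulated purely for illustration.

```python
# A minimal sketch of multivariable logistic regression using statsmodels;
# variable names and simulated data are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(30, 80, n),
    "sex": rng.integers(0, 2, n),       # 0 = female, 1 = male
    "weight": rng.normal(80, 15, n),
})
# Simulate a binary outcome whose log odds depend on the covariates
log_odds = -10 + 0.1 * df["age"] + 0.5 * df["sex"] + 0.02 * df["weight"]
df["heart_attack"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(df[["age", "sex", "weight"]])  # adds the intercept term
result = sm.Logit(df["heart_attack"], X).fit(disp=0)
# Each coefficient estimates the association between one variable and the
# outcome (on the log odds scale), holding the other variables constant.
print(result.summary())
```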


Types of logistic regression

Some key types of logistic regression in the statistical literature are summarized below:

  • Binary logistic regression: also referred to as binomial or simply logistic regression, this is when the outcome variable has two categories (e.g. death, yes/no).
  • Multinomial logistic regression: this extends logistic regression to outcome variables made up of three or more unordered categories (e.g. gender, male/female/nonbinary).
  • Multivariable logistic regression: in contrast to simple logistic regression, where only one explanatory variable is included, this is when more than one explanatory variable is modeled, so the effect of a variable of interest can be estimated while adjusting for the other variables.
  • Ordinal logistic regression: this is when the outcome variable is made up of three or more ordered categories (e.g. difficulty level, easy/medium/hard). This is not to be confused with a count variable (e.g. number of accidents in a day, 0/1/2/etc.); if an outcome takes this scale, logistic regression is unsuitable and Poisson regression should be used.
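
As a rough sketch of how these variants map to common Python tooling (one plausible mapping, not the only one), the multinomial case can be fitted with statsmodels' MNLogit; the toy data here are randomly generated for illustration only.

```python
# Multinomial logistic regression with statsmodels' MNLogit on toy data.
# Binary outcomes use sm.Logit, ordered categories use statsmodels'
# OrderedModel with distr="logit", and counts call for sm.Poisson.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.integers(0, 3, 300)  # three unordered categories

# One set of coefficients is estimated per non-baseline category
result = sm.MNLogit(y, X).fit(disp=0)
print(result.params)
```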


Logistic regression formula

The logistic regression model can be represented with the following formula:

$$P(Y = 1 \mid X) = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2}}$$


Where the left side of the equation is the probability that the outcome variable Y is 1 given the explanatory variables X. The intercept is represented by α, β₁ and β₂ are the regression coefficients of the model, and x₁ and x₂ are the corresponding explanatory variables. This can be readily extended to include more than two explanatory variables. Finally, e is the base of the natural logarithm.


The formula overall represents the linear combination of the explanatory variables being transformed into a probability using the logistic (or sigmoid) function.
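
A short numerical sketch of this idea: a linear combination of explanatory variables is passed through the logistic (sigmoid) function to yield a probability. All coefficient and variable values below are made up for illustration.

```python
# Numerically checking the formula: the linear predictor alpha + b1*x1 + b2*x2
# passed through the logistic function gives a probability between 0 and 1.
import numpy as np

alpha, beta1, beta2 = -4.0, 0.05, 0.8  # illustrative intercept and coefficients
x1, x2 = 60.0, 1.0                     # illustrative explanatory values

linear_predictor = alpha + beta1 * x1 + beta2 * x2
p = np.exp(linear_predictor) / (1 + np.exp(linear_predictor))
print(p)  # always strictly between 0 and 1
```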


Logistic regression assumptions

Logistic regression relies on the following assumptions about the data:

  • Binary outcome: the dependent variable should take a binary or dichotomous scale, such as death (yes/no) or diagnosis of heart disease (yes/no).
  • Independence: the observations in the dataset should be unrelated to one another and collected as part of a random sample of a population.
  • Linearity of logit: there should be a linear relationship between any continuous explanatory variables and the logit transformation of the outcome variable.
  • No collinearity: explanatory variables in the model should not be highly correlated with each other.  
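
The collinearity assumption can be checked before fitting; one common approach (a sketch on simulated data, with the usual caveat that thresholds are rules of thumb) uses variance inflation factors from statsmodels.

```python
# Collinearity check using variance inflation factors (VIFs) on simulated
# data; a VIF well above ~5-10 is a common rule of thumb for problems.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.uniform(30, 80, 200),
    "weight": rng.normal(80, 15, 200),
})
X = sm.add_constant(df)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```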


What is a logit model?

The term logit function is closely related to logistic and refers to the inverse of the logistic function. The logit function transforms a probability (between 0 and 1) into a number that can range from −∞ to +∞. The practical reason for using the logit function is that it linearizes the relationship between the explanatory and outcome variables, allowing it to be analyzed using linear regression techniques; in this way, it also allows non-linear relationships to be modeled within a regression framework. Importantly, a logit model produces interpretable coefficients: each coefficient is the change in the log odds of the outcome for every unit increase in the explanatory variable, and its exponential is an odds ratio.
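
This inverse relationship is easy to verify numerically; scipy exposes both functions directly (expit is scipy's name for the logistic function).

```python
# The logit and logistic functions are inverses: logit maps a probability
# in (0, 1) to a log odds value on the whole real line, and expit (the
# logistic/sigmoid function) maps it back.
from scipy.special import expit, logit

p = 0.8
log_odds = logit(p)     # log(0.8 / 0.2) ~ 1.386
print(log_odds)
print(expit(log_odds))  # recovers 0.8
```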


Visualizing the logistic regression model

It can be useful to visualize the sigmoid function, the key characteristic of a logistic regression model (Figure 1). The purpose of the function is to transform any real number into a probability between 0 and 1; the output cannot go beyond these limits, which is why the function forms an “S”-curve.



Figure 1: Visualization of the sigmoid function.
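
A minimal matplotlib sketch along the following lines reproduces the S-curve shown in Figure 1 (axis labels are our own choice).

```python
# Plotting the sigmoid function to reproduce the S-curve of Figure 1.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
y = 1 / (1 + np.exp(-x))  # the sigmoid function

plt.plot(x, y)
plt.xlabel("Linear predictor")
plt.ylabel("Probability")
plt.title("Sigmoid function")
plt.show()
```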


Logistic regression vs linear regression

There are some key differences between logistic and linear regression in addition to the type of outcome variable analyzed, summarized in Table 1.


Table 1: Summary of some key differences between logistic and linear regression.

| Element | Logistic regression | Linear regression |
| --- | --- | --- |
| Outcome variable | Models binary outcome variables | Models continuous outcome variables |
| Regression line | Fits a non-linear S-curve using the sigmoid function | Fits a straight line of best fit |
| Linearity assumption | Linear relationship between the outcome and explanatory variables not needed/relevant | Linear relationship between the outcome and explanatory variables needed |
| Estimation | Usually estimated using maximum likelihood estimation (MLE) | Usually estimated using the method of least squares |
| Coefficients | Represent the change in the log odds of the outcome for every unit increase in an explanatory variable | Represent the change in the outcome variable for every unit increase in an explanatory variable |

Logistic regression in machine learning

Logistic regression is a statistical tool that underpins much of machine learning and artificial intelligence, including prediction algorithms and neural networks. In machine learning, it is mainly used for binary classification tasks, where the objective is to predict the probability that an observation belongs to one of two classes. This is closely related to the traditional statistical application of the method; the key difference is that in machine learning, logistic regression is used to develop a model that learns from labeled data (training data) and predicts binary values. Logistic regression is considered a type of supervised machine learning algorithm. Advantages of the method in this setting include that it is interpretable, simple to understand and can be run efficiently on large, complex datasets.
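
A minimal sketch of this supervised workflow in scikit-learn (one common choice of library, not prescribed by the article): the model is trained on labeled data, then predicts class probabilities for held-out observations. The data are synthetic.

```python
# Logistic regression as a supervised binary classifier in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1
preds = clf.predict(X_test)              # thresholded at 0.5 by default
print(accuracy_score(y_test, preds))
```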


Interpreting logistic regression analysis

In a logistic regression model, the coefficients (represented by β in the equation) represent the change in the log odds of the outcome variable being 1 for each one-unit increase in a particular explanatory variable, holding the other explanatory variables constant. A more interpretable way to present a coefficient is to exponentiate it, which gives an odds ratio (OR). This is interpreted as how the odds of the outcome variable being 1 are expected to change (either increase or decrease) for each one-unit increase in the explanatory variable. An OR > 1 suggests an increase in odds, whereas an OR < 1 suggests a decrease.
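
A sketch of this conversion with statsmodels, on toy data with one binary explanatory variable and made-up outcome probabilities of 0.2 (x = 0) and 0.4 (x = 1), so the true OR is about 2.7:

```python
# Exponentiating fitted coefficients turns log odds into odds ratios.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.integers(0, 2, 400)
y = rng.binomial(1, np.where(x == 1, 0.4, 0.2))

result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(result.params)              # coefficients on the log odds scale
print(np.exp(result.params))      # exponentiated -> odds ratios
print(np.exp(result.conf_int()))  # 95% CIs on the OR scale
```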


Odds, odds ratios and log odds

Binary outcomes allow interpretable coefficients to be calculated as part of logistic regression. Odds of the outcome can be defined as follows:

$$\text{odds} = \frac{P(\text{outcome occurs})}{1 - P(\text{outcome occurs})}$$


We can produce a ratio to compare the odds of the outcome occurring in each category of an explanatory variable. Assuming we have an explanatory variable consisting of two groups (treated and untreated), the odds ratio is calculated as follows:

$$\text{OR} = \frac{\text{odds in treated}}{\text{odds in untreated}}$$


Logistic regression models everything on the log odds scale (this is done using the logit function), which means taking the logarithm of the odds in each of the two explanatory groups and of the odds ratio itself. It can be useful to rearrange the previous equation and represent this as follows:

$$\log(\text{odds in treated}) = \log(\text{odds in untreated}) + \log(\text{OR})$$


The model coefficients (calculated as log odds) can then be transformed back to the odds scale to obtain odds ratios (ORs) – the output we are usually interested in, because ORs are interpretable.


Logistic regression example

In a study investigating the association between Chlamydia trachomatis (C. trachomatis) bacterial infection and blindness, we have a binary explanatory variable (presence of infection, yes/no) and a binary outcome variable (blindness, yes/no).


The relationship between the variables can be summarized in a 2 x 2 table (Table 2).


Table 2: Occurrence of blindness by C. trachomatis infection.

|            | Infection: No | Infection: Yes | Total |
| ---------- | ------------- | -------------- | ----- |
| Blind: No  | 280           | 32             | 312   |
| Blind: Yes | 16            | 8              | 24    |
| Total      | 296           | 40             | 336   |

We can calculate the odds of blindness in each infection status group as follows:


Odds in infected = 8/32 = 0.25


Odds in uninfected = 16/280 ≈ 0.057


Then calculate the odds ratio by hand:


OR = 0.25/0.057 ≈ 4.38


Alternatively, we could fit a simple logistic regression model (with only one explanatory variable and one outcome variable) to the data, and it will produce the coefficients in log odds form. Normally, statistical software provides the log odds ratio (the β coefficient) and the log odds in the “baseline” group (the intercept α), which in this case is the log odds in the uninfected (assuming the variable was coded as uninfected = 0 and infected = 1). We can then substitute the values of the explanatory variable into a simplified version of the logistic regression model to find the log odds in the infected:

$$\log(\text{odds}) = \alpha + \beta \times \text{infection status}$$

$$\log(\text{odds}) = \log(\text{odds in uninfected}) + \log(\text{OR}) \times \text{infection status}$$


Where the log odds in the uninfected (the intercept α) can be hand-calculated as log(0.057) ≈ -2.86, and the log odds ratio (the coefficient β) as log(4.38) ≈ 1.48.


Hence log odds in the infected (when infection status = 1):

$$\log(\text{odds in infected}) = -2.86 + 1.48 = -1.38$$


The model can then be readily used to obtain the same OR that we hand-calculated above as 4.38. This can be interpreted as infected people having roughly four times higher odds of developing blindness compared with uninfected people. It should be noted that this is for one explanatory variable only; when other variables are included in a multiple logistic regression (such as age, sex and socioeconomic status), the estimated odds of blindness are likely to change.
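
A sketch of this fit in statsmodels (again, one choice of software among several): Table 2 is expanded into 336 individual observations, and the fitted model recovers the hand-calculated intercept, coefficient and OR.

```python
# Reproducing the worked example: expand Table 2 into individual
# observations and fit a simple logistic regression with statsmodels.
import numpy as np
import statsmodels.api as sm

# Infection status (0 = uninfected, 1 = infected) and blindness (0/1),
# laid out to match the counts in Table 2
infection = np.array([0] * 296 + [1] * 40)
blind = np.array([0] * 280 + [1] * 16 + [0] * 32 + [1] * 8)

result = sm.Logit(blind, sm.add_constant(infection)).fit(disp=0)
print(result.params)          # intercept ~ -2.86, coefficient ~ 1.48
print(np.exp(result.params))  # exp(1.48) ~ 4.38, the odds ratio
```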


A next step may be to conduct a statistical test for the association between infection and blindness, with the Wald test being a commonly used approach. This involves calculating a test statistic (a z statistic), a 95% confidence interval around the coefficient (a quantitative measure of uncertainty around the estimate) and a p-value (the probability of obtaining results at least as extreme as those observed if the null hypothesis were true). The z statistic is used to derive the p-value from the z-distribution (a probability distribution), either by hand using a look-up table or, more commonly, using statistical software. In our example, the z statistic yields a small p-value (p ≈ 0.002), indicating strong evidence against the null hypothesis of no association between blindness and infection status.
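
Continuing from the statsmodels sketch above, the Wald quantities can be read off the fitted result directly; the z statistic is simply the coefficient divided by its standard error.

```python
# Wald test quantities from the fitted model above (`result`).
z = result.params / result.bse
print(z)                  # z ~ 3.1 for the infection coefficient
print(result.pvalues)     # p ~ 0.002
print(result.conf_int())  # 95% CIs on the log odds scale
```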


Further reading