Statistical & Financial Consulting by Stanford PhD

Logistic regression is a method for modeling a non-linear relationship between a categorical dependent variable and one or more independent variables. The dependent variable is also called the response, and the independent variables are called predictors. For a response with K categories, labeled 1, ..., K, the logistic regression model is the following:

                    log( P(Y = 1) / P(Y = K) ) = B10 + B11 X1 + ... + B1p Xp

                    log( P(Y = 2) / P(Y = K) ) = B20 + B21 X1 + ... + B2p Xp

                    ...

                    log( P(Y = K-1) / P(Y = K) ) = B(K-1)0 + B(K-1)1 X1 + ... + B(K-1)p Xp.        (1)

Here Y is the response and X1, ... , Xp are the predictors. The expressions on the left hand side of equations (1) are known as log-odds. For example, log( P(Y = 2) / P(Y = K) ) is the log-odds of observing category 2 versus category K, and P(Y = 2) / P(Y = K) is the odds of observing category 2 versus category K. Category K serves as the reference category. Equations (1) tell us that, with every unit increase in predictor Xi and all other predictors held fixed, the log-odds of seeing category k instead of category K increase by Bki, for k = 1, ..., K-1. Equivalently, the odds are multiplied by exp( Bki ). Equations (1) can be rephrased in terms of probabilities:
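The interpretation of Bki can be checked numerically. The following is a minimal Python sketch, not part of the original text: the coefficient values and the helper name log_odds are made up for illustration. It raises X1 by one unit while holding X2 fixed and measures the change in the log-odds of category k versus category K.

```python
import math

def log_odds(row, x):
    """Log-odds of category k versus reference category K, as in equations (1).

    row: [Bk0, Bk1, ..., Bkp] for a single category k.
    x:   [X1, ..., Xp].
    """
    return row[0] + sum(b * xi for b, xi in zip(row[1:], x))

# Hypothetical coefficients [Bk0, Bk1, Bk2]; not taken from any data set.
row_k = [-0.3, 0.8, -1.2]

x_before = [2.0, 1.0]
x_after = [3.0, 1.0]  # X1 raised by one unit, X2 unchanged

# The change in log-odds equals Bk1 (up to floating-point rounding).
change = log_odds(row_k, x_after) - log_odds(row_k, x_before)

# On the odds scale, the same change is a multiplication by exp(Bk1).
odds_ratio = math.exp(change)

print(change, odds_ratio)
```

The change in log-odds does not depend on the starting values of the predictors, which is what makes the coefficients interpretable on the log-odds scale.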

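P(Y = k) = exp( Bk0 + Bk1 X1 + ... + Bkp Xp ) / { 1 + exp( B10 + ... + B1p Xp ) + ... + exp( B(K-1)0 + ... + B(K-1)p Xp ) },    k = 1, ..., K-1,

P(Y = K) = 1 / { 1 + exp( B10 + ... + B1p Xp ) + ... + exp( B(K-1)0 + ... + B(K-1)p Xp ) }.    (2)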
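As an illustration of equations (2), here is a minimal Python sketch that turns a set of coefficients into category probabilities. The function name category_probabilities and the coefficient values are assumptions made for this example. The reference category K is given a linear predictor of 0, which corresponds to the term 1 in the denominator.

```python
import math

def category_probabilities(coefs, x):
    """Return [P(Y=1), ..., P(Y=K)] according to equations (2).

    coefs: K-1 rows [Bk0, Bk1, ..., Bkp], one per non-reference category.
    x:     predictor values [X1, ..., Xp].
    """
    # Linear predictors for categories 1..K-1, then 0 for reference category K.
    etas = [row[0] + sum(b * xi for b, xi in zip(row[1:], x)) for row in coefs]
    etas.append(0.0)
    # Subtracting the maximum before exponentiating avoids overflow;
    # the common factor cancels in the ratio.
    m = max(etas)
    weights = [math.exp(e - m) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients for K = 3 categories and p = 2 predictors.
coefs = [[0.5, 1.0, -0.5],
         [-1.0, 0.25, 0.75]]
probs = category_probabilities(coefs, [1.0, 2.0])

print(probs, sum(probs))
```

Taking log( probs[0] / probs[2] ) recovers the first linear predictor, B10 + B11 X1 + B12 X2, so the sketch is consistent with equations (1).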

Note that logistic regression models the probability that the response takes a particular value, whereas linear regression models the expectation of the response. Linear regression may get the expectation right but misspecify the probabilities because of an incorrect choice of the distribution of residuals. Conversely, logistic regression may get the probabilities of several categories right but misspecify the rest of the distribution.

Logistic regression is a particular case of generalized linear models. In a generalized linear model, the conditional distribution of the response given the predictors belongs to an exponential family, which covers many standard continuous and discrete distributions (normal, binomial, Poisson, gamma and others). The conditional mean of the response is tied to a linear combination of the predictors through a function called the link function: the link function applied to the mean equals the linear combination. The coefficients of that linear combination are the parameters of the generalized linear model, and they are estimated from the data. Logistic regression corresponds to the so-called logit link function and a multinomial conditional distribution of the response, which reduces to the binomial (Bernoulli) distribution when K = 2.
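For the binary case K = 2, the logit link and its inverse can be written down directly. The sketch below is a minimal Python illustration; the function names logit and inverse_logit are chosen for this example. The logit maps a probability to the log-odds scale, where the model is linear in the predictors, and the inverse maps a linear predictor back to a probability.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps a linear predictor to a probability.

    Written in two branches so that exp() is never called on a
    large positive argument.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical linear predictor B10 + B11 X1 + ... + B1p Xp.
eta = 0.4
p1 = inverse_logit(eta)  # P(Y = 1) when K = 2
p2 = 1.0 - p1            # P(Y = 2), the reference category

print(p1, p2, logit(p1))
```

With K = 2, inverse_logit(eta) equals exp(eta) / { 1 + exp(eta) }, which is equation (2) with a single non-reference category.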

Like other generalized linear models, the logistic model is estimated by the method of maximum likelihood. Since the model is non-linear and the response is not normally distributed, the traditional t-tests and F-tests of linear regression do not apply. They are replaced by Wald tests and likelihood ratio tests, which rely on the large-sample properties of maximum likelihood estimators. These tests are easily calculated as by-products of the maximum likelihood estimation of the model.
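The estimation and testing steps can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a production implementation: the response is binary (K = 2, coded 0/1), there is a single predictor, the toy data are made up, and the function name fit_binary_logistic is hypothetical. The likelihood is maximized by Newton-Raphson iterations. The Wald test uses the standard error from the inverse information matrix, and the likelihood ratio test compares the fitted model with an intercept-only model. Both p-values use large-sample approximations (standard normal and chi-square with 1 degree of freedom).

```python
import math

def fit_binary_logistic(x, y, iterations=25):
    """Fit log( P(Y=1) / P(Y=0) ) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, se0, se1, loglik).
    """
    def prob_and_information(b0, b1):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Entries of the 2x2 information matrix (negative Hessian).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        return p, h00, h01, h11

    b0 = b1 = 0.0
    for _ in range(iterations):
        p, h00, h01, h11 = prob_and_information(b0, b1)
        # Score vector: gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        det = h00 * h11 - h01 * h01
        # Newton step: estimate + inverse(information) * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det

    p, h00, h01, h11 = prob_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    # Standard errors from the diagonal of the inverse information matrix.
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                 for yi, pi in zip(y, p))
    return b0, b1, se0, se1, loglik

# Made-up toy data (not perfectly separated, so the MLE exists).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1, se0, se1, loglik = fit_binary_logistic(x, y)

# Wald test of b1 = 0: two-sided normal p-value.
wald_z = b1 / se1
wald_p = math.erfc(abs(wald_z) / math.sqrt(2.0))

# Likelihood ratio test against the intercept-only model.
p_null = sum(y) / len(y)
loglik_null = sum(yi * math.log(p_null) + (1 - yi) * math.log(1.0 - p_null)
                  for yi in y)
lr_stat = max(2.0 * (loglik - loglik_null), 0.0)
lr_p = math.erfc(math.sqrt(lr_stat / 2.0))  # chi-square tail, 1 df

print(b0, b1, wald_z, wald_p, lr_stat, lr_p)
```

With only eight observations the large-sample approximations behind both tests are rough, and the two p-values need not agree closely. In practice a statistical package would be used for the fit, and the sketch is meant only to show where the test statistics come from.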

