LOGISTIC REGRESSION

Logistic regression is a method for modeling a non-linear relationship between a categorical dependent variable and one or more independent variables. The dependent variable is also called the response variable, while the independent variables are called the predictors. Suppose the response takes one of K possible values, labeled 1, ..., K. The logistic regression model is the following:

                    log( P(Y = 1) / P(Y = K) ) = B10 + B11 X1 + ... + B1p Xp

                    log( P(Y = 2) / P(Y = K) ) = B20 + B21 X1 + ... + B2p Xp

                                                                ...

                    log( P(Y = K-1) / P(Y = K) ) = B(K-1)0 + B(K-1)1 X1 + ... + B(K-1)p Xp.        (1)

Here Y is the response and X1, ... , Xp are the predictors. The expressions on the left-hand side of equations (1) are known as log-odds. For example, log( P(Y = 2) / P(Y = K) ) are the log-odds of observing category 2 versus category K, and P(Y = 2) / P(Y = K) are the odds of observing category 2 versus category K. Equations (1) tell us that, with every unit increase in predictor Xi, holding the other predictors fixed, the log-odds of seeing category k instead of category K increase by Bki. Equations (1) can be rephrased in terms of probabilities:

                    P(Y = k) = exp( Bk0 + Bk1 X1 + ... + Bkp Xp ) / { 1 + exp( B10 + ... + B1p Xp ) + ... + exp( B(K-1)0 + ... + B(K-1)p Xp ) },     k = 1, ..., K-1.        (2)

The probability of the reference category K has the same denominator with the numerator equal to 1: P(Y = K) = 1 / { 1 + exp( B10 + ... + B1p Xp ) + ... + exp( B(K-1)0 + ... + B(K-1)p Xp ) }.
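
As an illustration of equation (2), below is a minimal Python sketch that converts a set of multinomial logistic coefficients into category probabilities. The coefficient and predictor values are made up purely for illustration and are not estimates from any real data.

# Category probabilities from multinomial logistic coefficients, equation (2).
# All numbers below are illustrative assumptions.
import numpy as np

# K - 1 = 2 non-reference categories, p = 2 predictors.
# Row k holds (Bk0, Bk1, Bk2) for category k = 1, 2.
B = np.array([[ 0.5, 1.2, -0.7],    # B10, B11, B12
              [-0.3, 0.4,  0.9]])   # B20, B21, B22

x = np.array([1.0, 2.0, -1.0])      # (1, X1, X2); the leading 1 multiplies the intercept

eta = B @ x                          # linear predictors Bk0 + Bk1 X1 + Bk2 X2, k = 1, ..., K-1
denom = 1.0 + np.exp(eta).sum()      # denominator of equation (2)

probs = np.append(np.exp(eta), 1.0) / denom   # P(Y = 1), P(Y = 2), P(Y = K)
print(probs, probs.sum())                     # the probabilities sum to 1

# Note: exp(Bki) is the factor by which the odds of category k versus category K
# are multiplied for every unit increase in Xi, other predictors held fixed.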

Note that logistic regression focuses on the probabilities of the response variable taking particular values. This is different from linear regression, which focuses on the expectation of the response variable. Linear regression may get the expectation right but misspecify the probabilities due to an incorrect choice of the distribution of the residuals. Conversely, logistic regression may get the probabilities of several categories right but misspecify the rest of the distribution.

Logistic regression is a particular case of generalized linear models. In a generalized linear model, the conditional distribution of the response given the predictors can be any standard continuous or discrete distribution. The mean of this conditional distribution is tied to a linear combination of the predictors through a non-linear function, called the link function. The coefficients of the linear combination are the true parameters of the generalized linear model and are estimated from the data. Logistic regression corresponds to the so-called logit link function and a multinomial conditional distribution of the response.
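
To make this concrete, here is a minimal sketch fitting a binary logistic regression (K = 2) as a generalized linear model with the binomial family and logit link. It assumes the Python packages numpy and statsmodels are available; the simulated data and coefficient values are purely illustrative.

# Binary logistic regression fitted as a GLM with the logit link.
# Data are simulated; nothing here comes from a real study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                        # two predictors X1, X2
eta = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]          # assumed true linear part of the logit
p = 1.0 / (1.0 + np.exp(-eta))                     # inverse of the logit link
y = rng.binomial(1, p)                             # binary response (K = 2)

X_design = sm.add_constant(X)                      # prepend the intercept column
model = sm.GLM(y, X_design, family=sm.families.Binomial())   # logit is the default link
result = model.fit()                               # maximum likelihood estimation
print(result.summary())                            # fitted coefficients are log-odds effects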

Just like the rest of generalized linear models, the logistic model is estimated by the method of maximum likelihood. Since the model is non-linear, the traditional t-tests and F-tests do not apply. They are replaced by Wald tests and likelihood ratio tests, which are easily calculated as by-products of the maximum likelihood estimation of the model.
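
The sketch below illustrates both kinds of tests on simulated binary data, again assuming numpy, scipy and statsmodels are available. The Wald statistic for a coefficient is its estimate divided by its standard error, and the likelihood ratio statistic is twice the difference between the maximized log-likelihoods of nested models.

# Wald and likelihood ratio tests for a logistic regression; simulated data only.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(500, 2)))     # intercept plus two predictors
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.0, -2.0]))))
y = rng.binomial(1, p)

full = sm.GLM(y, X, family=sm.families.Binomial()).fit()            # both predictors
reduced = sm.GLM(y, X[:, :2], family=sm.families.Binomial()).fit()  # second predictor dropped

# Wald tests: each estimate divided by its standard error, referred to a standard normal.
print(full.params / full.bse)
print(full.pvalues)

# Likelihood ratio test of the dropped predictor: twice the gap in maximized
# log-likelihoods, referred to a chi-squared distribution with 1 degree of freedom.
lr_stat = 2.0 * (full.llf - reduced.llf)
print(lr_stat, chi2.sf(lr_stat, df=1))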

