Statistical & Financial Consulting by Stanford PhD
MULTIPLE LINEAR REGRESSION

Multiple Linear Regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is also called the response variable, while the independent variables are called the predictors. The regression model is the following:

$Y = B_0 + B_1 X_1 + ... + B_p X_p + e.$

Here Y is the response, X1, ... , Xp are the predictors and e is a random variable with zero expectation. Variable e is called the residual (or error term). It denotes the error in predicting Y with the help of X1, ... , Xp.

In the most general set-up, predictors X1, ..., Xp are assumed to be stochastic, and residual e has zero expectation, conditional on X1, ..., Xp. Variable e does not have to be independent of X1, ... , Xp. In fact, it is the error of measurement, so its variance may depend on various relevant factors, like the ones recorded in X1, ... , Xp.

To estimate the regression model we observe several realizations of the random vector (Y, X1, ... , Xp). Note that variable e is not observed. However, if we find a way to estimate regression coefficients B0, ... , Bp accurately, we can arrive at an accurate estimate of e. Let us denote the realizations of vector (Y, X1, ..., Xp) as (Y1, X11, ..., Xp1), ... , (Yn, X1n, ..., Xpn). There are unobserved residuals e1, ... , en, corresponding to the realizations.
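As a minimal sketch of this set-up, the snippet below simulates realizations of (Y, X1, X2) and estimates the coefficients by least squares with NumPy. The coefficient values, sample size and noise level are illustrative assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n realizations of (Y, X1, X2) from Y = B0 + B1*X1 + B2*X2 + e.
n = 500
true_B = np.array([1.0, 2.0, -0.5])          # illustrative B0, B1, B2
X = rng.normal(size=(n, 2))                  # stochastic predictors X1, X2
e = rng.normal(scale=0.3, size=n)            # unobserved residuals, zero mean
design = np.column_stack([np.ones(n), X])    # the column of ones carries B0
Y = design @ true_B + e

# Least-squares estimates of B0, B1, B2.
B_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)

# Estimated residuals: our reconstruction of the unobserved e1, ..., en.
e_hat = Y - design @ B_hat

print("estimated coefficients:", B_hat)
print("correlation of estimated and true residuals:", np.corrcoef(e_hat, e)[0, 1])
```

In a simulation the true residuals are available, so the last line can check how well they are recovered. With real data only e_hat is available.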

1] If e1, ... , en are assumed to be independent with a common variance, the regression model is estimated using the method of Ordinary Least Squares (OLS).

2] If e1, ... , en are assumed to be independent with possibly different variances, the regression model is estimated using the method of Weighted Least Squares (WLS).

3] If e1, ... , en are assumed to be correlated with possibly different variances, the regression model is estimated using the method of Generalized Least Squares (GLS).
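The three cases can be sketched with one routine, since OLS and WLS are special cases of GLS. The sketch assumes the covariance matrix of the residuals is known up to a constant factor. The function names `gls_fit`, `wls_fit` and `ols_fit` are hypothetical helpers introduced for illustration, and the simulated data are made up.

```python
import numpy as np

def gls_fit(design, y, cov):
    """GLS coefficient estimates for a known residual covariance matrix `cov`."""
    # Whitening: with cov = L L^T, the transformed residuals L^{-1} e are
    # uncorrelated with equal variance, so OLS applies to the transformed data.
    L = np.linalg.cholesky(cov)
    design_w = np.linalg.solve(L, design)
    y_w = np.linalg.solve(L, y)
    coef, *_ = np.linalg.lstsq(design_w, y_w, rcond=None)
    return coef

def wls_fit(design, y, variances):
    """WLS: independent residuals, so the covariance matrix is diagonal."""
    return gls_fit(design, y, np.diag(variances))

def ols_fit(design, y):
    """OLS: independent residuals with a common variance."""
    return gls_fit(design, y, np.eye(len(y)))

# Illustrative data in which the residual variance grows with the predictor.
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1.0, 5.0, size=n)
design = np.column_stack([np.ones(n), x])
variances = 0.25 * x**2
y = design @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(variances)

print("OLS:", ols_fit(design, y))
print("WLS:", wls_fit(design, y, variances))
```

Both estimators target the same coefficients here. WLS down-weights the noisier observations, which typically makes its estimates less variable than those of OLS when the variances really differ.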

The data do not have to be jointly normal (Gaussian) for the estimation to work. In fact, the following is true: if the residuals are normal conditional on predictors (X11, ..., Xp1), ... , (X1n, ..., Xpn), then the OLS / WLS / GLS method is the most statistically efficient (accurate). Depending on whether the correlation and variance structure of residuals e1, ... , en is known or has to be estimated, different variations of WLS and GLS are used. Note that X1, ..., Xp and Y do not have to be normal.
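A small simulation can illustrate that normality is not required. In this sketch both the predictor and the residuals are drawn from skewed exponential distributions, and the OLS estimates are averaged over repeated samples. All distributions, sizes and coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
true_B = np.array([1.0, 2.0])      # illustrative B0, B1
n, n_rep = 100, 200
estimates = np.empty((n_rep, 2))

for r in range(n_rep):
    x = rng.exponential(scale=1.0, size=n)          # skewed, non-normal predictor
    e = rng.exponential(scale=1.0, size=n) - 1.0    # skewed residuals, zero mean
    design = np.column_stack([np.ones(n), x])
    y = design @ true_B + e
    estimates[r] = np.linalg.lstsq(design, y, rcond=None)[0]

# The average estimate over repeated samples should sit close to the true values.
print("mean of OLS estimates:", estimates.mean(axis=0))
```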

After the estimation has been done, it is time for diagnostics. The estimates of regression coefficients B0, ... , Bp as well as the estimates of correlations and variances of the residuals are used in the following procedures.

1] Whether the true value of coefficient Bi is 0 is determined by a t-test. If Bi = 0, then predictor Xi does not have any linear predictive power beyond that of the other predictors and can be dropped from the model. The test is conditional on the other predictors remaining in the model, so it measures the contribution of Xi over and above theirs. If we conclude that Bi ≠ 0, then predictor Xi is called statistically significant.

2] Whether the true values of several coefficients Bi1, ... , Bik are all 0 is determined by an F-test. If Bi1 = 0, ... , Bik = 0, then predictors Xi1, ... , Xik do not have any linear predictive power beyond that of the remaining predictors and can be dropped from the model. Again, the test is conditional on the other predictors remaining in the model.

3] Whether predictors X1, ..., Xp explain a large part of the variation in Y is determined using the R-squared, the adjusted R-squared, Mallows' Cp and other statistics.

4] The diagnostics of outliers and high-leverage points are performed using Cook's distance, the leverage statistic and other relevant metrics. If flawed data points are detected, they are dropped from the data set.
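The four procedures above can be sketched for the OLS case with the standard textbook formulas. The helper names `ols_diagnostics` and `f_statistic` are hypothetical, and the data, including the planted flawed point, are made up for illustration. The code returns the test statistics only. To obtain p-values, the t statistics would be compared with a t distribution on n - k degrees of freedom and the F statistic with an F distribution on (q, n - k) degrees of freedom, where k is the number of coefficients and q the number tested.

```python
import numpy as np

def ols_diagnostics(design, y):
    """OLS fit plus common diagnostics; `design` includes a column of ones."""
    n, k = design.shape                          # k = p + 1 coefficients
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    rss = float(resid @ resid)
    sigma2 = rss / (n - k)                       # estimate of the residual variance
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
    r2 = 1.0 - rss / np.sum((y - y.mean()) ** 2)
    # Diagonal of the hat matrix X (X'X)^{-1} X': the leverage of each point.
    leverage = np.einsum("ij,jk,ik->i", design, xtx_inv, design)
    return {
        "coef": coef,
        "t": coef / std_err,
        "rss": rss,
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - k),
        "leverage": leverage,
        "cooks_d": resid**2 * leverage / (k * sigma2 * (1.0 - leverage) ** 2),
    }

def f_statistic(design, y, drop_cols):
    """F statistic for the hypothesis that the coefficients of `drop_cols` are all 0."""
    n, k = design.shape
    keep = [j for j in range(k) if j not in drop_cols]
    rss_full = ols_diagnostics(design, y)["rss"]
    rss_reduced = ols_diagnostics(design[:, keep], y)["rss"]
    return ((rss_reduced - rss_full) / len(drop_cols)) / (rss_full / (n - k))

# Illustrative data: X1 matters, X2 does not, and observation 0 is flawed.
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
X[0] = [8.0, 0.0]                        # far from the bulk of the predictor values
y[0] = 1.0 + 2.0 * 8.0 + 10.0            # and far from the true regression line
design = np.column_stack([np.ones(n), X])

out = ols_diagnostics(design, y)
print("t statistics:", out["t"])
print("F statistic for dropping X2:", f_statistic(design, y, [2]))
print("R-squared, adjusted R-squared:", out["r2"], out["adj_r2"])
print("observation with the largest Cook's distance:", int(np.argmax(out["cooks_d"])))
```

When a single coefficient is tested, the F statistic equals the square of the corresponding t statistic, which gives a quick consistency check on the two functions.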

As the next stage, we experiment with adding new predictors and dropping some of the old ones to see what the resulting models are. In the end, we want to have a collection of candidate models in which each predictor is significant. We then choose the best model using one of the standard model selection methods. This model can be used for prediction on a completely new data set.
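The text does not name a particular selection method. As one possible illustration, the sketch below compares three candidate models by the Akaike Information Criterion (AIC) under Gaussian residuals, where a lower score is better. The helper `aic_ols`, the candidate sets and the simulated data are assumptions made for this example. Cross-validation or BIC could be substituted in the same place.

```python
import numpy as np

def aic_ols(design, y):
    """Gaussian AIC of an OLS fit, up to an additive constant."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return n * np.log(rss / n) + 2 * k

# Illustrative data: Y depends on X1 and X2, while X3 is pure noise.
rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Candidate models, described by the predictor columns they include.
candidates = {"X1": [0], "X1+X2": [0, 1], "X1+X2+X3": [0, 1, 2]}
ones = np.ones((n, 1))
scores = {
    name: aic_ols(np.hstack([ones, X[:, cols]]), y)
    for name, cols in candidates.items()
}
best = min(scores, key=scores.get)

print(scores)
print("selected model:", best)
```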

Note that multiple linear regression is called "linear" because of the linear dependence of the response on the regression coefficients. The relationship between the response and the predictors does not have to be linear. Even if the true relationship is the following:

$Y = B_0 + B_1 \log(X_1),$

we can denote X'1 = log(X1) and make the model linear in the predictor:

$Y = B_0 + B_1 X_1'.$

The new notation does not change the estimates of coefficients B0 and B1 at all.
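A short sketch of this point: data are simulated from a logarithmic relationship, the transformed predictor X'1 = log(X1) is computed, and the model is fitted by ordinary least squares. The values of B0 and B1, the noise level and the range of X1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.uniform(1.0, 50.0, size=n)
y = 3.0 + 1.5 * np.log(x1) + rng.normal(scale=0.2, size=n)   # B0 = 3.0, B1 = 1.5

def fit_line(x, y):
    """OLS fit of y on an intercept and x; returns coefficients and RSS."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return coef, rss

x1_prime = np.log(x1)                       # X'1 = log(X1)
coef_log, rss_log = fit_line(x1_prime, y)   # model linear in X'1
coef_raw, rss_raw = fit_line(x1, y)         # model linear in X1, for comparison

print("estimates of B0, B1 using X'1:", coef_log)
print("residual sum of squares with X'1:", rss_log)
print("residual sum of squares with X1: ", rss_raw)
```

The fit on the transformed predictor recovers the coefficients of the logarithmic relationship, while the straight line in X1 leaves a larger residual sum of squares on these data.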

