Statistics Tutor: Multiple Linear Regression in R, Matlab, SPSS, SAS, Stata

Statistical & Financial Consulting by Stanford PhD

MULTIPLE LINEAR REGRESSION

Multiple Linear Regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is also called the response variable, while the independent variables are called the predictors. The regression model is the following:

$Y = B_0 + B_1 X_1 + ... + B_p X_p + e.$

Here Y is the response, X₁, ... , X_p are the predictors and e is a random variable with zero expectation. Variable e is called the residual (or error term). It denotes the error in predicting Y with the help of X₁, ... , X_p.

In the most general set-up, predictors X₁, ..., X_p are assumed to be stochastic, and residual e has zero expectation conditional on X₁, ..., X_p. Variable e does not have to be independent of X₁, ... , X_p. It is the error of measurement, so its variance may depend on various relevant factors, such as the ones recorded in X₁, ... , X_p.

To estimate the regression model we observe several realizations of the random vector (Y, X₁, ... , X_p). Note that variable e is not observed. However, if we find a way to estimate regression coefficients B₀, ... , B_p accurately, we can arrive at an accurate estimate of e. Let us denote the realizations of vector (Y, X₁, ..., X_p) as (Y₁, X₁₁, ..., X_p1), ... , (Y_n, X_1n, ..., X_pn). There are unobserved residuals e₁, ... , e_n corresponding to these realizations.
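The set-up above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the coefficient values, the sample size, the noise level and the use of NumPy's `lstsq` as the fitting routine are all choices made for illustration and do not come from the text. The point is that e is known here only because we generated it; in practice it is estimated from the fitted coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility only

n, p = 200, 2                         # hypothetical sample size and number of predictors
B_true = np.array([1.0, 2.0, -0.5])   # hypothetical B_0, B_1, B_2

X = rng.normal(size=(n, p))           # stochastic predictors, one row per realization
e = rng.normal(scale=0.3, size=n)     # residuals: unobserved in a real data set
Y = B_true[0] + X @ B_true[1:] + e    # Y = B_0 + B_1 X_1 + B_2 X_2 + e

# Design matrix with a leading column of ones for the intercept B_0.
A = np.column_stack([np.ones(n), X])

# Least-squares estimates of B_0, ..., B_p.
B_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Estimated residuals: what is left of Y after the fitted linear part is removed.
e_hat = Y - A @ B_hat

print("estimated coefficients:", B_hat)
print("correlation of e_hat with the true e:", np.corrcoef(e_hat, e)[0, 1])
```

With accurate coefficient estimates, the estimated residuals track the true ones closely, which is the sense in which e can be recovered even though it is never observed.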

1] If e₁, ... , e_n are assumed to be independent and to have the same variance, the regression model is estimated using the method of Ordinary Least Squares (OLS).

2] If e₁, ... , e_n are assumed to be independent and to have possibly different variances, the regression model is estimated using the method of Weighted Least Squares (WLS).

3] If e₁, ... , e_n are assumed to be correlated and to have possibly different variances, the regression model is estimated using the method of Generalized Least Squares (GLS).
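The three estimators listed above can be sketched with their textbook closed forms. The function names `ols`, `wls` and `gls`, and the simulated data, are hypothetical and serve only as illustration. The sketch also assumes the residual variances (for WLS) or the full residual covariance matrix `Sigma` (for GLS) are known, which is rarely the case in practice.

```python
import numpy as np

def ols(A, Y):
    """OLS: solve (A'A) B = A'Y."""
    return np.linalg.solve(A.T @ A, A.T @ Y)

def wls(A, Y, variances):
    """WLS: each observation is weighted by 1 / (its residual variance)."""
    w = 1.0 / np.asarray(variances)
    return np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * Y))

def gls(A, Y, Sigma):
    """GLS: solve (A' Sigma^-1 A) B = A' Sigma^-1 Y for a symmetric covariance Sigma."""
    Si_A = np.linalg.solve(Sigma, A)          # Sigma^-1 A, without forming the inverse
    return np.linalg.solve(A.T @ Si_A, Si_A.T @ Y)

# Illustration on simulated data whose residual variance grows with the predictor.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
variances = 0.25 * x**2                       # assumed known here
Y = 1.0 + 2.0 * x + rng.normal(size=n) * np.sqrt(variances)
A = np.column_stack([np.ones(n), x])

B_ols = ols(A, Y)
B_wls = wls(A, Y, variances)
B_gls = gls(A, Y, np.diag(variances))         # diagonal Sigma reduces GLS to WLS

print("OLS:", B_ols)
print("WLS:", B_wls)
print("GLS with diagonal Sigma:", B_gls)
```

OLS is the special case of WLS with equal variances, and WLS is the special case of GLS with a diagonal covariance matrix, so the three methods form a nested family.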

The data do not have to be jointly normal (Gaussian) for the estimation to work. In fact, the following is true: if the residuals are normal conditional on predictors (X₁₁, ..., X_p1), ... , (X_1n, ..., X_pn), then the OLS / WLS / GLS method is the most statistically efficient (accurate). Depending on whether the correlation and variance structure of residuals e₁, ... , e_n is known, variations of WLS and GLS are used. Note that X₁, ..., X_p and Y do not have to be normal.
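A quick simulation can make the point about normality concrete. In this sketch the predictors are uniform and the residuals follow a centered exponential distribution, so nothing in the data is Gaussian. All numerical values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
B_true = np.array([0.5, 1.5, -2.0])              # hypothetical coefficients

X = rng.uniform(-3.0, 3.0, size=(n, 2))          # non-normal predictors
e = rng.exponential(1.0, size=n) - 1.0           # skewed residuals with zero mean
Y = B_true[0] + X @ B_true[1:] + e

A = np.column_stack([np.ones(n), X])
B_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)    # ordinary least squares fit

print("true coefficients:     ", B_true)
print("estimated coefficients:", B_hat)
```

The estimates still land close to the true coefficients. Normality of the residuals matters for how efficient the estimator is, not for whether it can be computed.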

After the estimation has been done, it is time for diagnostics. The estimates of regression coefficients B₀, ... , B_p as well as the estimates of correlations and variances of the residuals are used in the following procedures.

1] Whether the true value of coefficient B_i is 0 is tested with a t-test. If B_i = 0, then predictor X_i does not have any linear predictive power beyond that of the other predictors and can be dropped from the model. The test is conditional on the other predictors staying in the model. If we conclude that B_i ≠ 0, then predictor X_i is called statistically significant.

2] Whether the true values of several coefficients B_i1, ... , B_ik are all 0 is tested with an F-test. If B_i1 = 0, ... , B_ik = 0, then predictors X_i1, ... , X_ik do not have any linear predictive power beyond that of the remaining predictors and can be dropped from the model. Again, the test is conditional on the other predictors staying in the model.

3] Whether predictors X₁, ..., X_p explain a large part of the variation in Y is assessed using the R-squared, the adjusted R-squared, Mallows' Cp and other statistics.

4] Outliers and high-leverage points are diagnosed using Cook's distance, the leverage statistic and other relevant metrics. If flawed data points are detected, they are dropped from the data set.
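The four diagnostics above can be sketched for an OLS fit using the standard textbook formulas. The simulated data set is hypothetical: by construction only the first predictor matters, so the second and third are natural candidates for the joint F-test. The sketch stops at the test statistics. Turning them into p-values requires the t and F distributions, which are left out here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)    # X_2 and X_3 are irrelevant by construction

A = np.column_stack([np.ones(n), X])            # design matrix with intercept
k = A.shape[1]                                  # number of coefficients, p + 1

XtX_inv = np.linalg.inv(A.T @ A)
B_hat = XtX_inv @ A.T @ Y
resid = Y - A @ B_hat
rss = resid @ resid                             # residual sum of squares
s2 = rss / (n - k)                              # estimate of the residual variance

# 1] t-statistic for each coefficient: estimate divided by its standard error.
se = np.sqrt(s2 * np.diag(XtX_inv))
t_stats = B_hat / se

# 2] F-statistic for H0: B_2 = B_3 = 0, comparing the full and the reduced model.
A_red = A[:, :2]                                # intercept and X_1 only
B_red, *_ = np.linalg.lstsq(A_red, Y, rcond=None)
rss_red = np.sum((Y - A_red @ B_red) ** 2)
q = k - A_red.shape[1]                          # number of coefficients being tested
F = ((rss_red - rss) / q) / s2

# 3] R-squared and adjusted R-squared.
tss = np.sum((Y - Y.mean()) ** 2)
r2 = 1.0 - rss / tss
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# 4] Leverage (diagonal of the hat matrix) and Cook's distance for each point.
h = np.einsum("ij,jk,ik->i", A, XtX_inv, A)
cooks_d = resid**2 / (k * s2) * h / (1.0 - h) ** 2

print("t-statistics:", t_stats)
print("F-statistic for B_2 = B_3 = 0:", F)
print("R-squared:", r2, "adjusted R-squared:", adj_r2)
print("largest leverage:", h.max(), "largest Cook's distance:", cooks_d.max())
```

Each t-statistic is compared with a t distribution with n - k degrees of freedom, and the F-statistic with an F distribution with q and n - k degrees of freedom.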

As the next stage, we experiment with adding new predictors and dropping some of the old ones to see what the resulting models are. In the end, we want to have a collection of candidate models in which each predictor is significant. We then choose the best model using one of the standard model selection methods. This model can be used for prediction on a completely new data set.
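The text does not say which model selection method to use, so the following sketch picks one common option as an assumption: the Akaike information criterion (AIC) under Gaussian residuals, computed up to an additive constant. The helper name `aic` and the simulated data are hypothetical. The sketch enumerates every non-empty subset of three predictors and keeps the subset with the smallest AIC.

```python
import itertools
import numpy as np

def aic(A, Y):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * (number of coefficients)."""
    n, k = A.shape
    B_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
    rss = np.sum((Y - A @ B_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
# Hypothetical truth: only the first two predictors enter the model.
Y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

scores = {}
for size in range(1, 4):
    for subset in itertools.combinations(range(3), size):
        A = np.column_stack([np.ones(n), X[:, list(subset)]])
        scores[subset] = aic(A, Y)               # lower is better

best = min(scores, key=scores.get)
print("AIC by predictor subset:", scores)
print("selected predictors (0-based column indices):", best)
```

Exhaustive enumeration is feasible only for a small number of predictors. Other criteria, or evaluation on held-out data, can be substituted for `aic` without changing the loop.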

Note that multiple linear regression is called "linear" because of the linear dependence of the response on the regression coefficients. The relationship between the response and the predictors does not have to be linear. Even if the true relationship is the following:

$Y = B_0 + B_1 \log(X_1) + e,$

we can denote X'₁ = log(X₁) and make the model linear in the predictor:

$Y = B_0 + B_1 X_1' + e.$

The new notation does not change the estimates of coefficients B₀ and B₁ at all.
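The logarithmic example can be checked numerically. In this sketch the data are simulated with hypothetical values B₀ = 3 and B₁ = 2 plus noise, and the model is fitted by ordinary least squares on the transformed predictor X'₁ = log(X₁).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = rng.uniform(1.0, 10.0, size=n)                  # positive, so the logarithm is defined
Y = 3.0 + 2.0 * np.log(X1) + rng.normal(scale=0.5, size=n)

X1_prime = np.log(X1)                                # the transformed predictor X'_1
A = np.column_stack([np.ones(n), X1_prime])          # the model is linear in B_0 and B_1
B_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

print("estimated B_0 and B_1:", B_hat)
```

The fitting step is exactly the one used for any other linear regression. Only the column fed into the design matrix has changed.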
