Statistical & Financial Consulting by Stanford PhD
PRINCIPAL COMPONENT ANALYSIS

Principal Component Analysis (PCA) is a form of data compression. Suppose we have information stored in P correlated random variables. Since the variables are correlated, they contain less information than P uncorrelated ones would. In a degenerate example, when one variable is a linear combination of the others, the amount of stored information corresponds to at most P - 1 uncorrelated variables.

PCA delivers the optimal way of approximating the P variables with linear combinations of Q uncorrelated factors, where Q < P. Optimality here means minimizing the variances of the discrepancies between the variables and their approximations. The uncorrelated factors are called principal components. In this way we "compress" the P original variables into Q uncorrelated factors: knowing the values of those factors allows us to approximate the values of the original variables at any time.

The principal components are calculated via an eigendecomposition of the covariance matrix V of the original variables. Since V is symmetric and positive semi-definite, this coincides with its singular value decomposition (SVD). The decomposition has the form:

$V = U D U',$

where U is an orthogonal matrix and D is a diagonal matrix with the values decreasing along the diagonal. The columns of U are the eigenvectors of V, and the diagonal of D contains the corresponding eigenvalues of V.
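This decomposition can be sketched in Python with NumPy. This is a minimal illustration on simulated data (the variable names and the simulated covariance structure are assumptions, not part of the method); note that `np.linalg.eigh` returns eigenvalues in ascending order, so we reverse them to match the convention above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate 500 observations of P = 3 correlated variables.
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.5, 0.2],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 1.0]])

V = np.cov(X, rowvar=False)          # P-by-P sample covariance matrix

# Eigendecomposition V = U D U'. eigh is designed for symmetric
# matrices; it returns eigenvalues in ascending order, so reverse.
eigenvalues, U = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]
D = np.diag(eigenvalues)

# Sanity check: the factors recombine into V.
assert np.allclose(U @ D @ U.T, V)
```

The columns of `U` are then the eigenvectors of `V`, and the diagonal of `D` holds the eigenvalues in decreasing order.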

Let X be a P-by-1 vector containing the values of the original random variables. Then the P-by-1 vector of principal components is calculated as

$PC = (PC_1, PC_2, \dots , PC_P)' = U' X.$

The first Q coordinates of PC are the Q factors needed to compress the original variables. We see that the principal components are the result of an orthogonal transformation of the original variables.
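The projection $PC = U'X$ and the compression to Q components can be illustrated as follows. This is a hedged sketch on simulated NumPy data (names and the value of Q are illustrative); in practice the variables are centered before projecting, and the approximation error is confined to the discarded directions.

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 observations of P = 3 correlated variables (rows = observations).
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.8, 0.6],
                                              [0.0, 0.5, 0.4],
                                              [0.0, 0.0, 0.3]])
Xc = X - X.mean(axis=0)              # center the variables

V = np.cov(Xc, rowvar=False)
eigenvalues, U = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

PC = Xc @ U                          # each row is U' x for one observation

# Compress: keep only the first Q components, then map back.
Q = 2
X_hat = PC[:, :Q] @ U[:, :Q].T       # approximation of the centered data

# Total variance of the residual equals the sum of the discarded
# eigenvalues -- the information lost in compression.
residual_var = np.var(Xc - X_hat, axis=0, ddof=1).sum()
```

Here `X_hat` is the best rank-Q linear reconstruction of the centered data, and `residual_var` matches the sum of the eigenvalues of the components that were thrown away.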

The first principal component PC1 equals the inner product of X and the first eigenvector of V (the first column of U). PC1 has variance equal to the first diagonal element of D, which is the largest eigenvalue of V. For that reason, PC1 has the largest variance among all random variables formed as the inner product of X and a vector of unit length. The second principal component PC2 equals the inner product of X and the second eigenvector of V. PC2 has variance equal to the second diagonal element of D, which is the second largest eigenvalue of V. For that reason, PC2 has the largest variance among such unit-length projections that are uncorrelated with PC1. And so on.
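The claim that the variance of each principal component equals the corresponding eigenvalue, and that the components are mutually uncorrelated, can be checked numerically. The sketch below uses simulated NumPy data (the dimensions and seed are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
# 1000 observations of P = 4 correlated variables.
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
Xc = X - X.mean(axis=0)

V = np.cov(Xc, rowvar=False)
eigenvalues, U = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

PC = Xc @ U

# Var(PC_k) equals the k-th largest eigenvalue of V, and the
# off-diagonal covariances between components are zero.
pc_var = np.var(PC, axis=0, ddof=1)
pc_cov = np.cov(PC, rowvar=False)
```

Since the covariance of `PC` is `U' V U = D`, the array `pc_var` reproduces `eigenvalues` and `pc_cov` is diagonal up to floating-point error.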

PCA is closely related to factor analysis. Factor analysis typically imposes more constraints on the underlying factors and the structure of the random shocks. As a result, factor analysis solves for the eigenvectors of a slightly different matrix.

