Principal Component Analysis
CURSE OF DIMENSIONALITY
As the number of features (dimensionality) increases, the data becomes increasingly sparse, and often exponentially more samples are needed to make statistically significant predictions.
The curse of dimensionality refers to the fact that many problems which do not arise in low-dimensional space appear in high-dimensional space. Because training data becomes sparse in high-dimensional space, it is very difficult to identify patterns in the data without plenty of training samples.
Imagine going from a $10 \times 10$ grid to a $10 \times 10 \times 10$ grid. If we want ONE sample in each ‘$1 \times 1$’ cell, then adding a third parameter requires 10 times as many samples (1,000) as we needed with 2 parameters (100).
High-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
Linear models without feature selection or regularization, kNN, and Bayesian models are among the models most affected by the curse of dimensionality. Models that are less affected include regularized models, random forests, some neural networks, and stochastic models (e.g., Monte Carlo simulations).
PROJECTION
In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.
Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset.
We have just reduced the dataset’s dimensionality from 3D to 2D. Note that the axes correspond to new features $z_1$ and $z_2$ (the coordinates of the projections on the plane).
However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, such as in the famous Swiss roll toy dataset.
Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left. However, what you really want is to unroll the Swiss roll to obtain the 2D dataset on the right.
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
Preserving Variance
Before you project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is represented on the left along with three different axes (i.e., one-dimensional hyperplanes). On the right is the result of the projection of the dataset onto each of these axes.
As one can easily see, the projection onto the solid line preserves the maximum variance, the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance. It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. “The maximum variance” and “the minimum error” are reached at the same time. This is the rather simple idea behind PCA.
Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. In the example above it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number of dimensions in the dataset.
The unit vector that defines the $i$th axis is called the $i$th principal component (PC). In the example above, the 1st PC is $c_1$ and the 2nd PC is $c_2$.
Every principal component is ALWAYS orthogonal (the mathematical term for perpendicular) to every other principal component. Because the principal components are orthogonal to one another, the resulting features are uncorrelated, which also solves the problem of multicollinearity.
Here, the first two PCs are represented by the orthogonal arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).
Eigenvalues are the variances of the principal components. In other words, the first eigenvalue is the variance of the first principal component, the second eigenvalue is the variance of the second principal component and so on. Thus, because of the way the principal components are selected, the first eigenvalue will be the largest, the second the next largest etc. There will be $p$ eigenvalues where $p$ is the number of features in the dataset but some may be zero.
NOTE: The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs. However, they will generally still lie on the same axes. In some cases, a pair of PCs may even rotate or swap, but the plane they define will generally remain the same.
How to find PCs?
There is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix $X$ into the product of three matrices $U \cdot \Sigma \cdot V^{T}$, where $V$ contains, as its columns, all the principal components that we are looking for:
\[V = \begin{bmatrix} \uparrow & \uparrow & \ldots & \uparrow \\ c_{1} & c_{2} & \ldots & c_{n} \\ \downarrow & \downarrow & \ldots & \downarrow \\ \end{bmatrix}\]

The following Python code uses NumPy’s `svd()` function to obtain all the principal components of the training set, then extracts the first two PCs:
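A minimal sketch of what that code might look like (assuming `X` holds the 3D training set from the illustrations above):

```python
import numpy as np

X_centered = X - X.mean(axis=0)        # PCA assumes the data is centered (see the note below)
U, s, Vt = np.linalg.svd(X_centered)   # rows of Vt (columns of V) are the principal components
c1 = Vt[0]                             # first principal component (unit vector)
c2 = Vt[1]                             # second principal component
```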
Note that PCA assumes that the dataset is centered around the origin. As we will see, Scikit-Learn’s PCA classes take care of centering the data for you. However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, don’t forget to center the data first.
Projecting Down to d Dimensions
Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix $X$ and the matrix $W_d$, defined as the matrix containing the first $d$ principal components (i.e., the matrix composed of the first $d$ columns of $V$):
\[X_{d\text{-}projected} = X \cdot W_{d}\]

The following Python code projects the training set onto the plane defined by the first two principal components:
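Continuing the sketch above (`W2` and `X2D_using_svd` are illustrative names):

```python
W2 = Vt[:2].T                      # first two principal components, as columns
X2D_using_svd = X_centered @ W2    # project the centered training set onto the 2D hyperplane
```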
There you have it! You now know how to reduce the dimensionality of any dataset down to any number of dimensions, while preserving as much variance as possible.
PCA using Scikit-Learn
Scikit-Learn’s PCA class implements PCA using SVD decomposition just like we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):
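For instance (a sketch; the variable names are illustrative):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)   # Scikit-Learn centers the data internally
```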
Notice that running PCA multiple times on slightly different datasets may produce different results. In general the only difference is that some axes may be flipped. In this example, PCA using Scikit-Learn gives the same projection as the one given by the SVD approach, except both axes are flipped:
Now let’s recover the 3D points projected on the plane (the PCA 2D subspace):
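With the fitted `pca` object from above, this is roughly:

```python
X3D_inv = pca.inverse_transform(X2D)   # map the 2D projections back into the original 3D space
```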
Of course, there was some loss of information during the projection step, so the recovered 3D points are not exactly equal to the original 3D points:
We can compute the reconstruction error:
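For example, as the mean squared distance between the original and the recovered points:

```python
reconstruction_error = np.mean(np.sum(np.square(X3D_inv - X), axis=1))
```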
The inverse transform in the SVD approach looks like this:
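Roughly, reusing the SVD-based projection from earlier:

```python
X3D_inv_using_svd = X2D_using_svd @ Vt[:2]   # note: this reconstruction is still centered
```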
The reconstructions from both methods are not identical because Scikit-Learn’s PCA class automatically takes care of reversing the mean centering, but if we subtract the mean, we get the same reconstruction:
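Something like:

```python
np.allclose(X3D_inv_using_svd, X3D_inv - pca.mean_)   # True, up to floating-point error
```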
The PCA object gives access to the principal components that it computed:
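Via the `components_` attribute:

```python
pca.components_   # each row is one principal component (unit vector)
```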
Compare to the first two principal components computed using the SVD method:
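For comparison:

```python
Vt[:2]   # same axes as pca.components_, possibly with opposite signs
```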
Notice how the axes are flipped.
Now let’s look at the explained variance ratio:
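Via the `explained_variance_ratio_` attribute:

```python
pca.explained_variance_ratio_   # roughly [0.842, 0.146] for this dataset
```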
The first dimension explains about 84.2% of the variance, while the second explains about 14.6% (98.9% in total).
By projecting down to 2D, we lost about 1.1% of the variance:
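For example:

```python
1 - pca.explained_variance_ratio_.sum()   # about 0.011
```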
Here is how to compute the explained variance ratio using the SVD approach (recall that `s` contains the diagonal of the matrix $\Sigma$). Squaring those values gives the eigenvalues of $X^{T}X$ (the Gram matrix):
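For instance:

```python
np.square(s) / np.square(s).sum()   # explained variance ratio from the singular values
```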
Choosing the Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the smallest number of dimensions that adds up to a sufficiently large portion of the variance (e.g., 95%). You can plot the cumulative explained variance as a function of the number of dimensions (simply plot the cumsum); this is closely related to a scree plot, which plots each component’s explained variance individually:
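A sketch of this, assuming the MNIST digits dataset (784 features) discussed below:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784', as_frame=False)   # 70,000 handwritten digits, 784 features each
X_train = mnist.data[:60000]

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # smallest number of dimensions preserving 95% of the variance

plt.plot(cumsum)
plt.xlabel("Number of dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()
```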
or using the SVD approach:
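Roughly, reusing `X_train` from the sketch above:

```python
X_centered = X_train - X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)   # economy SVD keeps U small
cumsum = np.cumsum(np.square(s)) / np.square(s).sum()
d = np.argmax(cumsum >= 0.95) + 1
```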
PCA for Compression
This curve quantifies how much of the total, 784-dimensional variance is contained within the first N components. For example, we see that with the digits the first 154 components contain approximately 95% of the variance.
PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise. So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.
Obviously after dimensionality reduction, the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this can speed up a classification algorithm (such as an SVM classifier) tremendously.
It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. Of course this won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be quite close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.
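A possible sketch of this compression/decompression round trip, again reusing `X_train` from above:

```python
pca = PCA(n_components=0.95)              # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X_train)    # just over 150 features per instance
X_recovered = pca.inverse_transform(X_reduced)

# reconstruction error: mean squared distance between the original and the recovered data
reconstruction_error = np.mean(np.sum(np.square(X_recovered - X_train), axis=1))
```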
This signal preserving/noise filtering property makes PCA a very useful feature selection routine—for example, rather than training a classifier on very high-dimensional data, you might instead train the classifier on the lower-dimensional representation, which will automatically serve to filter out random noise in the inputs.
Example by Hand using Eigendecomposition
Let’s say we have a dataset which is $d+1$ dimensional, where the $d$ dimensions can be thought of as $X_{train}$ and the remaining one as $y_{train}$ (the labels) in the modern machine learning paradigm. Together, $X_{train}$ and $y_{train}$ make up our complete training dataset.
So, after we drop the labels we are left with a $d$-dimensional dataset, and this is the dataset we will use to find the principal components.
Let our data matrix $X$ be the scores of three students. In this example, we have $n=5$ and $d=3$.
Compute the mean of every dimension of the whole dataset. We will have a $3 \times 1$ array.
Let’s center the data first:
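A sketch of these two steps with a hypothetical $5 \times 3$ score matrix (the actual numbers of the worked example are not reproduced here):

```python
import numpy as np

# hypothetical scores: n = 5 samples (rows), d = 3 features (columns)
X = np.array([[90., 60., 90.],
              [90., 90., 30.],
              [60., 60., 60.],
              [60., 60., 90.],
              [30., 30., 30.]])

mean = X.mean(axis=0)      # mean of every dimension -> 3 values
X_centered = X - mean      # centered data
```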
Then, we compute the unbiased covariance matrix (sometimes also called the variance-covariance matrix) associated with the data.
\[\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13}\\ \sigma_{21} & \sigma_{22} & \sigma_{23}\\ \sigma_{31} & \sigma_{32} & \sigma_{33}\\ \end{bmatrix}\]

We can compute the covariance of two variables $X$ and $Y$ using the following formula:
\[cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_{i} - \bar{X})(Y_{i} - \bar{Y})\]

Using the above formula, we can find the covariance matrix of the centered data. The result will be a square matrix of $d \times d$ dimensions.
For this example, we will have $3 \times 3$ square matrix.
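For instance:

```python
cov = np.cov(X_centered, rowvar=False)   # unbiased covariance matrix (divides by n - 1), shape (3, 3)
```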
Compute Eigenvectors and corresponding Eigenvalues
Intuitively, an eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.
Now, we can easily compute eigenvalues and eigenvectors from the covariance matrix above.
Let $A$ be a square matrix, $v$ a vector, and $u$ a scalar satisfying $Av = uv$; then $u$ is called an eigenvalue associated with the eigenvector $v$ of $A$.
The eigenvalues of $A$ are roots of the characteristic equation
\[\det(A - uI) = 0\]

Sort the eigenvectors by decreasing eigenvalues and choose the $k$ eigenvectors with the largest eigenvalues to form a $d \times k$ dimensional matrix $W$.
We started with the goal to reduce the dimensionality of our feature space, i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace. So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.
The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top $k$ eigenvectors.
So, after sorting the eigenvalues in decreasing order, we have:
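A sketch of the computation and the sorting step:

```python
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# sort eigenvalues (and the matching eigenvectors) in decreasing order
order = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[order]
eig_vecs = eig_vecs[:, order]              # eigenvectors are the columns
```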
You can also find the explained variance ratio using eigenvalues:
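For example:

```python
explained_variance_ratio = eig_vals / eig_vals.sum()   # relative importance of each component
```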
The first two components explain $97.17\%$ of the variance.
For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our $d \times k$ dimensional eigenvector matrix $W$.
So, the eigenvectors corresponding to the two largest eigenvalues are:
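In code, the projection matrix would look roughly like:

```python
W = eig_vecs[:, :2]   # d x k = 3 x 2 matrix of the top two eigenvectors
```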
Transform the samples onto the new subspace:
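That is:

```python
X_new = X_centered @ W   # transformed samples, shape (n, k) = (5, 2)
```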
In order to make sure that we have not made a mistake in our step-by-step approach, we will verify the result with a library implementation that does not rescale the input data by default. Here, we will use the `PCA` class from the scikit-learn machine-learning library. For our convenience, we can directly specify how many components we want to reduce our input dataset to via the `n_components` parameter.
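A minimal sketch of this check:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_new_sklearn = pca.fit_transform(X)   # centering is handled internally; no rescaling
```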
Notice that running PCA multiple times on slightly different datasets may produce different results; in general, the only difference is that some axes may be flipped. In this example, PCA using Scikit-Learn gives the same projection as our step-by-step eigendecomposition, up to a possible sign flip of the axes.
WHY DOES PCA WORK?
PCA is a method that brings together:
- A measure of how each variable is associated with one another. (Covariance matrix.)
- The directions in which our data are dispersed. (Eigenvectors represent directions.)
- The relative importance of these different directions. (Eigenvalues represent magnitude, or importance.)
PCA combines our predictors and allows us to drop the eigenvectors that are relatively unimportant.