Mathematical meaning of principal component analysis (PCA)
Main idea
The idea of principal component analysis (PCA) is to change the basis in order to reduce the dimension of the input data with minimal loss of information. In other words, we try to introduce new predictors for the old data so that the information content of the new predictors is as high as possible. At the same time, we do not discard part of the data; instead we form combinations of the original features, and there are fewer of these combinations in the end. The idea is easier to understand from the image below.
The blue dots, which form, roughly speaking, an ellipse, are the data (each object has a pair of features). It seems logical that along the long axis of the ellipse the data are spread out the most, so projecting onto that axis loses the least information about the differences between the objects.
Construction of the first principal component
Definition 1. Centering is subtracting from each coordinate of each object the average value of this coordinate over all objects.
To begin with, it is always better to center the data. In essence this changes little, because we merely shift the origin of the coordinate system, but it will simplify future calculations.
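As a minimal illustration, here is what centering looks like in numpy (using the data table from the worked example later in this article):

```python
import numpy as np

# Source objects: 5 objects, 2 attributes each (the table from the example below)
X = np.array([[ 9., 19.],
              [ 6., 22.],
              [11., 27.],
              [12., 25.],
              [ 7., 22.]])

F = X - X.mean(axis=0)    # subtract the column means: centered data
print(F.mean(axis=0))     # both means are now (numerically) zero
```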
Let's consider a set of $n$ centered objects $x^{(1)}, \dots, x^{(n)}$, each described by a pair of features, so that $x^{(i)} = (x^{(i)}_1, x^{(i)}_2)$.

After reducing the basis to a single vector, each object $x^{(i)}$ will be described by one new coordinate $t_i$ instead of the two old ones.

Then the new coordinates of all objects can be written as a column vector $t = (t_1, \dots, t_n)^T$.

Since we want to draw the new line so that the new coordinates differ from each other as much as possible, we will use the sample variance as the measure of difference.

Definition 2. A straight line passing through the origin, such that the coordinates of the projections of the centered source objects onto it have the largest sample variance, is called the first principal component and is denoted $PC_1$.

Definition 3. The direction vector of unit length of the first principal component is called the vector of weights of the first principal component and is denoted $w$.

To find the new coordinate $t_i$ of an object, we project it onto this line: $t_i = \langle x^{(i)}, w \rangle = x^{(i)}_1 w_1 + x^{(i)}_2 w_2$.

So, our task is to find weights $w_1, w_2$ (with $w_1^2 + w_2^2 = 1$) for which the sample variance of the new coordinates $t_1, \dots, t_n$ is maximal.
The second and subsequent principal components
Let the initial feature space have dimension $p \ge 2$, and suppose we need to construct not one but $k$ principal components. The first principal component is constructed exactly as in the example above. The second and subsequent principal components are constructed so that they also pass through the origin and are orthogonal to all previously constructed principal components. The orthogonality requirement guarantees that there is no correlation between the new features of the objects. Subject to it, each next principal component is again chosen so that the sample variance of the projections is maximal.
Finding the vector of weights of the first principal component
Definition 4. Let $n$ objects with $p$ features each be written as the matrix

$$F = \begin{pmatrix} f_{11} & \dots & f_{1p} \\ \vdots & \ddots & \vdots \\ f_{n1} & \dots & f_{np} \end{pmatrix},$$

where the $i$-th row of the matrix $F$ contains the features of the $i$-th object.

Next, we assume that the matrix $F$ is obtained by centering some initial matrix of objects $F'$.

Remark. Since we have $p$ features, the direction vector (the vector of weights) will also have $p$ coordinates. In matrix notation, we will write the coordinates of this vector as a column vector of the same name:

$$w = (w_1, \dots, w_p)^T.$$

Definition 5. The score vector of the first principal component is the column vector $t = Fw$, where $w$ is the vector of weights of the first principal component. In other words, the score vector of the first principal component consists of the new coordinates of the objects relative to the first principal component.
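A short numpy sketch of Definition 5 (the data and weights here are arbitrary illustrative values):

```python
import numpy as np

# Hypothetical centered data matrix F: n = 3 objects, p = 2 features
F = np.array([[ 1., -2.],
              [-3.,  1.],
              [ 2.,  1.]])

# Some unit vector of weights w (p coordinates)
w = np.array([0.6, 0.8])

t = F @ w       # score vector: the new coordinates of the objects
print(t)        # one new coordinate per object
```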
Theorem 1. Let $F$ be a matrix consisting of centered source data, and let $t = Fw$ be the score vector corresponding to a vector of weights $w$. Then the sample variance of the coordinates of $t$ equals $w^T \Theta\, w$, where $\Theta = \frac{1}{n} F^T F$ is the sample covariance matrix.

Let me remind you that the sample variance of the values $t_1, \dots, t_n$ is found by the formula:

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} \left( t_i - \bar{t} \right)^2,$$

which can be transformed as follows:

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} t_i^2 - \frac{2\bar{t}}{n} \sum_{i=1}^{n} t_i + \bar{t}^2 = \frac{1}{n} \sum_{i=1}^{n} t_i^2 - \bar{t}^2.$$

But taking into account the fact that the features of $F$ are centered, their averages are zero, that is,

$$\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} f_{ij} w_j = \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} f_{ij} \right) = 0,$$

so the last term vanishes. Passing to matrix notation, we get:

$$S^2 = \frac{1}{n} t^T t = \frac{1}{n} (Fw)^T (Fw) = w^T \frac{F^T F}{n}\, w = w^T \Theta\, w.$$

So we are looking for a vector of weights such that the square of its length is equal to one and the sample variance of the scores is maximal. That is,

$$w = \arg\max_{\|w\| = 1} \; w^T \Theta\, w.$$
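The identity from Theorem 1 is easy to check numerically. A sketch with random data (the $1/n$ convention for the sample variance is the one assumed throughout this article):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
F = X - X.mean(axis=0)           # centered data
theta = F.T @ F / len(F)         # sample covariance matrix (1/n convention)

w = rng.normal(size=3)
w /= np.linalg.norm(w)           # unit vector of weights

t = F @ w                        # scores
print(np.mean(t**2))             # sample variance of the scores (their mean is 0)
print(w @ theta @ w)             # the same number, as Theorem 1 states
```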
Finding an arbitrary principal component
Let the first principal component $PC_1$ with the vector of weights $w^{(1)}$ already be constructed. Then the second principal component $PC_2$ is a straight line through the origin, orthogonal to $PC_1$, on which the sample variance of the projections is maximal.

And in general, let the first $k-1$ principal components already be constructed. Then the $k$-th principal component $PC_k$ is a straight line through the origin, orthogonal to each of $PC_1, \dots, PC_{k-1}$, on which the sample variance of the projections is maximal.

Similarly to the previous case, the definitions of the direction vector and the score vector are introduced, and the formula for calculating the score vector of the $k$-th principal component remains the same: $t^{(k)} = F w^{(k)}$.
A small conclusion on the calculation of the k-th principal component
Summing up, the search for the $k$-th principal component, or, what is the same thing, for its vector of weights $w^{(k)}$, reduces to the following steps (see the sketch after this list):

1) It is required that the vector have unit length, that is, $\langle w^{(k)}, w^{(k)} \rangle = 1$.

2) It is required that the vector be orthogonal to the weight vectors of all previously constructed principal components, that is, $\langle w^{(k)}, w^{(j)} \rangle = 0$ for $j = 1, \dots, k-1$.

3) Then we solve the problem $w^{(k)} = \arg\max w^T \Theta\, w$ subject to these two constraints.

4) Afterwards, using the formula $t^{(k)} = F w^{(k)}$, we find the new coordinates of the objects along the $k$-th principal component.
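One simple way to carry out these steps numerically is power iteration with orthogonalization against the already found weight vectors. A minimal sketch (the helper name is my own; power iteration is just one of several ways to solve this constrained maximization):

```python
import numpy as np

def next_weight_vector(F, prev_ws, n_iter=500, seed=0):
    """Finds the next vector of weights: maximizes w^T Theta w subject to
    ||w|| = 1 and orthogonality to the previously found weight vectors."""
    n, p = F.shape
    theta = F.T @ F / n                     # sample covariance matrix
    w = np.random.default_rng(seed).normal(size=p)
    for _ in range(n_iter):
        w = theta @ w                       # pull w toward the dominant direction
        for v in prev_ws:
            w -= (w @ v) * v                # project out earlier components
        w /= np.linalg.norm(w)              # keep unit length
    return w

# Usage: the first two components, one after another
# w1 = next_weight_vector(F, [])
# w2 = next_weight_vector(F, [w1])
# t2 = F @ w2        # score vector of the second principal component
```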
Example of finding the first principal component
Suppose we have a table of objects, each containing two attributes:
| Object $x^{(i)}$ | attribute 1 ($X'^{(1)}$) | attribute 2 ($X'^{(2)}$) |
|---|---|---|
| 1 | 9 | 19 |
| 2 | 6 | 22 |
| 3 | 11 | 27 |
| 4 | 12 | 25 |
| 5 | 7 | 22 |
Let's perform centering of features:
| Object $x^{(i)}$ | attribute 1 ($X^{(1)}$) | attribute 2 ($X^{(2)}$) |
|---|---|---|
| 1 | 0 | -4 |
| 2 | -3 | -1 |
| 3 | 2 | 4 |
| 4 | 3 | 2 |
| 5 | -2 | -1 |
According to this table, we form the matrix

$$F = \begin{pmatrix} 0 & -4 \\ -3 & -1 \\ 2 & 4 \\ 3 & 2 \\ -2 & -1 \end{pmatrix}.$$

Further, under the condition $\|w\| = 1$, that is, $w_1^2 + w_2^2 = 1$, the new coordinate of the $i$-th object is $t_i = f_{i1} w_1 + f_{i2} w_2$.

Now we need to maximize the sample variance of the new coordinates (their mean is zero, since the data are centered):

$$S^2 = \frac{1}{5} \sum_{i=1}^{5} t_i^2 = \frac{1}{5} \left( (-4w_2)^2 + (-3w_1 - w_2)^2 + (2w_1 + 4w_2)^2 + (3w_1 + 2w_2)^2 + (-2w_1 - w_2)^2 \right).$$

Let's open the brackets and collect similar terms:

$$S^2 = 5.2\,w_1^2 + 7.6\,w_2^2 + 7.6\,w_1 w_2.$$

Given the condition $w_1^2 + w_2^2 = 1$, we can set $w_1 = \cos\varphi$, $w_2 = \sin\varphi$; then $S^2 = 6.4 - 1.2\cos 2\varphi + 3.8\sin 2\varphi$, and the maximum $S^2 \approx 10.39$ is attained at $\varphi \approx 53.8^\circ$, that is, at $w_1 \approx 0.59$, $w_2 \approx 0.81$.

Since the direction of the first principal component is given by the vector of weights, the first principal component is the straight line through the origin with direction $w \approx (0.59,\ 0.81)^T$, and the new coordinates of the objects are $t = Fw$.
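We can double-check this answer with a few lines of numpy (a sketch; note that np.linalg.eigh returns eigenvalues in ascending order, so the last column is the one we need):

```python
import numpy as np

F = np.array([[ 0., -4.],
              [-3., -1.],
              [ 2.,  4.],
              [ 3.,  2.],
              [-2., -1.]])

theta = F.T @ F / len(F)            # sample covariance matrix, 1/n convention
vals, vecs = np.linalg.eigh(theta)  # eigenvalues in ascending order

w = vecs[:, -1]                     # eigenvector of the largest eigenvalue
print(vals[-1])                     # ~10.39: the maximal sample variance
print(w)                            # ~[0.59, 0.81] (possibly with flipped sign)
print(F @ w)                        # scores: new coordinates of the objects
```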
Choosing the number of principal components
Definition 9. A nonzero vector $v$ is called an eigenvector of a square matrix $A$ if $Av = \lambda v$ for some number $\lambda$; this number is called the eigenvalue corresponding to $v$.
Let's consider the sequence of actions for finding the principal components. Let $F$ be a centered matrix containing information about $n$ objects with $p$ features.
1) Find the sample covariance matrix from the relation $\Theta = \frac{1}{n} F^T F$.

2) Find the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$ of the matrix $\Theta$.

3) Find the orthonormal eigenvectors $w^{(1)}, \dots, w^{(p)}$ corresponding to these eigenvalues.

4) Select the required number of principal components. As the vector of weights of the first principal component, it is reasonable to take the eigenvector of the matrix $\Theta$ corresponding to the largest eigenvalue of this matrix (so that the loss of initial information is the least). As the vector of weights of the second principal component, take the eigenvector corresponding to the second largest eigenvalue, and so on.

5) Find the new coordinates of the objects in the selected basis by performing the multiplication $T = FW$, where the columns of $W$ are the selected weight vectors (see the sketch after this list).
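Here is a compact numpy sketch of these five steps (the function name and the choice to return the eigenvalues alongside the scores are my additions):

```python
import numpy as np

def pca(X, k):
    """Steps 1-5: returns the scores T (n x k), the weight matrix W (p x k)
    and the eigenvalues of Theta in descending order."""
    F = X - X.mean(axis=0)              # centering
    theta = F.T @ F / len(F)            # 1) sample covariance matrix
    vals, vecs = np.linalg.eigh(theta)  # 2), 3) eigenvalues, orthonormal eigenvectors
    order = np.argsort(vals)[::-1]      # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    W = vecs[:, :k]                     # 4) weight vectors of the first k components
    return F @ W, W, vals               # 5) T = F W

# T, W, vals = pca(X, k=1)
```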
Remark. Note that the eigenvalue $\lambda_k$ is exactly the sample variance of the scores of the $k$-th principal component, so the sum $\lambda_1 + \dots + \lambda_p$ can be interpreted as the total variance of the data.
Definition 10. Let the first $k$ of $p$ principal components be kept. Then the quantity

$$E = \frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \dots + \lambda_p}$$

is called the proportion of the explained variance.

Remark. It is easy to notice that $0 < E \le 1$, and that $E = 1$ when all $p$ components are kept; the closer $E$ is to one, the less information is lost.
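In code this is a one-liner; the eigenvalues below are the ones obtained in the example that follows:

```python
import numpy as np

vals = np.array([10.38, 2.42])       # eigenvalues of Theta, descending
k = 1
E = vals[:k].sum() / vals.sum()      # proportion of the explained variance
print(E)                             # ~0.81
```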
Example of finding the first principal component via the covariance matrix
Suppose we have the same table of objects as in the previous example, each containing two attributes:
| Object $x^{(i)}$ | attribute 1 ($X'^{(1)}$) | attribute 2 ($X'^{(2)}$) |
|---|---|---|
| 1 | 9 | 19 |
| 2 | 6 | 22 |
| 3 | 11 | 27 |
| 4 | 12 | 25 |
| 5 | 7 | 22 |
Let's perform centering of features:
| Object $x^{(i)}$ | attribute 1 ($X^{(1)}$) | attribute 2 ($X^{(2)}$) |
|---|---|---|
| 1 | 0 | -4 |
| 2 | -3 | -1 |
| 3 | 2 | 4 |
| 4 | 3 | 2 |
| 5 | -2 | -1 |
According to this table, we form the matrix

$$F = \begin{pmatrix} 0 & -4 \\ -3 & -1 \\ 2 & 4 \\ 3 & 2 \\ -2 & -1 \end{pmatrix}.$$

Let's compute the sample covariance matrix:

$$\Theta = \frac{1}{5} F^T F = \begin{pmatrix} 5.2 & 3.8 \\ 3.8 & 7.6 \end{pmatrix}.$$
Let's find the eigenvalues of this matrix. Recall that the eigenvalues are found from the condition that the determinant of the difference $\Theta - \lambda I$ equals zero:

$$\det(\Theta - \lambda I) = (5.2 - \lambda)(7.6 - \lambda) - 3.8^2 = \lambda^2 - 12.8\lambda + 25.08 = 0.$$

We get the following roots: $\lambda_1 \approx 10.38$, $\lambda_2 \approx 2.42$.
Since for the first principal component we choose the maximum among all eigenvalues, and by the definition of an eigenvector $(\Theta - \lambda_1 I)\,w = 0$, we obtain the equation

$$(5.2 - 10.38)\,w_1 + 3.8\,w_2 = 0.$$

And, accordingly, we will find the roots of this equation: $w_2 \approx 1.36\,w_1$, which together with the condition $w_1^2 + w_2^2 = 1$ gives $w \approx (0.59,\ 0.81)^T$.

Then, if we keep only the first principal component, the new coordinates of the objects are $t = Fw$, and the answer coincides with the one obtained earlier by direct maximization.
Let's find the proportion of the explained variance if we leave one principal component:

$$E = \frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{10.38}{10.38 + 2.42} \approx 0.81.$$

We got about 0.81, a good result: one new feature instead of two preserves roughly 81% of the variance of the data.
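The same numbers drop out of a few lines of numpy (a sketch repeating all the steps end to end):

```python
import numpy as np

X = np.array([[ 9., 19.],
              [ 6., 22.],
              [11., 27.],
              [12., 25.],
              [ 7., 22.]])

F = X - X.mean(axis=0)              # centering
theta = F.T @ F / len(F)            # sample covariance matrix
vals, vecs = np.linalg.eigh(theta)  # eigenvalues in ascending order

print(vals)                         # ~[2.42, 10.38]
print(vecs[:, -1])                  # ~[0.59, 0.81] up to sign
print(vals[-1] / vals.sum())        # proportion of explained variance ~0.81
```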
Conclusion
In this article, we examined the key idea of principal component analysis, learned how to find the vectors of weights of the principal components and how to measure how much information is lost during the transformation, and worked through a couple of examples.