3.5 Optimal Reconstitution of Data

Principal component analysis allows us to approximate a data matrix, generally of column rank $$p$$, by a matrix of lower rank built from the first few eigenvalues and their corresponding eigenvectors.

Formula (3.5), derived from the singular value decomposition (SVD), lets us approximate the original values $$x_{ij}$$ by a factorization built from the eigenvalues and eigenvectors. In other words, we can obtain an approximate reconstitution of the data values using only the first $$q$$ eigenvalues and their corresponding eigenvectors from the SVD.

$\hat{x}^{q}_{ij} = \sum_{\alpha = 1}^{q} \sqrt{\lambda_{\alpha}} \hspace{1mm} v_{i\alpha} u_{j\alpha} \tag{3.5}$

The term $$\hat{x}^{q}_{ij}$$ is an approximation of the observed value $$x_{ij}$$ obtained from a small set of coefficients calculated in a PCA: the eigenvalues $$\lambda_{\alpha}$$ and the eigenvectors $$v_{i\alpha}$$ and $$u_{j\alpha}$$, for $$\alpha = 1$$ to $$q$$.

This reconstitution is optimal in the sense that, for each $$q$$, it provides the best least-squares approximation of the original matrix among all matrices of rank $$q$$: it minimizes the sum of squared deviations between the observed values and the approximated values:

$\min \left \{ \sum_{i} \sum_{j} (x_{ij} - \hat{x}_{ij}^{q})^2 \right \}$

It can be proved that:

$\sum_{i} \sum_{j} (x_{ij} - \hat{x}^{q}_{ij})^2 = \sum_{\alpha = q+1}^{p} \lambda_{\alpha} \tag{3.6}$

The sum of the $$p-q$$ excluded eigenvalues measures the amount of error when approximating the original cloud of points by its projection onto a subspace of dimension $$q$$.
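Equations (3.5) and (3.6) can be checked numerically. The following is a minimal sketch assuming Python with NumPy; the data matrix and the choice $$q = 2$$ are illustrative.

```python
import numpy as np

# Illustrative data matrix with n = 20 rows and p = 5 columns
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))

# SVD: X = U diag(s) V^T, where the singular values satisfy s_alpha = sqrt(lambda_alpha)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

q = 2  # number of retained axes

# Rank-q reconstitution of formula (3.5): x_hat_ij = sum_alpha sqrt(lambda_alpha) v_ia u_ja
X_hat = U[:, :q] @ np.diag(s[:q]) @ Vt[:q, :]

# Property (3.6): the residual sum of squares equals the sum of the excluded eigenvalues
rss = np.sum((X - X_hat) ** 2)
excluded = np.sum(s[q:] ** 2)  # lambda_alpha = s_alpha ** 2
print(np.isclose(rss, excluded))  # True
```

The agreement between `rss` and `excluded` holds for any $$q$$ from $$1$$ to $$p$$, up to floating-point error.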

This property has great practical value. It justifies the use of PCA in data-compression problems (for example, in the reconstitution of images, and also in data transmission).

Application to Image Reconstitution

Image reconstitution, for instance of satellite images, is one of the most interesting applications of principal component analysis. In this case, the data tables tend to be large (one gray-level value per pixel). Applying PCA to such a table allows us to identify the set of eigenvalues that are significant with respect to the irregularities present in the image.

The reconstitution enables an important reduction in storage, because instead of the $$n \times n$$ pixel values one stores only $$q \times (2n + 1)$$ numbers, where $$q$$ is the number of retained axes: for each axis, one eigenvalue plus two eigenvectors of length $$n$$.
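The storage saving can be sketched as follows, assuming Python with NumPy; the "image" here is a synthetic gradient plus noise, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 128, 10  # n x n image, q retained axes

# Synthetic grayscale "image": a smooth gradient (low rank) plus a little noise
image = np.add.outer(np.linspace(0, 1, n), np.linspace(0, 1, n))
image += 0.05 * rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(image, full_matrices=False)

# What is actually stored: q eigenvalues plus 2q vectors of length n each
compressed_storage = q * (2 * n + 1)  # 2570 numbers
original_storage = n * n              # 16384 numbers

# Reconstitution from the stored pieces
reconstructed = U[:, :q] @ np.diag(s[:q]) @ Vt[:q, :]
relative_error = np.sum((image - reconstructed) ** 2) / np.sum(image ** 2)
print(compressed_storage, original_storage)  # 2570 16384
```

Because the smooth gradient is captured by the first few axes, the relative reconstruction error stays small even though only about one sixth of the original numbers are kept.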