## 6.2 Formulas for PCA

From a matrix standpoint, PCA consists of studying a data matrix \(\mathbf{Z}\), endowed with a metric matrix \(\mathbf{I}_p\) defined in \(\mathbb{R}^p\), and another metric \(\mathbf{N}\) defined in \(\mathbb{R}^n\) (generally \(\mathbf{N} = (1/n) \mathbf{I}_n\)).

The matrix \(\mathbf{Z}\) is defined in the following way:

under a normalized PCA: \(\mathbf{Z} = \mathbf{XS}^{-1}\), where \(\mathbf{S}\) is the diagonal matrix of standard deviations.

under a non-normalized PCA: \(\mathbf{Z} = \mathbf{X}\)
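As a concrete illustration, both variants of \(\mathbf{Z}\) take one line of NumPy each (the data matrix `X` below is a made-up example, assumed already centered):

```python
import numpy as np

# Made-up centered data matrix: n = 5 individuals, p = 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)

# S: diagonal matrix of standard deviations (uniform weights p_i = 1/n, hence ddof=0).
S = np.diag(X.std(axis=0, ddof=0))

Z_normalized = X @ np.linalg.inv(S)  # normalized PCA: Z = X S^{-1}
Z_raw = X                            # non-normalized PCA: Z = X
```

After dividing by the standard deviations, every column of `Z_normalized` has zero mean and unit variance.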

The fit in \(\mathbb{R}^p\) leads to the eigen-equation: \(\mathbf{Z^\mathsf{T}NZu} = \lambda \mathbf{u}\), with \(\mathbf{u^\mathsf{T}u} = 1\).

The fit in \(\mathbb{R}^n\) leads to the eigen-equation: \(\mathbf{N}^{1/2} \mathbf{ZZ^\mathsf{T}N}^{1/2} \mathbf{v} = \lambda \mathbf{v}\), with \(\mathbf{v^\mathsf{T}v} = 1\).

The transition relations can be written as:

\[\begin{align*} \mathbf{u} &= \frac{1}{\sqrt{\lambda}} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v} \\ & \\ \mathbf{v} &= \frac{1}{\sqrt{\lambda}} \mathbf{N}^{1/2} \mathbf{Z} \mathbf{u} \end{align*}\]

The symmetric matrix to be diagonalized is \(\mathbf{Z^\mathsf{T}NZ}\). This matrix coincides with the matrix of correlations in the case of a normalized PCA; or with a covariance matrix in the case of a non-normalized PCA.
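A minimal numerical sketch of this fact (all data below are illustrative toy values): diagonalizing \(\mathbf{Z^\mathsf{T}NZ}\) in the normalized case recovers the correlation matrix of \(\mathbf{X}\), and the eigenvalues sum to \(p\).

```python
import numpy as np

# Illustrative centered data with uniform weights N = (1/n) I.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)     # normalized PCA: Z = X S^{-1}

C = Z.T @ N @ Z                   # symmetric matrix to be diagonalized
lam, U = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]    # reorder: largest eigenvalue (inertia) first
```

For a normalized PCA the total inertia equals the number of variables, so `lam.sum()` is \(p\).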

#### Coordinates of Individuals

Regardless of whether we are analyzing active individuals or supplementary ones, the coordinates of individuals are calculated by orthogonally projecting the rows of the data matrix \(\mathbf{Z}\) onto the directions of the eigenvectors \(\mathbf{u}_{\alpha}\).

\[ \boldsymbol{\psi}_{\alpha} = \mathbf{Zu}_{\alpha} = \begin{cases} \mathbf{XS}^{-1}\mathbf{u}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{Xu}_{\alpha} & \text{(non-normalized PCA)} \end{cases} \]

with \(i\)-th element:

\[ \psi_{i \alpha} = \begin{cases} \sum_{j=1}^{p} \frac{x_{ij}}{s_j} u_{j\alpha} & \text{(normalized PCA)} \\ \\ \sum_{j=1}^{p} x_{ij} u_{j\alpha} & \text{(non-normalized PCA)} \end{cases} \]
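Numerically, the coordinates of the individuals are a single matrix product; as a check, the weighted variance of each column of \(\boldsymbol{\psi}\) recovers the corresponding eigenvalue (toy data, normalized PCA):

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights p_i = 1/n.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]        # largest eigenvalue first

psi = Z @ U                           # psi[i, a]: coordinate of individual i on axis a
```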

#### Coordinates of Active Variables

The coordinates of the active variables are obtained by the orthogonal projection of the columns of \(\mathbf{Z}\) onto the directions defined by \(\mathbf{v}_{\alpha}\), with the metric \(\mathbf{N}\).

The projection of the active variables onto an axis \(\alpha\) is given by:

\[ \mathbf{\Phi}_{\alpha} = \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} \mathbf{Z^\mathsf{T}NZu}_{\alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} \mathbf{u}_{\alpha} \]

with the \(j\)-th element:

\[ \phi_{j \alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} u_{j \alpha} \]
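A quick numerical check of \(\mathbf{\Phi}_{\alpha} = \sqrt{\lambda_{\alpha}} \, \mathbf{u}_{\alpha}\) on made-up data: computing \(\mathbf{Z^\mathsf{T}NZu}_{\alpha} / \sqrt{\lambda_{\alpha}}\) gives the same coordinates.

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

phi = np.sqrt(lam) * U    # broadcasting: phi[j, a] = sqrt(lam[a]) * u[j, a]
```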

#### Correlation between Variables and PCs

The correlation between a variable \(\mathbf{x_j}\) and a principal component \(\boldsymbol{\psi}_{\alpha}\) is given by:

\[ cor(j, \alpha) = \sum_{i=1}^{n} p_i \left (\frac{x_{ij}}{s_j} \right ) \left (\frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} \right ) \]

Using matrix notation we have:

\[ \mathbf{cor}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N} (\mathbf{Zu}_{\alpha}) = (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N}^{1/2} \mathbf{v}_{\alpha} \]

\[ \mathbf{cor}_{\alpha} = \begin{cases} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{\Phi}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{S}^{-1} \mathbf{Z^\mathsf{T}N}^{1/2}\mathbf{v}_{\alpha} = \mathbf{S}^{-1} \mathbf{\Phi}_{\alpha} & \text{(non-normalized PCA)} \end{cases} \]

\[ cor(j, \alpha) = \begin{cases} \phi_{j \alpha} & \text{(normalized PCA)} \\ \\ \phi_{j \alpha} / s_j & \text{(non-normalized PCA)} \end{cases} \]
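On toy data, the Pearson correlations between the original variables and the principal components indeed coincide with the \(\phi_{j\alpha}\) of a normalized PCA:

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
phi = np.sqrt(lam) * U

# Cross-correlation block between the p variables and the p components.
R = np.corrcoef(np.hstack([X, psi]), rowvar=False)[:p, p:]
```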

#### Coordinates of Supplementary Variables

The supplementary variables are located by applying the previous rule for computing coordinates. Let \(\mathbf{Z}_{+}\) be the data matrix containing the supplementary variables. Taking into account the transition relations, we have:

\[ \mathbf{\Phi}_{\alpha}^{+} = \mathbf{Z}_{+}^{\mathsf{T}} \mathbf{N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{Z_{+}^{\mathsf{T}} N} \left (\frac{\mathbf{Zu}_{\alpha}}{\sqrt{\lambda_{\alpha}}} \right ) \]

The projection of the supplementary variables is computed from this relation between the coordinate of a variable and the projection of the individuals. In a normalized PCA, this projection is equal to the correlation between the variables and the principal component.

\[ \phi_{j \alpha}^{+} = \begin{cases} \sum_{i} p_i \frac{x_{ij}}{s_j} \frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij} \frac{\psi_{i\alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(non-normalized PCA)} \end{cases} \]
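One way to sanity-check this formula on made-up data: feeding the active columns themselves in as "supplementary" variables must reproduce the active coordinates \(\mathbf{\Phi}_{\alpha}\).

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
Z_sup = Z.copy()                           # pretend the active columns are supplementary
phi_sup = Z_sup.T @ N @ (psi / np.sqrt(lam))
```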

#### Old Unit-Vectors in \(\mathbb{R}^p\)

Let \(\mathbf{e_j}\) be a unit vector of the original basis in \(\mathbb{R}^p\). The projection of this vector onto the new basis is:

\[ \mathbf{e_j}^\mathsf{T} \mathbf{u}_{\alpha} = u_{j\alpha} \]

The elements of vectors \(\mathbf{u}_{\alpha}\) directly provide the projection of the original axes of \(\mathbb{R}^{p}\). Each axis of the original basis indicates the direction of growth of a variable. These directions can be jointly represented with the projection of the individual-points.

#### Distance of Individuals to the Origin

The squared distance of an individual to the origin is the sum of the squares of the values in each row of \(\mathbf{Z}\) (assuming centered data):

\[ d^2(i,G) = \sum_{j=1}^{p} z_{ij}^{2} = \begin{cases} \sum_{j} \left (\frac{x_{ij}}{s_j} \right )^2 & \text{(normalized PCA)} \\ \sum_{j} x_{ij}^{2} & \text{(non-normalized PCA)} \end{cases} \]

This formula works for both active and supplementary individuals.
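Since the eigenvectors form an orthonormal basis, the same squared distance is obtained either from the rows of \(\mathbf{Z}\) or by summing the squared coordinates over all axes (toy normalized-PCA data):

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
d2 = (Z**2).sum(axis=1)    # d^2(i, G) for every individual at once
```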

#### Distance of Variables to the Origin

The distance of a variable to the origin is the sum of the squares of the values in the columns of \(\mathbf{Z}\), taking into account the metric \(\mathbf{N}\):

\[ d^2(j,O) = \sum_{i=1}^{n} p_i \hspace{1mm} z_{ij}^{2} = \begin{cases} \frac{\sum_{i} p_i x_{ij}^{2}}{s_{j}^{2}} = 1 & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij}^{2} = s_{j}^{2} & \text{(non-normalized PCA)} \end{cases} \]
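A numerical check of both cases on made-up data (uniform weights \(p_i = 1/n\), so the weighted sum is just a column mean):

```python
import numpy as np

# Illustrative centered toy data, uniform weights p_i = 1/n.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
Z = X / X.std(axis=0, ddof=0)     # normalized PCA

d2_norm = (Z**2).mean(axis=0)     # normalized PCA: all equal to 1
d2_raw = (X**2).mean(axis=0)      # non-normalized PCA: equal to s_j^2
```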

#### Contribution of Individuals to an Axis’ Inertia

The projected inertia on an axis is: \(\sum_{i=1}^{n} p_i \psi_{i \alpha}^{2} = \lambda_{\alpha}\).

The part of the inertia due to an individual is:

\[ CTR(i, \alpha) = \frac{p_i \psi_{i \alpha}^{2}}{\lambda_{\alpha}} \times 100 \]

this applies to both a normalized and a non-normalized PCA.
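As a sketch on toy data, the contributions of all individuals to a given axis sum to 100:

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights p_i = 1/n.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
CTR = (psi**2 / n) / lam * 100    # CTR[i, a] = p_i * psi[i, a]^2 / lam[a] * 100
```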

#### Squared Cosines of Individuals

The squared cosine of an individual is the squared projection of the individual onto an axis, divided by its squared distance to the origin:

\[ cos^2(i, \alpha) = \frac{\psi_{i \alpha}^{2}}{d^2(i,G)} \]
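Summed over all \(p\) axes, the squared cosines of an individual add up to 1, which gives a quick test on toy data:

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
d2 = (Z**2).sum(axis=1)
cos2 = psi**2 / d2[:, None]    # cos2[i, a] = psi[i, a]^2 / d^2(i, G)
```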

#### Contributions of Variables to the Inertia

The projected inertia onto an axis in \(\mathbb{R}^{n}\) is: \(\lambda_{\alpha} = \sum_{j=1}^{p} \phi_{j\alpha}^{2}\).

The contribution of a variable to the inertia of the axis is:

\[ CTR(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{\lambda_{\alpha}} \times 100 \]

Taking into account the formula to compute the coordinates of the variables:

\[ CTR(j, \alpha) = u_{j\alpha}^{2} \times 100 \]
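Because the eigenvectors have unit norm, these contributions also sum to 100 per axis; a toy-data check:

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]

CTR_var = U**2 * 100                   # CTR(j, a) = u[j, a]^2 * 100
phi = np.sqrt(lam) * U
```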

#### Squared Cosines of Variables

\[ cos^2(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{d^2(j,O)} \]

The distance of a variable to the origin coincides with the standard deviation of the variable under a non-normalized PCA. In turn, when performing a normalized PCA, the distance is equal to 1.

\[ cos^2 (j, \alpha) = cor^2(j, \alpha) \]

#### Coordinates of Categories of Nominal Variables

A category point is the center of gravity of the individuals that have such category:

\[ \bar{\psi}_{k \alpha} = \frac{\sum_{i \in k} p_i \psi_{i \alpha}}{\sum_{i \in k} p_i} \]
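A minimal sketch with a hypothetical nominal variable `cat` (toy data; with uniform weights the weighted centroid reduces to a plain mean over the individuals in the category):

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

cat = np.array([0, 0, 1, 1, 1])        # hypothetical category labels
mask = cat == 1                        # individuals having category k = 1
p_i = np.full(n, 1.0 / n)              # uniform weights p_i

# Weighted center of gravity of the individuals with category k, per axis.
psi_bar = (p_i[mask] @ psi[mask]) / p_i[mask].sum()
```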

#### Distance of Categories to the Origin

\[ d^2(k,O) = \sum_{\alpha = 1}^{p} \bar{\psi}_{k \alpha}^{2} \]

#### V-test of Categories

In a v-test we are interested in calculating the critical probability corresponding to the following hypothesis test:

\[\begin{align*} H_0: & \bar{\psi}_{k \alpha} = 0 \\ H_1: & \bar{\psi}_{k \alpha} > 0 \quad \text{or} \quad \bar{\psi}_{k \alpha} < 0 \end{align*}\]

Under the assumption of random selection of the individuals with category \(k\), we have:

\[\begin{align*} E(\bar{\psi}_{k \alpha}) &= 0 \\ var(\bar{\psi}_{k \alpha}) &= \frac{n - n_k}{n_k - 1} \frac{\lambda_{\alpha}}{n_k} \end{align*}\]

By the central limit theorem, the variable \(\bar{\psi}_{k \alpha}\) will (approximately) follow a normal distribution.

The v-test is the value of the standardized variable \(v_{k\alpha}\) with the same level of significance:

\[ v_{k \alpha} = \frac{\bar{\psi}_{k \alpha}}{\sqrt{\frac{n-n_k}{n_k - 1} \frac{\lambda_{\alpha}}{n_k}}} \]
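Continuing the toy example with a hypothetical category labelling, the v-test of a category on every axis is:

```python
import numpy as np

# Illustrative normalized-PCA setup: centered toy X, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape
N = np.eye(n) / n
Z = X / X.std(axis=0, ddof=0)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

cat = np.array([0, 0, 1, 1, 1])        # hypothetical category labels
mask = cat == 1
n_k = int(mask.sum())
psi_bar = psi[mask].mean(axis=0)       # centroid of category k (uniform weights)

# v-test: centroid divided by its standard deviation under H0.
v = psi_bar / np.sqrt((n - n_k) / (n_k - 1) * lam / n_k)
```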

#### V-test of Continuous Variables

Let \(\bar{x}_{kj}\) be the mean of the variable \(j\) in the group \(k\). We are interested in calculating the critical probability of the following hypothesis test:

\[\begin{align*} H_0: & \mu_{k j} = \bar{x}_{j} \\ H_1: & \mu_{k j} > \bar{x}_{j} \quad \text{or} \quad \mu_{kj} < \bar{x}_{j} \end{align*}\]

Under the null hypothesis, we assume that individuals with category \(k\) are randomly selected:

\[\begin{align*} E(\bar{x}_{kj}) &= \bar{x}_{j} \\ var(\bar{x}_{kj}) &= \frac{n - n_k}{n_k - 1} \frac{s_{j}^{2}}{n_k} = s_{kj}^{2} \end{align*}\]

By the central limit theorem, the variable \(\bar{x}_{kj}\) follows (approximately) a normal distribution.

The v-test is the value of the standardized variable with the same level of significance.

\[ v_{kj} = \frac{\bar{x}_{kj} - \bar{x}_{j}}{\sqrt{\frac{n - n_k}{n_k - 1} \frac{s_{j}^{2}}{n_k}}} \]
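The same computation on toy data, with a hypothetical group labelling (population variances, weights \(p_i = 1/n\)):

```python
import numpy as np

# Illustrative centered toy data; group k is a hypothetical subset of rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X -= X.mean(axis=0)
n, p = X.shape

cat = np.array([0, 0, 1, 1, 1])        # hypothetical group labels
mask = cat == 1
n_k = int(mask.sum())

xbar_k = X[mask].mean(axis=0)          # group means of each variable
xbar = X.mean(axis=0)                  # overall means (zero after centering)
s2 = X.var(axis=0, ddof=0)             # population variances s_j^2

# v-test of each continuous variable for group k.
v = (xbar_k - xbar) / np.sqrt((n - n_k) / (n_k - 1) * s2 / n_k)
```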