## 6.2 Formulas for PCA

From a matrix standpoint, PCA consists of studying a data matrix $$\mathbf{Z}$$, endowed with a metric matrix $$\mathbf{I}_p$$ defined in $$\mathbb{R}^p$$, and another metric $$\mathbf{N}$$ defined in $$\mathbb{R}^n$$ (generally $$\mathbf{N} = (1/n) \mathbf{I}_n$$).

The matrix $$\mathbf{Z}$$ is defined as follows:

• under a normalized PCA: $$\mathbf{Z} = \mathbf{XS}^{-1}$$, where $$\mathbf{X}$$ is the centered data matrix and $$\mathbf{S}$$ is the diagonal matrix of standard deviations.

• under a non-normalized PCA: $$\mathbf{Z} = \mathbf{X}$$
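As a minimal NumPy sketch of the two choices of $$\mathbf{Z}$$ (the data values are hypothetical):

```python
import numpy as np

# Hypothetical data: n = 5 individuals, p = 3 variables
X_raw = np.array([[2.0, 4.0, 1.0],
                  [3.0, 7.0, 2.0],
                  [5.0, 6.0, 4.0],
                  [4.0, 5.0, 3.0],
                  [6.0, 8.0, 5.0]])

X = X_raw - X_raw.mean(axis=0)   # centered data matrix X
S = np.diag(X_raw.std(axis=0))   # diagonal matrix of standard deviations

Z_norm = X @ np.linalg.inv(S)    # normalized PCA: Z = X S^{-1}
Z_raw = X                        # non-normalized PCA: Z = X
```

With this choice, each column of `Z_norm` has zero mean and unit (population) standard deviation.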

The fit in $$\mathbb{R}^p$$ has to do with: $$\mathbf{Z^\mathsf{T}NZu} = \lambda \mathbf{u}$$, with $$\mathbf{u^\mathsf{T}u} = 1$$.

The fit in $$\mathbb{R}^n$$ has to do with: $$\mathbf{N}^{1/2} \mathbf{ZZ^\mathsf{T}N}^{1/2} \mathbf{v} = \lambda \mathbf{v}$$, with $$\mathbf{v^\mathsf{T}v} = 1$$.

The transition relations can be written as:

\begin{align*} \mathbf{u} &= \frac{1}{\sqrt{\lambda}} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v} \\ & \\ \mathbf{v} &= \frac{1}{\sqrt{\lambda}} \mathbf{N}^{1/2} \mathbf{Z} \mathbf{u} \end{align*}

The symmetric matrix to be diagonalized is $$\mathbf{Z^\mathsf{T}NZ}$$. This matrix coincides with the matrix of correlations in the case of a normalized PCA; or with a covariance matrix in the case of a non-normalized PCA.
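A sketch of the diagonalization and of the correlation-matrix identity, assuming uniform weights $$\mathbf{N} = (1/n)\mathbf{I}_n$$ and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)        # normalized PCA
n = X.shape[0]
N = np.eye(n) / n                # metric N = (1/n) I_n

C = Z.T @ N @ Z                  # symmetric matrix to diagonalize
lam, U = np.linalg.eigh(C)       # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]   # reorder: largest eigenvalue first
```

For a normalized PCA, `C` coincides with the correlation matrix of the raw data, and the eigenvalues sum to $$p$$.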

#### Coordinates of Individuals

Regardless of whether we are analyzing active individuals or supplementary ones, the coordinates of individuals are calculated by orthogonally projecting the rows of the data matrix $$\mathbf{Z}$$ onto the directions of the eigenvectors $$\mathbf{u}_{\alpha}$$.

$\boldsymbol{\psi}_{\alpha} = \mathbf{Zu}_{\alpha} = \begin{cases} \mathbf{XS}^{-1}\mathbf{u}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{Xu}_{\alpha} & \text{(non-normalized PCA)} \end{cases}$

with $$i$$-th element:

$\psi_{i \alpha} = \begin{cases} \sum_{j=1}^{p} \frac{x_{ij}}{s_j} u_{j\alpha} & \text{(normalized PCA)} \\ \\ \sum_{j=1}^{p} x_{ij} u_{j\alpha} & \text{(non-normalized PCA)} \end{cases}$
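A sketch of the individuals' coordinates under the same toy setup (hypothetical data, uniform weights $$p_i = 1/n$$):

```python
import numpy as np

# Toy normalized PCA
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)   # diagonalize Z^T N Z
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U    # column alpha holds the coordinates psi_{i,alpha} on axis alpha
```

The weighted sum of squared coordinates on each axis recovers the corresponding eigenvalue (the projected inertia).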

#### Coordinates of Active Variables

The coordinates of the active variables are obtained by the orthogonal projection of the columns of $$\mathbf{Z}$$ onto the directions defined by $$\mathbf{v}_{\alpha}$$ with the metric $$\mathbf{N}$$.

The projection of the active variables on an axis $$\alpha$$ is given by:

$\mathbf{\Phi}_{\alpha} = \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} \mathbf{Z^\mathsf{T}NZu}_{\alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} \mathbf{u}_{\alpha}$

with the $$j$$-th element:

$\phi_{j \alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} u_{j \alpha}$
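A sketch of the variables' coordinates, cross-checked through the transition relation $$\mathbf{v} = \mathbf{N}^{1/2}\mathbf{Zu}/\sqrt{\lambda}$$ (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]

Phi = U * np.sqrt(lam)              # phi_{j,alpha} = sqrt(lam_alpha) u_{j,alpha}

# Transition relation: v = N^{1/2} Z u / sqrt(lam)
N_half = np.eye(n) / np.sqrt(n)
V = N_half @ Z @ U / np.sqrt(lam)
```

With these definitions, `Z.T @ N_half @ V` reproduces `Phi` exactly, as the formula above states.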

#### Correlation between Variables and PCs

The correlation between a variable $$\mathbf{x_j}$$ and a principal component $$\boldsymbol{\psi}_{\alpha}$$ is given by:

$cor(\alpha, j) = \sum_{i=1}^{n} p_i \left (\frac{x_{ij}}{s_j} \right ) \left (\frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} \right )$

Using matrix notation we have:

$\mathbf{cor}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N} (\mathbf{Zu}_{\alpha}) = (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N}^{1/2} \mathbf{v}_{\alpha}$

$\mathbf{cor}_{\alpha} = \begin{cases} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{\Phi}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{S}^{-1} \mathbf{Z^\mathsf{T}N}^{1/2}\mathbf{v}_{\alpha} = \mathbf{S}^{-1} \mathbf{\Phi}_{\alpha} & \text{(non-normalized PCA)} \end{cases}$

$cor(j, \alpha) = \begin{cases} \phi_{j \alpha} & \text{(normalized PCA)} \\ \\ \phi_{j \alpha} / s_j & \text{(non-normalized PCA)} \end{cases}$
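A numerical check of the normalized-PCA case, where the correlations coincide with the variables' coordinates (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]

psi = Z @ U
Phi = U * np.sqrt(lam)

# Correlation of each variable with each principal component
cor = np.array([[np.corrcoef(X[:, j], psi[:, a])[0, 1]
                 for a in range(p)] for j in range(p)])
```

In this normalized setting `cor` and `Phi` agree entry by entry.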

#### Coordinates of Supplementary Variables

The supplementary variables are located by applying the previous rule for computing the coordinates. Let $$\mathbf{Z}_{+}$$ be the data matrix containing the supplementary variables. Taking into account the transition relations we have that:

$\mathbf{\Phi}_{\alpha}^{+} = \mathbf{Z}_{+}^{\mathsf{T}} \mathbf{N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{Z_{+}^{\mathsf{T}} N} \left (\frac{\mathbf{Zu}_{\alpha}}{\sqrt{\lambda_{\alpha}}} \right )$

The projection of the supplementary variables is computed from this relation between the coordinate of a variable and the projection of the individuals. In a normalized PCA, this projection is equal to the correlation between the variables and the principal component.

$\phi_{j \alpha}^{+} = \begin{cases} \sum_{i} p_i \frac{x_{ij}}{s_j} \frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij} \frac{\psi_{i\alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(non-normalized PCA)} \end{cases}$
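A sketch for a single supplementary variable in a normalized PCA; the supplementary variable below is hypothetical, built as a noisy combination of the active ones:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

# Hypothetical supplementary variable, standardized like the active ones
x_plus = X_raw @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=n)
z_plus = (x_plus - x_plus.mean()) / x_plus.std()

# phi_plus[alpha] = sum_i p_i z+_i psi_{i,alpha} / sqrt(lam_alpha)
phi_plus = (z_plus[:, None] * psi).mean(axis=0) / np.sqrt(lam)
```

Because the variable is standardized, each `phi_plus[alpha]` is a correlation and therefore lies in $$[-1, 1]$$.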

#### 6.2.0.1 Old Unit-Vectors in $$\mathbb{R}^p$$

Let $$\mathbf{e_j}$$ be a unit vector of the original basis in $$\mathbb{R}^p$$. The projection of this vector onto the new basis is:

$\mathbf{e_j}^\mathsf{T} \mathbf{u}_{\alpha} = u_{j\alpha}$

The elements of vectors $$\mathbf{u}_{\alpha}$$ directly provide the projection of the original axes of $$\mathbb{R}^{p}$$. Each axis of the original basis indicates the direction of growth of a variable. These directions can be jointly represented with the projection of the individual-points.

#### Distance of Individuals to the Origin

The squared distance of an individual to the origin is the sum of the squares of the values in each row of $$\mathbf{Z}$$ (assuming centered data):

$d^2(i,G) = \sum_{j=1}^{p} z_{ij}^{2} = \begin{cases} \sum_{j} \left (\frac{x_{ij}}{s_j} \right )^2 & \text{(normalized PCA)} \\ \sum_{j} x_{ij}^{2} & \text{(non-normalized PCA)} \end{cases}$

This formula works for both active and supplementary individuals.
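A quick check that, when all $$p$$ axes are kept, the squared coordinates of an individual add up to its squared distance to the origin (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]

d2 = (Z**2).sum(axis=1)   # squared distance of each individual to G
psi = Z @ U               # coordinates on all p axes
```

Since $$\mathbf{U}$$ is orthogonal, `(psi**2).sum(axis=1)` reproduces `d2`.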

#### Distance of Variables to the Origin

The squared distance of a variable to the origin is the sum of the squares of the values in the columns of $$\mathbf{Z}$$, taking into account the metric $$\mathbf{N}$$:

$d^2(j,O) = \sum_{i=1}^{n} p_i \hspace{1mm} z_{ij}^{2} = \begin{cases} \frac{\sum_{i} p_i x_{ij}^{2}}{s_{j}^{2}} = 1 & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij}^{2} = s_{j}^{2} & \text{(non-normalized PCA)} \end{cases}$
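Both cases can be verified numerically (hypothetical data, uniform weights $$p_i = 1/n$$):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)

d2_norm = (Z**2).mean(axis=0)   # normalized PCA: sum_i p_i z_{ij}^2
d2_raw = (X**2).mean(axis=0)    # non-normalized PCA
s2 = X_raw.var(axis=0)          # population variances s_j^2
```

In the normalized case every variable lies at squared distance 1; in the non-normalized case the squared distance is the variance.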

#### 6.2.0.2 Contribution of Individuals to an Axis’ Inertia

The projected inertia on an axis is: $$\sum_{i=1}^{n} p_i \psi_{i \alpha}^{2} = \lambda_{\alpha}$$.

The part of the inertia due to an individual is:

$CTR(i, \alpha) = \frac{p_i \psi_{i \alpha}^{2}}{\lambda_{\alpha}} \times 100$

This applies to both a normalized and a non-normalized PCA.
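A sketch of the contributions, which by construction sum to 100 on each axis (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

ctr = (psi**2 / n) / lam * 100   # CTR(i, alpha) = p_i psi_{i,alpha}^2 / lam_alpha * 100
```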

#### Squared Cosines of Individuals

The squared cosine of an individual is the squared projection of the individual onto an axis, divided by the square of its distance to the origin:

$cos^2(i, \alpha) = \frac{\psi_{i \alpha}^{2}}{d^2(i,G)}$
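Since the squared coordinates over all $$p$$ axes add up to the squared distance, the squared cosines of each individual sum to 1 (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

d2 = (Z**2).sum(axis=1)          # squared distances to the origin
cos2 = psi**2 / d2[:, None]      # cos^2(i, alpha)
```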

#### Contributions of Variables to the Inertia

The projected inertia onto an axis in $$\mathbb{R}^{n}$$ is: $$\lambda_{\alpha} = \sum_{j=1}^{p} \phi_{j\alpha}^{2}$$.

The contribution of a variable to the inertia of the axis is:

$CTR(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{\lambda_{\alpha}} \times 100$

Taking into account the formula to compute the coordinates of the variables:

$CTR(j, \alpha) = u_{j\alpha}^{2} \times 100$
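A check that the two expressions for the variables' contributions agree (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]

Phi = U * np.sqrt(lam)
ctr_vars = Phi**2 / lam * 100    # phi^2 / lam * 100
```

Since $$\phi_{j\alpha}^2 = \lambda_{\alpha} u_{j\alpha}^2$$, this equals `U**2 * 100`, and each axis's contributions sum to 100.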

#### Squared Cosines of Variables

$cos^2(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{d^2(j,O)}$

The distance of a variable to the origin coincides with the standard deviation of the variable under a non-normalized PCA. In turn, when performing a normalized PCA, the distance is equal to 1.

$cos^2 (j, \alpha) = cor^2(j, \alpha)$
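In a normalized PCA the identity is immediate to verify, since $$d^2(j,O) = 1$$ and $$cor(j,\alpha) = \phi_{j\alpha}$$ (hypothetical data, uniform weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]

Phi = U * np.sqrt(lam)
d2_vars = (Z**2).mean(axis=0)            # equals 1 for every variable here
cos2_vars = Phi**2 / d2_vars[:, None]    # cos^2(j, alpha)
```

Over all $$p$$ axes the squared cosines of each variable sum to 1 (the diagonal of the correlation matrix).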

#### Coordinates of Categories of Nominal Variables

A category point is the center of gravity of the individuals that have such category:

$\bar{\psi}_{k \alpha} = \frac{\sum_{i \in k} p_i \psi_{i \alpha}}{\sum_{i \in k} p_i}$
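A sketch of the category centroids; the nominal variable below is hypothetical, and with uniform weights $$p_i = 1/n$$ the weighted centroid reduces to the plain group mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

cats = np.arange(n) % 3          # hypothetical nominal variable, 3 categories
centroids = np.array([psi[cats == k].mean(axis=0) for k in range(3)])
```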

#### Distance of Categories to the Origin

$d^2(k,O) = \sum_{\alpha = 1}^{p} \bar{\psi}_{k \alpha}^{2}$

#### V-test of Categories

In a v-test we are interested in calculating the critical probability corresponding to the following hypothesis test:

\begin{align*} H_0: & \bar{\psi}_{k \alpha} = 0 \\ H_1: & \bar{\psi}_{k \alpha} > 0 \quad \text{or} \quad \bar{\psi}_{k \alpha} < 0 \end{align*}

Under the assumption of random selection of the individuals with category $$k$$, we have:

\begin{align*} E(\bar{\psi}_{k \alpha}) &= 0 \\ var(\bar{\psi}_{k \alpha}) &= \frac{n - n_k}{n_k - 1} \frac{\lambda_{\alpha}}{n_k} \end{align*}

By the central limit theorem, the variable $$\bar{\psi}_{k \alpha}$$ will (approximately) follow a normal distribution.

The v-test is the value of the standardized variable $$v_{k\alpha}$$ with the same level of significance:

$v_{k \alpha} = \frac{\bar{\psi}_{k \alpha}}{\sqrt{\frac{n-n_k}{n_k - 1} \frac{\lambda_{\alpha}}{n_k}}}$
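A sketch of the v-test of one category on every axis; the category labels are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
X = X_raw - X_raw.mean(axis=0)
Z = X / X_raw.std(axis=0)
n, p = Z.shape
lam, U = np.linalg.eigh(Z.T @ Z / n)
lam, U = lam[::-1], U[:, ::-1]
psi = Z @ U

cats = np.arange(n) % 2                  # hypothetical two-category variable
mask = cats == 0
n_k = mask.sum()

psi_bar = psi[mask].mean(axis=0)         # centroid of category k on each axis
var_k = (n - n_k) / (n_k - 1) * lam / n_k
v_test = psi_bar / np.sqrt(var_k)        # |v| > ~2 flags a notable position
```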

#### V-test of Continuous Variables

Let $$\bar{x}_{kj}$$ be the mean of the variable $$j$$ in the group $$k$$. We are interested in calculating the critical probability of the following hypothesis test:

\begin{align*} H_0: & \mu_{k j} = \bar{x}_{j} \\ H_1: & \mu_{k j} > \bar{x}_{j} \quad \text{or} \quad \mu_{kj} < \bar{x}_{j} \end{align*}

Under the null hypothesis, we assume that individuals with category $$k$$ are randomly selected:

\begin{align*} E(\bar{x}_{kj}) &= \bar{x}_{j} \\ var(\bar{x}_{kj}) &= \frac{n - n_k}{n_k - 1} \frac{s_{j}^{2}}{n_k} = s_{kj}^{2} \end{align*}

By the central limit theorem, the variable $$\bar{x}_{kj}$$ follows (approximately) a normal distribution.

The v-test is the value of the standardized variable with the same level of significance.

$v_{kj} = \frac{\bar{x}_{kj} - \bar{x}_{j}}{\sqrt{\frac{n - n_k}{n_k - 1} \frac{s_{j}^{2}}{n_k}}}$
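A sketch of the v-test for a continuous variable; both the variable and the grouping are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x_j = rng.normal(size=50)        # hypothetical continuous variable
cats = np.arange(50) % 2         # hypothetical grouping
mask = cats == 0
n, n_k = x_j.size, mask.sum()

x_bar = x_j.mean()               # overall mean of variable j
x_bar_k = x_j[mask].mean()       # mean of variable j in group k
s2_kj = (n - n_k) / (n_k - 1) * x_j.var() / n_k
v_kj = (x_bar_k - x_bar) / np.sqrt(s2_kj)
```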