6.2 Formulas for PCA

From a matrix standpoint, PCA consists of studying a data matrix $$\mathbf{Z}$$, endowed with a metric matrix $$\mathbf{I}_p$$ defined in $$\mathbb{R}^p$$, and another metric $$\mathbf{N}$$ defined in $$\mathbb{R}^n$$ (generally $$\mathbf{N} = (1/n) \mathbf{I}_n$$).

The matrix $$\mathbf{Z}$$ is defined as follows:

• under a normalized PCA: $$\mathbf{Z} = \mathbf{XS}^{-1}$$, where $$\mathbf{S}$$ is the diagonal matrix of standard deviations.

• under a non-normalized PCA: $$\mathbf{Z} = \mathbf{X}$$

The fit in $$\mathbb{R}^p$$ amounts to solving the eigenproblem $$\mathbf{Z^\mathsf{T}NZu} = \lambda \mathbf{u}$$, with $$\mathbf{u^\mathsf{T}u} = 1$$.

The fit in $$\mathbb{R}^n$$ amounts to solving the eigenproblem $$\mathbf{N}^{1/2} \mathbf{ZZ^\mathsf{T}N}^{1/2} \mathbf{v} = \lambda \mathbf{v}$$, with $$\mathbf{v^\mathsf{T}v} = 1$$.

The transition relations can be written as:

\begin{align*} \mathbf{u} &= \frac{1}{\sqrt{\lambda}} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v} \\ & \\ \mathbf{v} &= \frac{1}{\sqrt{\lambda}} \mathbf{N}^{1/2} \mathbf{Z} \mathbf{u} \end{align*}

The symmetric matrix to be diagonalized is $$\mathbf{Z^\mathsf{T}NZ}$$. This matrix coincides with the matrix of correlations in the case of a normalized PCA; or with a covariance matrix in the case of a non-normalized PCA.
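The two eigenproblems and the transition relations can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the data are simulated, the weights are uniform ($$\mathbf{N} = (1/n) \mathbf{I}_n$$), the PCA is normalized, and all variable names are illustrative rather than part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.normal(size=(n, p))          # simulated data (assumption)
X = X - X.mean(axis=0)               # center the columns
s = X.std(axis=0)                    # standard deviations with 1/n weights
Z = X / s                            # normalized PCA: Z = X S^{-1}
N = np.eye(n) / n                    # metric N = (1/n) I_n
N_half = np.sqrt(N)                  # elementwise sqrt is valid because N is diagonal

# Fit in R^p: diagonalize Z^T N Z (eigh returns ascending eigenvalues)
lam, U = np.linalg.eigh(Z.T @ N @ Z)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Fit in R^n: diagonalize N^{1/2} Z Z^T N^{1/2} and keep the p leading axes
mu, V = np.linalg.eigh(N_half @ Z @ Z.T @ N_half)
order_n = np.argsort(mu)[::-1][:p]
mu, V = mu[order_n], V[:, order_n]

# Transition relations (eigenvectors are only defined up to sign)
V_from_U = N_half @ Z @ U / np.sqrt(lam)
U_from_V = Z.T @ N_half @ V / np.sqrt(mu)
```

Both fits share the same nonzero eigenvalues, and each column of `V_from_U` matches the corresponding column of `V` up to a sign flip, which is why any comparison should use absolute inner products rather than direct equality.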

Coordinates of Individuals

Regardless of whether we are analyzing active individuals or supplementary ones, the coordinates of individuals are calculated by orthogonally projecting the rows of the data matrix $$\mathbf{Z}$$ onto the directions of the eigenvectors $$\mathbf{u}_{\alpha}$$.

$\boldsymbol{\psi}_{\alpha} = \mathbf{Zu}_{\alpha} = \begin{cases} \mathbf{XS}^{-1}\mathbf{u}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{Xu}_{\alpha} & \text{(non-normalized PCA)} \end{cases}$

with $$i$$-th element:

$\psi_{i \alpha} = \begin{cases} \sum_{j=1}^{p} \frac{x_{ij}}{s_j} u_{j\alpha} & \text{(normalized PCA)} \\ \\ \sum_{j=1}^{p} x_{ij} u_{j\alpha} & \text{(non-normalized PCA)} \end{cases}$
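As an illustration, the coordinates of individuals can be computed for both variants with the same routine. This is a sketch on simulated data with uniform weights; the helper `individual_coords` is a hypothetical name introduced here, not something defined in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
# Simulated columns on deliberately different scales (assumption)
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 0.2])
X = X - X.mean(axis=0)               # center the columns
s = X.std(axis=0)                    # standard deviations with 1/n weights
weights = np.full(n, 1.0 / n)        # p_i, the diagonal of N

def individual_coords(Z, weights):
    """Return (Psi, lam, U) where Psi = Z U and Z^T N Z u = lam u."""
    C = (Z * weights[:, None]).T @ Z           # Z^T N Z
    lam, U = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]              # sort axes by decreasing inertia
    lam, U = lam[order], U[:, order]
    return Z @ U, lam, U

Psi_norm, lam_norm, U_norm = individual_coords(X / s, weights)  # normalized PCA
Psi_raw, lam_raw, U_raw = individual_coords(X, weights)         # non-normalized PCA
```

In both cases the weighted sum of squared coordinates on an axis, `(weights[:, None] * Psi**2).sum(axis=0)`, reproduces the eigenvalues, which is a convenient check on the computation.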

Coordinates of Active Variables

The coordinates of the active variables are obtained by the orthogonal projection of the columns of $$\mathbf{Z}$$ onto the directions defined by $$\mathbf{v}_{\alpha}$$ with the metric $$\mathbf{N}$$.

The projections of the active variables on an axis $$\alpha$$ are given by:

$\mathbf{\Phi}_{\alpha} = \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} \mathbf{Z^\mathsf{T}NZu}_{\alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} \mathbf{u}_{\alpha}$

with the $$j$$-th element:

$\phi_{j \alpha} = \sqrt{\lambda_{\alpha}} \hspace{1mm} u_{j \alpha}$

Correlation between Variables and PCs

The correlation between a variable $$\mathbf{x_j}$$ and a principal component $$\boldsymbol{\psi}_{\alpha}$$, where $$p_i$$ denotes the weight of individual $$i$$ (the $$i$$-th diagonal entry of $$\mathbf{N}$$), is given by:

$cor(j, \alpha) = \sum_{i=1}^{n} p_i \left (\frac{x_{ij}}{s_j} \right ) \left (\frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} \right )$

Using matrix notation we have:

$\mathbf{cor}_{\alpha} = \frac{1}{\sqrt{\lambda_{\alpha}}} (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N} (\mathbf{Zu}_{\alpha}) = (\mathbf{XS}^{-1})^\mathsf{T} \mathbf{N}^{1/2} \mathbf{v}_{\alpha}$

$\mathbf{cor}_{\alpha} = \begin{cases} \mathbf{Z^\mathsf{T}N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{\Phi}_{\alpha} & \text{(normalized PCA)} \\ \\ \mathbf{S}^{-1} \mathbf{Z^\mathsf{T}N}^{1/2}\mathbf{v}_{\alpha} = \mathbf{S}^{-1} \mathbf{\Phi}_{\alpha} & \text{(non-normalized PCA)} \end{cases}$

$cor(j, \alpha) = \begin{cases} \phi_{j \alpha} & \text{(normalized PCA)} \\ \\ \phi_{j \alpha} / s_j & \text{(non-normalized PCA)} \end{cases}$
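The relation between variable coordinates and correlations can be verified against an empirical correlation. The sketch below assumes simulated data and uniform weights; `variable_coords` and `empirical_cor` are hypothetical helper names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # simulated correlated columns
X = X - X.mean(axis=0)
s = X.std(axis=0)                                      # 1/n standard deviations
N = np.eye(n) / n

def variable_coords(Z):
    """Return (Phi, Psi): variable coordinates and individual coordinates."""
    lam, U = np.linalg.eigh(Z.T @ N @ Z)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    Phi = U * np.sqrt(lam)        # phi_{j,alpha} = sqrt(lambda_alpha) * u_{j,alpha}
    return Phi, Z @ U

def empirical_cor(X, Psi):
    """Correlation of each column of X with each column of Psi."""
    k = X.shape[1]
    return np.corrcoef(X, Psi, rowvar=False)[:k, k:]

Phi_norm, Psi_norm = variable_coords(X / s)   # normalized PCA
Phi_raw, Psi_raw = variable_coords(X)         # non-normalized PCA

cor_norm = Phi_norm                           # cor(j, alpha) = phi_{j,alpha}
cor_raw = Phi_raw / s[:, None]                # cor(j, alpha) = phi_{j,alpha} / s_j
```

Comparing `cor_norm` with `empirical_cor(X, Psi_norm)` and `cor_raw` with `empirical_cor(X, Psi_raw)` should give agreement up to floating-point error.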

Coordinates of Supplementary Variables

The supplementary variables are located by using the previous rule for the computation of the coordinates. Let $$\mathbf{Z}_{+}$$ be the data matrix containing the supplementary variables. Taking into account the transition relations, we have:

$\mathbf{\Phi}_{\alpha}^{+} = \mathbf{Z_{+}^{\mathsf{T}} N}^{1/2} \mathbf{v}_{\alpha} = \mathbf{Z_{+}^{\mathsf{T}} N} \left (\frac{\mathbf{Zu}_{\alpha}}{\sqrt{\lambda_{\alpha}}} \right )$

The projection of the supplementary variables is computed from this relation between the coordinate of a variable and the projection of the individuals. In a normalized PCA, this projection is equal to the correlation between the variables and the principal component.

$\phi_{j \alpha}^{+} = \begin{cases} \sum_{i} p_i \frac{x_{ij}}{s_j} \frac{\psi_{i \alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij} \frac{\psi_{i\alpha}}{\sqrt{\lambda_{\alpha}}} & \text{(non-normalized PCA)} \end{cases}$
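A small numerical sketch of the supplementary projection under a normalized PCA follows. It assumes uniform weights and a made-up supplementary variable `x_sup` that plays no role in the fit; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))                       # simulated active variables
x_sup = X[:, 0] + 0.5 * rng.normal(size=n)        # hypothetical supplementary variable

# Center and standardize active and supplementary variables (normalized PCA)
X = X - X.mean(axis=0)
Z = X / X.std(axis=0)
x_sup = x_sup - x_sup.mean()
Z_sup = (x_sup / x_sup.std())[:, None]            # n x 1 matrix Z_+

N = np.eye(n) / n
lam, U = np.linalg.eigh(Z.T @ N @ Z)              # fit uses active variables only
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Psi = Z @ U

# Phi_+ = Z_+^T N (Z u_alpha / sqrt(lambda_alpha)), one column per axis
Phi_sup = Z_sup.T @ N @ (Psi / np.sqrt(lam))

# The same rule applied to the active variables gives sqrt(lambda) * u
Phi_active = Z.T @ N @ (Psi / np.sqrt(lam))
```

Here each entry of `Phi_sup` should coincide with the correlation between `x_sup` and the corresponding principal component.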

6.2.0.1 Old Unit-Vectors in $$\mathbb{R}^p$$

Let $$\mathbf{e_j}$$ be a unit vector of the original basis in $$\mathbb{R}^p$$. The projection of this vector onto the new basis is:

$\mathbf{e_j}^\mathsf{T} \mathbf{u}_{\alpha} = u_{j\alpha}$

The elements of vectors $$\mathbf{u}_{\alpha}$$ directly provide the projection of the original axes of $$\mathbb{R}^{p}$$. Each axis of the original basis indicates the direction of growth of a variable. These directions can be jointly represented with the projection of the individual-points.

Distance of Individuals to the Origin

The squared distance of an individual to the origin is the sum of the squares of the values in each row of $$\mathbf{Z}$$ (assuming centered data):

$d^2(i,G) = \sum_{j=1}^{p} z_{ij}^{2} = \begin{cases} \sum_{j} \left (\frac{x_{ij}}{s_j} \right )^2 & \text{(normalized PCA)} \\ \sum_{j} x_{ij}^{2} & \text{(non-normalized PCA)} \end{cases}$

This formula works for both active and supplementary individuals.

Distance of Variables to the Origin

The squared distance of a variable to the origin is the sum of the squares of the values in the corresponding column of $$\mathbf{Z}$$, taking into account the metric $$\mathbf{N}$$:

$d^2(j,O) = \sum_{i=1}^{n} p_i \hspace{1mm} z_{ij}^{2} = \begin{cases} \frac{\sum_{i} p_i x_{ij}^{2}}{s_{j}^{2}} = 1 & \text{(normalized PCA)} \\ \sum_{i} p_i x_{ij}^{2} = s_{j}^{2} & \text{(non-normalized PCA)} \end{cases}$

6.2.0.2 Contribution of Individuals to an Axis’ Inertia

The projected inertia on an axis is: $$\sum_{i=1}^{n} p_i \psi_{i \alpha}^{2} = \lambda_{\alpha}$$.

The part of the inertia due to an individual is:

$CTR(i, \alpha) = \frac{p_i \psi_{i \alpha}^{2}}{\lambda_{\alpha}} \times 100$

This applies to both a normalized and a non-normalized PCA.

Squared Cosines of Individuals

The squared cosine of an individual is the square of its projection onto an axis, divided by the square of its distance to the origin:

$cos^2(i, \alpha) = \frac{\psi_{i \alpha}^{2}}{d^2(i,G)}$
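The contributions and squared cosines of individuals can be computed in a few lines. This sketch assumes simulated data, uniform weights, and a normalized PCA; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 4
X = rng.normal(size=(n, p))                  # simulated data (assumption)
X = X - X.mean(axis=0)
Z = X / X.std(axis=0)                        # normalized PCA
weights = np.full(n, 1.0 / n)                # p_i

lam, U = np.linalg.eigh((Z * weights[:, None]).T @ Z)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Psi = Z @ U                                  # coordinates of individuals

# CTR(i, alpha) = 100 * p_i * psi_{i,alpha}^2 / lambda_alpha
ctr_ind = 100.0 * weights[:, None] * Psi**2 / lam

# d^2(i, G) = sum_j z_ij^2 and cos^2(i, alpha) = psi_{i,alpha}^2 / d^2(i, G)
dist2_ind = (Z**2).sum(axis=1)
cos2_ind = Psi**2 / dist2_ind[:, None]
```

Two checks follow from the definitions: the contributions on each axis add up to 100, and the squared cosines of each individual add up to 1 when all $$p$$ axes are kept.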

Contributions of Variables to the Inertia

The projected inertia onto an axis in $$\mathbb{R}^{n}$$ is: $$\lambda_{\alpha} = \sum_{j=1}^{p} \phi_{j\alpha}^{2}$$.

The contribution of a variable to the inertia of the axis is:

$CTR(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{\lambda_{\alpha}} \times 100$

Taking into account the formula to compute the coordinates of the variables:

$CTR(j, \alpha) = u_{j\alpha}^{2} \times 100$

Squared Cosines of Variables

$cos^2(j, \alpha) = \frac{\phi_{j\alpha}^{2}}{d^2(j,O)}$

Under a non-normalized PCA, the distance of a variable to the origin coincides with the standard deviation of the variable; under a normalized PCA, the distance is equal to 1. In both cases, the squared cosine equals the squared correlation between the variable and the principal component:

$cos^2 (j, \alpha) = cor^2(j, \alpha)$
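The variable-side quantities can be checked in the non-normalized case, where the distance to the origin is not trivially 1. The sketch below assumes simulated data and uniform weights, with illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.normal(size=(n, p)) * np.array([1.0, 3.0, 0.5])  # simulated, unequal scales
X = X - X.mean(axis=0)
s = X.std(axis=0)                            # 1/n standard deviations
weights = np.full(n, 1.0 / n)
Z = X                                        # non-normalized PCA

lam, U = np.linalg.eigh((Z * weights[:, None]).T @ Z)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Phi = U * np.sqrt(lam)                       # coordinates of the variables

# CTR(j, alpha) = 100 * phi_{j,alpha}^2 / lambda_alpha = 100 * u_{j,alpha}^2
ctr_var = 100.0 * Phi**2 / lam

# d^2(j, O) = sum_i p_i z_ij^2, equal to s_j^2 here
dist2_var = (weights[:, None] * Z**2).sum(axis=0)
cos2_var = Phi**2 / dist2_var[:, None]

# Squared correlations, using cor(j, alpha) = phi_{j,alpha} / s_j
cor2 = (Phi / s[:, None])**2
```

With these definitions `cos2_var` and `cor2` agree entry by entry, and `ctr_var` reduces to `100 * U**2`.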

Coordinates of Categories of Nominal Variables

A category point is the center of gravity of the individuals that belong to that category:

$\bar{\psi}_{k \alpha} = \frac{\sum_{i \in k} p_i \psi_{i \alpha}}{\sum_{i \in k} p_i}$

Distance of Categories to the Origin

$d^2(k,O) = \sum_{\alpha = 1}^{p} \bar{\psi}_{k \alpha}^{2}$
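As an illustration, category points can be obtained as weighted averages of the individuals' coordinates. The sketch assumes simulated data, uniform weights, and a made-up nominal variable `labels`; the helper `category_coords` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 3
X = rng.normal(size=(n, p))                  # simulated data (assumption)
X = X - X.mean(axis=0)
Z = X / X.std(axis=0)                        # normalized PCA
weights = np.full(n, 1.0 / n)                # p_i
labels = np.arange(n) % 3                    # hypothetical nominal variable, 3 categories

lam, U = np.linalg.eigh((Z * weights[:, None]).T @ Z)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Psi = Z @ U                                  # coordinates of individuals

def category_coords(Psi, weights, labels):
    """Weighted center of gravity of the individuals in each category."""
    cats = np.unique(labels)
    coords = np.array([
        np.average(Psi[labels == k], axis=0, weights=weights[labels == k])
        for k in cats
    ])
    return cats, coords

cats, Psi_bar = category_coords(Psi, weights, labels)
dist2_cat = (Psi_bar**2).sum(axis=1)         # d^2(k, O), summed over all axes
mass = np.array([weights[labels == k].sum() for k in cats])
```

Because the individuals' coordinates are centered, the category points weighted by their masses average to the origin on every axis.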

V-test of Categories

In a v-test we are interested in calculating the critical probability corresponding to the following hypothesis test:

\begin{align*} H_0: & \bar{\psi}_{k \alpha} = 0 \\ H_1: & \bar{\psi}_{k \alpha} > 0 \quad \text{or} \quad \bar{\psi}_{k \alpha} < 0 \end{align*}

Under the assumption that the $$n_k$$ individuals with category $$k$$ are selected at random, without replacement, from the $$n$$ individuals, we have:

\begin{align*} E(\bar{\psi}_{k \alpha}) &= 0 \\ var(\bar{\psi}_{k \alpha}) &= \frac{n - n_k}{n - 1} \frac{\lambda_{\alpha}}{n_k} \end{align*}

By the central limit theorem, the variable $$\bar{\psi}_{k \alpha}$$ will (approximately) follow a normal distribution.

The v-test is the value of the standardized variable $$v_{k\alpha}$$ with the same level of significance:

$v_{k \alpha} = \frac{\bar{\psi}_{k \alpha}}{\sqrt{\frac{n-n_k}{n - 1} \frac{\lambda_{\alpha}}{n_k}}}$

V-test of Continuous Variables

Let $$\bar{x}_{kj}$$ be the mean of the variable $$j$$ in the group $$k$$. We are interested in calculating the critical probability of the following hypothesis test:

\begin{align*} H_0: & \mu_{k j} = \bar{x}_{j} \\ H_1: & \mu_{k j} > \bar{x}_{j} \quad \text{or} \quad \mu_{kj} < \bar{x}_{j} \end{align*}

Under the null hypothesis, we assume that individuals with category $$k$$ are randomly selected:

\begin{align*} E(\bar{x}_{kj}) &= \bar{x}_{j} \\ var(\bar{x}_{kj}) &= \frac{n - n_k}{n - 1} \frac{s_{j}^{2}}{n_k} = s_{kj}^{2} \end{align*}

By the central limit theorem, the variable $$\bar{x}_{kj}$$ follows (approximately) a normal distribution.

The v-test is the value of the standardized variable with the same level of significance:

$v_{kj} = \frac{\bar{x}_{kj} - \bar{x}_{j}}{\sqrt{\frac{n - n_k}{n - 1} \frac{s_{j}^{2}}{n_k}}}$
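The variance of a group mean under random selection can be checked exactly on a small example by enumerating every possible group. The sketch below assumes uniform weights and made-up values. It uses the finite-population correction $$(n - n_k)/(n - 1)$$, the standard variance of a mean when $$n_k$$ individuals are drawn without replacement from $$n$$; the helper `v_test` is a hypothetical name.

```python
import itertools
import numpy as np

x = np.array([2.0, 3.5, 1.0, 4.0, 6.5, 5.0, 2.5, 3.0])  # made-up values of variable j
n, n_k = len(x), 3
s2 = x.var()                                  # s_j^2 with 1/n weights

# Mean of every possible group of n_k individuals drawn without replacement
subset_means = np.array([
    x[list(idx)].mean() for idx in itertools.combinations(range(n), n_k)
])
exact_var = subset_means.var()                # exact variance over all groups

# Variance with the finite-population correction (n - n_k) / (n - 1)
formula_var = (n - n_k) / (n - 1) * s2 / n_k

def v_test(group_mean, overall_mean, s2, n, n_k):
    """Standardized gap between a group mean and the overall mean."""
    return (group_mean - overall_mean) / np.sqrt((n - n_k) / (n - 1) * s2 / n_k)

# Example: treat the first n_k individuals as the group with category k
v = v_test(x[:n_k].mean(), x.mean(), s2, n, n_k)
```

The enumeration gives `exact_var` equal to `formula_var` up to rounding, and the mean of `subset_means` equals the overall mean, matching the expectation stated under the null hypothesis. Comparing `v` with standard normal quantiles then yields the critical probability.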