3.3 Validation: stability and significance

What is the part in PCA results that is not really accounted by the structure of the data, but by the randomness in the data? Are the results stables and reproducibles? Do the configuration of points change based on the studied data? All of these questions make it necessary to assess the stability of the obtained results.

The stability of the results will depend on the randomness of the data collection process (e.g. random samples, sampling surveys), as well as on the measurement errors in the variables.

3.3.1 How many axes to study and retain?

Are the directions of the first axes well defined and stable? More precisely, are the dispersions in consecutive directions really different? If not, we would have to consider that the factorial plane formed by them is stable but the associated axes are not really different (i.e. indeterminate by a rotation).

One way to answer these questions is to suppose that the data come from a sample drawn from a population with a normal distribution. In this case, the eigenvalues asymptotically follow a normal distribution (Anderson, 1963). Then, we can estimate a 95% confidence interval for each eigenvalue with the formula (3.1)

$\left [ \lambda_{\alpha} \left (1 - 1.96 \sqrt{2/(n-1)} \right ); \hspace{1mm} \lambda_{\alpha} \left (1+1.96\sqrt{2/(n-1)} \right) \right ] \tag{3.1}$

The width of this interval gives us an idea of the stability of the eigenvalue with respect to the sample randomness. The overlapping in the intervals of two consecutive eigenvalues suggests that these eigenvalues are equal (or very similar). The corresponding axes are thus indeterminate by one rotation. Under this situation, the analyst should focus on the interpretation of the subspace defined by the first eigenvalues that are well separated.

Although this result has to do with eigenvalues of covariance matrices, it can also be applied to the eigenvalues of correlation matrices. Simulation studies show that the confidence intervals tend to be “cautious”: the coverage percentage of the true eigenvalue, is almost always greater than the anounced confidence level. In any case, the asymptotic nature of the results, and the underlying hypothesis of normality, lead us to consider the results are merely indicative (not a hard rule).

In regards of the factorial axes, it is convenient to distinguish the axes that will be studied, from the axes that will be used. The factorial axes can be seen as an ultimate result, or also as an intermediate stage for further studies.

For example, a PCA can be a preliminary stage before performing a discriminant analysis. In this case, we will try to use the axes with discriminant power, which may not coincide with the axes of largest spread.

If the goal is to classify individuals, it makes sense to retain only the axes expressing real directions of spread, in order to preserve the stable characteristics of the individuals, while excluding those directions that are mainly capturing random noise.

Scree Test (Cattell’s rule, 1966)

One of the most prevalent questions in PCA is “how many principal components (or factorial axes) to retain?” Unfortunately, there is no simple answer to this question.

If we assume that the $$n$$ values taken by the $$p$$ variables come from a random process that uniformly fills up the space, without privileging any direction, then the $$p$$ eigenvalues of the PCA will slowly decrease in a regular form.

If a PCA provides a histogram of the eigenvalues showing one or more staircase steps, we can think that there are sufficiently strong associations between the variables. These associations would be responsible for the appearance of directions or subspaces where most of the dispersion is concentrated.

Such pragmatic considerations, can be used to determine—in a more or less subjective way—a minimum and a maximum number of axes to retain in the analysis. The main way to do this is through visual inspection of the histogram of eigenvalues following the so-called scree test or elbow criteria proposed by Raymond Cattell (1966). This criteria, which is the simplest and oldest one, involves graphing a line plot of the eigenvalues, ordered from largest to smallest, and then look for the “elbow” of the graph where the eigenvalues seem to level off.

In the example of the cities (first PCA), we obtained the following eigenvalues:

Table 3.1: Distribution of eigenvalues in 1st PCA.
num eigenvalues percentage cumulative
1 10.1390 84.49 84.49
2 0.8612 7.18 91.67
3 0.3248 2.71 94.37
4 0.1715 1.43 95.80
5 0.1484 1.24 97.04
6 0.0973 0.81 97.85
7 0.0682 0.57 98.42
8 0.0525 0.44 98.86
9 0.0505 0.42 99.28
10 0.0332 0.28 99.55
11 0.0309 0.26 99.81
12 0.0226 0.19 100.00

We can then plot a histogram of the eigenvales, and add a line connecting the heights of the bars to better see the way in which the sizes of the eigenvalues decrease: In the second PCA of the salaries divided by the mean salary of a city, we obtained the following eigenvalues:

Table 3.2: Distribution of eigenvalues in 2nd PCA.
num eigenvalues percentage cumulative
1 4.4910 37.43 37.43
2 1.7148 14.29 51.72
3 1.2989 10.82 62.54
4 1.0396 8.66 71.20
5 0.8699 7.25 78.45
6 0.7831 6.53 84.98
7 0.5309 4.42 89.40
8 0.3874 3.23 92.63
9 0.3210 2.67 95.31
10 0.2561 2.13 97.44
11 0.2021 1.68 99.12
12 0.1052 0.88 100.00

Graphing the scree plot we obtain the following display: More formally, Cattell’s criteria consist of sorting the lagged differences of second order between eigenvalues, as follows:

$\begin{equation} d(\alpha) = (\lambda_{\alpha + 1} - \lambda_{\alpha}) - (\lambda_{\alpha} - \lambda_{\alpha - 1}) \tag{3.2} \end{equation}$

The reason why is called scree test has to do with the metaphor of a mountain scree. According to wikipedia, a “scree is a collection of broken rock fragments at the base of crags, mountain cliffs, volcanoes or valley shoulders that has accumulated through periodic rockfall from adjacent cliff faces.” Note

We have seen that when there is a size effect in the first axis, the subsequent eigenvalues are affected and reduced. However, it is possible that subsequent eigenvalues reflect structural oppositions. This is the case of the second PCA on raw data, which corresponds approximately, to the first axis of the analysis on the ratio data, when the size effect is eliminated.

On the other hands, it is risky to interpret the percentage of inertia as a measure of the information contained in an axis. This percentage can be made as small as possible, just by adding independent random variables to the data of active variables. The overall inertia will increase, while the “information” contained in the first axes will remain the same and, consequently, the percentage of inertia in each axis will decrease.

Table 3.3: Distribution of eigenvalues from data with random perturbations.
num eigenvalue percentage cumulative
1 1.7994 15.00 15.00
2 1.5473 12.89 27.89
3 1.4034 11.69 39.58
4 1.2329 10.27 49.86
5 1.1123 9.27 59.13
6 1.0635 8.86 67.99
7 0.8877 7.40 75.39
8 0.7653 6.38 81.76
9 0.7059 5.88 87.65
10 0.6000 5.00 92.65
11 0.5414 4.51 97.16
12 0.3410 2.84 100.00 3.3.2 Simulations, random effects on individuals

One way to assess the stability of results involves using the available information in the data, via computational methods to run some simulations. By following this type of approaches, we are able to free ourselves from the probabilistic assupmtions about the data, which are seldom met when dealing with multivariate data.

The strategy that we use is based on random perturbations of the data, in order to simulate a certain natural variability or measurement error in the observations.

Each observation in the data matrix is replaced by the observed value plus a random quantity that follows a normal distribution with mean and variance depending on the variable under modification.

We denote this change of value as:

$x_{ij} = x_{ij} + N(x_{ij}, Ks_j) \tag{3.3}$

The observed value $$x_{ij}$$ is modified by adding a random quantity that follows a normal distribution, centered at $$x_{ij}$$, and with standard deviation $$K$$ times the standard deviation $$s_j$$ of variable $$j$$.

The value of the constant $$K$$ determines the amount of perturbation that we introduce in the data. $$K=0$$ indicates that the observations remain unchanged. A value of $$K=1$$ means that each observation is affected, on average, one standard deviation.

Once we have modified the data table, we can perform a PCA, calculate its directions, the correlation of the extracted directions with the original-unmodified variables, obtaining a matrix of correlations between axis systems.

In this matrix we will inspect, for each original axis, what other modified axes are most correlated with. We will also check if an axis is correlated with all other axes in analogous way. In the former case, this indicates that an axis is stable, despite the random modifications in the data. In the latter case, this indicates that an axis is the result of randomness in the data.

By looking at the matrix of correlations described in the previous paragraph, we can detect up to what extent the rank of the axes are stables, and from what point the “natural” random fluctuations in data begin.

In the example of the international cities, we show in table (TABLE 3.4) the correlation matrices between the axes (in rows) obtained in the analysis of ratios (salaries of professions with respect to the mean salary of the city) and the axes obtained with a random perturbation of 1%, 5% and 10% of the standard deviation of each variable.

Table 3.4: Assessment of random perturbations (rows correspond to axes of 2nd PCA).
Variables (perturbation 1%) F1 F2 F3 F4 F5
Factorial axis 1 (2nd PCA) 1.00 0.01 0.00 0.00 0.00
Factorial axis 2 (2nd PCA) -0.01 0.99 -0.05 0.12 -0.06
Factorial axis 3 (2nd PCA) 0.00 0.04 0.99 0.04 -0.04
Factorial axis 4 (2nd PCA) 0.00 -0.08 -0.02 0.91 0.37
Factorial axis 5 (2nd PCA) 0.00 -0.10 -0.06 0.35 -0.89

Perturbation of 5%

Variables (perturbation 5%) F1 F2 F3 F4 F5
Factorial axis 1 (2nd PCA) 0.99 0.08 -0.03 0.01 -0.03
Factorial axis 2 (2nd PCA) -0.08 0.87 -0.06 0.20 -0.20
Factorial axis 3 (2nd PCA) 0.01 0.13 0.85 0.24 0.34
Factorial axis 4 (2nd PCA) 0.01 0.06 0.36 -0.57 -0.52
Factorial axis 5 (2nd PCA) 0.03 0.07 0.07 -0.48 0.47

Perturbation of 10%

Variables (perturbation 10%) F1 F2 F3 F4 F5
Factorial axis 1 (2nd PCA) -0.64 0.72 -0.11 -0.05 0.01
Factorial axis 2 (2nd PCA) -0.39 -0.31 -0.37 0.09 -0.32
Factorial axis 3 (2nd PCA) -0.07 0.01 0.24 0.32 0.68
Factorial axis 4 (2nd PCA) 0.11 -0.01 -0.54 -0.14 0.24
Factorial axis 5 (2nd PCA) 0.14 0.30 0.15 0.04 -0.29

By looking at the diagonal of the tables, we observe stability in the first factor, as well as in the second and the third factors, up to a random perturbation of 5% of the original standard deviation. With a high perturbation (of 10%) only the first factor is resistant to the modifications.

Table 3.5 displays the mean and standard deviation of the salary variables (gross salary divided by the city-mean salary), as well as the correlation between the randomly modified variable and the original variable. We can tell that with a random perturbation of 10% the standard deviations increase, while the correlations decrease.

Table 3.5: Summary Statistics of Active Variables affected by random perturbations.
Variable Summary Statistics Original P1% P5% P10%
teacher mean 1.19 1.18 1.18 1.17
teacher std deviation 0.37 0.38 0.44 0.54
teacher correlation - 0.99 0.94 0.71
bus_driver mean 1.04 1.04 1.03 1.05
bus_driver std deviation 0.25 0.26 0.27 0.35
bus_driver correlation - 1.00 0.96 0.60
mechanic mean 0.96 0.96 0.96 1.00
mechanic std deviation 0.24 0.24 0.27 0.52
mechanic correlation - 0.99 0.89 0.46
construction_worker mean 0.72 0.73 0.73 0.72
construction_worker std deviation 0.27 0.26 0.26 0.30
construction_worker correlation - 0.99 0.94 0.71
metal_worker mean 1.17 1.16 1.16 1.16
metal_worker std deviation 0.22 0.23 0.26 0.30
metal_worker correlation - 0.99 0.86 0.79
cook_chef mean 1.40 1.40 1.40 1.38
cook_chef std deviation 0.61 0.63 0.61 0.58
cook_chef correlation - 1.00 0.99 0.95
departmental_head mean 2.63 2.62 2.61 2.53
departmental_head std deviation 1.31 1.31 1.34 1.44
departmental_head correlation - 1.00 0.98 0.95
engineer mean 2.12 2.12 2.07 2.04
engineer std deviation 0.75 0.76 0.77 0.84
engineer correlation - 1.00 0.96 0.90
bank_clerk mean 1.51 1.52 1.48 1.48
bank_clerk std deviation 0.61 0.61 0.63 0.71
bank_clerk correlation - 1.00 0.98 0.90
executive_secretary mean 1.13 1.13 1.13 1.14
executive_secretary std deviation 0.28 0.27 0.27 0.35
executive_secretary correlation - 0.99 0.98 0.96
salesperson mean 0.76 0.76 0.75 0.78
salesperson std deviation 0.16 0.16 0.16 0.18
salesperson correlation - 0.99 0.97 0.52
textile_worker mean 0.68 0.68 0.70 0.66
textile_worker std deviation 0.17 0.18 0.20 0.27
textile_worker correlation - 0.99 0.87 0.72

3.3.3 Bootstrap Simulations

Another way of empirical assessment can be done using random resampling methods on the data. The idea is to obtain a number of data tables, all of the same dimension as the original one, by randomly sampling with replacement the observations in the data. This approach is the so-called Bootstrap method (Efron et al, 1993). Following this approach, it is possible to estimate the sampling errors and the distribution of the various PCA results.

How to implement the bootstrap method? First, we form a large number of samples of $$n$$ individuals which are drawn with replacement from the $$n$$ original individuals in the data. This set of samples is referred to as the bootstrap samples. For each bootstrap sample, some of the original individuals won’t be part of the sample, while some individuals may appear more than once in the sample. Each bootstrap sample gives place to a data table.

On each of the bootstrap tables we calculate its eigenvalues and eigenvectors. We then obtain a bootstrap distribution of the eigenvalues, as well as the bootstrap distribution of the correlations between the eigenvectors and the original axes.

For each eigenvalue we can obtain a confidence interval. Likewise, for each eigenvector we can obtain a confidence cone around the original eigenvector. Examining the correlation between the axes can then reveal potential rotations among axes.

The bootstrap simulations can also be used to assess the stability of the projections of the variables and the categories. We can position the different bootstrap tables as supplementary information in the analysis of the original table (Lebart et al, 1995). In this way, it is possible to visualize in the factorial planes regions of “natural” fluctuation of the different elements in the data table.