An Exploration of Pathologies of Multilevel Principal Components Analysis in Statistical Models of Shape

3D facial surface imaging is a useful tool in dentistry and in terms of diagnostics and treatment planning. Between-group PCA (bgPCA) is a method that has been used to analyse shapes in biological morphometrics, although various “pathologies” of bgPCA have recently been proposed. Monte Carlo (MC) simulated datasets were created here in order to explore “pathologies” of multilevel PCA (mPCA), where mPCA with two levels is equivalent to bgPCA. The first set of MC experiments involved 300 uncorrelated normally distributed variables, whereas the second set of MC experiments used correlated multivariate MC data describing 3D facial shape. We confirmed results of numerical experiments from other researchers that indicated that bgPCA (and so also mPCA) can give a false impression of strong differences in component scores between groups when there is none in reality. These spurious differences in component scores via mPCA decreased significantly as the sample sizes per group were increased. Eigenvalues via mPCA were also found to be strongly affected by imbalances in sample sizes per group, although this problem was removed by using weighted forms of covariance matrices suggested by the maximum likelihood solution of the two-level model. However, this did not solve problems of spurious differences between groups in these simulations, which was driven by very small sample sizes in one group. As a “rule of thumb” only, all of our experiments indicate that reasonable results are obtained when sample sizes per group in all groups are at least equal to the number of variables. Interestingly, the sum of all eigenvalues over both levels via mPCA scaled approximately linearly with the inverse of the sample size per group in all experiments. Finally, between-group variation was added explicitly to the MC data generation model in two experiments considered here. Results for the sum of all eigenvalues via mPCA predicted the asymptotic amount for the total amount of variance correctly in this case, whereas standard “single-level” PCA underestimated this quantity.


Introduction
Geometric morphometrics aims to provide a mathematical description of biological shapes [1][2][3][4][5]. Three-dimensional (3D) surface scanning [6] is a technique that allows one to capture the 3D shape, e.g., of the human face, as shown in Figure 1 for the author's face. It is a useful tool in understanding dental and maxillofacial diagnostics, treatment planning, and the effects of treatment [7]. Such biological shapes may be described by a set of landmark points, illustrated also in Figure 1. Methods such as Procrustes transformation [1] are often used to standardise centring, orientation, and scale in a dataset of such 3D shapes.
Multivariate data contains more than one "outcome" variable, such as the x-, y-and zcomponents of the Cartesian landmark points (again, as shown in Figure 1). These variables tend to be highly correlated and so multivariate statistical methods such as principal components analysis (PCA) [1] are needed to in order to analyse such data. Betweengroup PCA (bgPCA) [8,9] is an extension of standard PCA that carries out separate PCAs on (between-group) covariance matrices based on "group means" and (within group) covariance matrices based on individual shapes around these means. It has much in on (between-group) covariance matrices based on "group means" and (within group) covariance matrices based on individual shapes around these means. It has much in common with (though it is not the same as) canonical variates analysis/linear discriminant analysis [10]. Multilevel PCA (mPCA) has been used by us [11][12][13][14][15][16][17] to analyse 3D facial shapes obtained from 3D facial scans; note that two-level multilevel PCA (mPCA) is equivalent to bgPCA. mPCA has been used by us to investigate changes by ethnicity and sex [11,12], the act of smiling [13,14], facial shape changes in adolescents due to age [15,16], and the effects of maternal smoking and alcohol consumption on the facial shape of English adolescents [17]. Recent articles [8,9,[18][19][20] have pointed out a number of "pathologies" in techniques such as bgPCA (and therefore also mPCA). Perhaps the most notable pathology [8,9] is that spurious conclusions about differences between groups can occur when the number of parameters in the model is much larger than sample sizes used in the model. Another limitation occurs when sample sizes are not balanced between groups [8]. Here, we wish to explore these pathologies by carrying out Monte Carlo simulated experiments firstly using uncorrelated normally distributed variables and secondly for correlated multivariate normally distributed data based on "real" data using the 21 landmark points in Figure 1.

Monte Carlo Data Generation
Data is generated via Monte Carlo techniques and a number of "experiments" are carried out here, as shown in Table 1. For Experiment 1, = 300 uncorrelated standard normal distribution (i.e., mean = 0 and standard deviation = 1) are used, exactly as in [8]. Data for each variable is generated using the randn() command in MATLAB (R2021a). The number of groups is set to be equal to 3 in the results presented here, and the sample size per group is varied explicitly, where indicates a specific group. The overall sample size for groups is given by = ∑ . Note that the limits are placed on the magnitude of eigenvalues via PCA by the Marchenko-Pastur theorem [21] for such random (uncorrelated) data such that eigenvalues must lie between ( − 1) and ( + 1) , where = / in the limit that both and tend to infinity. For Experiment 2, 300 uncorrelated standard variables are again used in these simulations. In order to explore imbalances in sample sizes, the sample size per group in group 3 is set to be n = 10, whereas the sample size per group , for the two other groups is varied explicitly. There is no be- Recent articles [8,9,[18][19][20] have pointed out a number of "pathologies" in techniques such as bgPCA (and therefore also mPCA). Perhaps the most notable pathology [8,9] is that spurious conclusions about differences between groups can occur when the number of parameters in the model is much larger than sample sizes used in the model. Another limitation occurs when sample sizes are not balanced between groups [8]. Here, we wish to explore these pathologies by carrying out Monte Carlo simulated experiments firstly using uncorrelated normally distributed variables and secondly for correlated multivariate normally distributed data based on "real" data using the 21 landmark points in Figure 1.

Monte Carlo Data Generation
Data is generated via Monte Carlo techniques and a number of "experiments" are carried out here, as shown in Table 1. For Experiment 1, p = 300 uncorrelated standard normal distribution (i.e., mean = 0 and standard deviation = 1) are used, exactly as in [8]. Data for each variable is generated using the randn() command in MATLAB (R2021a). The number of groups is set to be equal to 3 in the results presented here, and the sample size per group n l is varied explicitly, where l indicates a specific group. The overall sample size for m groups is given by n = m ∑ l n l . Note that the limits are placed on the magnitude of eigenvalues via PCA by the Marchenko-Pastur theorem [21] for such random (uncorrelated) data such that eigenvalues must lie between √ y − 1 2 and √ y + 1 2 , where y = p/n in the limit that both p and n tend to infinity. For Experiment 2, 300 uncorrelated standard variables are again used in these simulations. In order to explore imbalances in sample sizes, the sample size per group in group 3 is set to be n = 10, whereas the sample size per group n 1,2 for the two other groups is varied explicitly. There is no between-group variation in Experiments 1 and 2, and so the total variance over all 300 independent/uncorrelated variables is also equal to 300. For Experiment 3, 300 uncorrelated standard normally distributed variables are used at level 2 of the model in Figure 2, although between-group variation is allowed in this case explicitly. A constant offset for each group of subjects is added to all variables, where this offset itself follows a normal distribution with mean = 0 and standard deviation = 0.25 (here) and is independent of the "within group" source of variations. For each variable, between-group variance equals 0.25 2 = 0.0625 and "within-group" variance equals 1. The total variance over all 300 mutually independent variables is equal to: 300 × 1.0625 = 318.75. The percentage of the total variance explained due by between-group variation (Equation (3) below) in Experiment 3 is given by 5.7% (= 100 × 18.75/318.75), whereas this percentage is clearly 0% in Experiments 1 and 2. 100 MC datasets are used in these simulations for Experiments 1 to 3, except for when the sample size per group was equal to 300 (where only 50 MC datasets were used due to computational demands).

Multilevel Principal Components Analysis (mPCA)
Features (i.e., 300 variables here for Experiments 1 to 3 and 63 components for Experiments 4 and 5) are represented by a vector . Single-level PCA is carried out by finding the mean vector over all data points and a covariance matrix given by and indicate elements of this covariance matrix and refers to a given data point in the set. The eigenvalues and (orthonormal) eigenvectors of this matrix are found readily. Note that the rank of this covariance matrix (and so also the number of non-zero eigenvalues) is limited to − 1. For PCA, one ranks all the eigenvalues into descending order, and one retains the first components in the model. The vector is modeled by The coefficients { } (also referred to here as "component scores") are found readily by using a scalar product with respect to the set of orthonormal eigenvectors, i.e., = • ( − ), for a fit of the model to a new vector .
The mPCA model used here is illustrated schematically in Figure 2. Note that separate covariance matrices are found at levels 1 and 2 for mPCA. Group means at level 2 are denoted and the covariance matrix at level 2 is just the average of all of the "local" covariance matrices for each group . (The rank of each of these covariance matrices is limited to − 1.) The overall "grand mean" at level 1 (denoted by ) is the average over all local group means at level 2, i.e., = ∑ / . The level 1 covariance ma- Experiments 4 and 5 use data from [12] for 21 3D landmark points (i.e., 21 × 3 = 63 variables), again as shown in Figure 1. This landmark point data was transformed using a generalised Procrustes analysis [1]. A Procrustes transformation is one that involves translation, rotation, or uniform scaling (or a combination of all of them). Here, it was used to form a common origin, orientation and length scale for all shapes represented by the sets of landmark points with respect to a reference given by the mean shape. This data is used here to form an average covariance matrix (over the two matrices for males and females separately), which is then used in a multivariate normal random number generator (i.e., the mvnrnd() command in MATLAB) in order to create the MC data. Variables are therefore correlated in Experiments 4 and 5. Only two groups of equal sample size (n 1 = n 2 ) that correspond to males and females in the original dataset are employed in Experiments 4 and 5. However, Experiment 4 uses a single mean vector (i.e., an average over subjects for the data in [12]), thereby implying that there is no between-group variation for Experiment 4. Any differences between "males" and "females" in the two groups are therefore spurious in this case and the percentage of total variance due to between-group variation (Equation (3) below) is equal to zero asymptotically with respect to increasing sample size. By contrast, Experiment 5 assumes separate mean shape vectors for males and females (again obtained directly from data in [12]), which implies that between-group variance at level 1 is non-zero in this case. Note that we find a value of 10.4% for the percentage of total variance due to between-group variation (Equation (3) below) directly from the original experimental data (n 1 = 124; n 2 = 126) in this case. A total of 100 MC datasets are used in all simulations for Experiments 4 and 5.

Multilevel Principal Components Analysis (mPCA)
Features (i.e., 300 variables here for Experiments 1 to 3 and 63 components for Experiments 4 and 5) are represented by a vector z. Single-level PCA is carried out by finding the mean vector µ over all data points and a covariance matrix given by k 1 and k 2 indicate elements of this covariance matrix and i refers to a given data point in the set. The eigenvalues λ l and (orthonormal) eigenvectors u l of this matrix are found readily. Note that the rank of this covariance matrix (and so also the number of non-zero eigenvalues) is limited to n − 1. For PCA, one ranks all the eigenvalues into descending order, and one retains the first p 1 components in the model. The vector z is modeled by The coefficients {a l } (also referred to here as "component scores") are found readily by using a scalar product with respect to the set of orthonormal eigenvectors, i.e., a l = u l ·(z − µ), for a fit of the model to a new vector z.
The mPCA model used here is illustrated schematically in Figure 2. Note that separate covariance matrices are found at levels 1 and 2 for mPCA. Group means at level 2 are denoted µ 2 l and the covariance matrix Σ 2 at level 2 is just the average of all of the "local" covariance matrices Σ 2 l for each group l. (The rank of each of these covariance matrices is limited to n l − 1). The overall "grand mean" at level 1 (denoted by µ 1 ) is the average over all local group means µ 2 l at level 2, i.e., µ 1 = ∑ m l=1 µ 2 l /m. The level 1 covariance matrix is given by , where m is the number of groups. (The rank of this covariance matrix is limited to m − 1). Both of these covariance matrices are diagonalised separately, where each eigenvalue at level 1 is denoted by λ 1 l , with associated eigenvector u 1 l , and each eigenvalue at level 2 is denoted by λ 2 l , with associated eigenvector u 2 l . We rank the eigenvalues into descending order at each level of the model separately, and then we retain the first p 1 and p 2 eigenvectors of largest magnitude at the two levels. The percentage variation at level 1 via mPCA with respect to the overall variation is The vector z is modeled for the two-level model shown in Figure 2 by The coefficients a 1 l and a 2 l (also referred to as "component scores" here) are determined for mPCA by using a global optimisation procedure in MATLAB.

Maximum Likelihood Solution
In order to find the maximum likelihood solution, we assume an expression for the likelihood function of a two-level model that is given by N(z i µ 2 l Σ 2 ) is a multivariate normal distribution, where µ 2 l is the mean for group l and Σ 2 is the covariance matrix at level 2 (assuming here a common covariance matrix at this level as for mPCA). n l is sample size for group l and m is the number of groups.
N(µ 2 l µ 1 Σ 1 ) is another multivariate normal distribution, where µ 1 is the "grand" mean and Σ 1 is the covariance matrix at level 1. The associated log likelihood (LL) is The maximum likelihood solution is therefore given by and Equations (7) and (8) are (almost) identical to the equations used in mPCA presented above when sample sizes per group n l are equal to each other for all groups l, although they are "population" rather than "sample" covariance matrices in this case (i.e., there is a factor of either m or n in the denominator rather than m − 1 or n − 1, respectively). We can diagonalise these "weighted" estimates of the covariance matrices. Those groups with larger sample sizes will have a commensurately larger influence on the covariance matrices (and means) than those groups with smaller sample sizes. This approach should therefore address problems in mean and covariance matrix estimation due to imbalances in sample sizes across groups. Indeed, this approach seems very similar (if not identical) to the weighting scheme proposed in [9]. Results from Experiment 2 from standard mPCA are denoted Experiment 2a below and results from the "weighted" covariance matrices Equations (7) and (8) are denoted Experiment 2b.

Results
Results for the eigenvalues from mPCA and single-level PCA for Experiment 1 are shown in Figure 3. The magnitude of these eigenvalues at level 1 via mPCA (with respect to the total variation) reduces strongly with increasing sample size per group n l , as shown in Figure 3. The percentage variation at level 1 via mPCA also reduces strongly with increasing values of n l , as shown in Table 2. Indeed, both measures are clearly tending towards the correct asymptotic value of zero in the limit of "infinite" sample size per group. The average sum of eigenvalues for single-level PCA over all MC simulations is (to within expected sampling error) equal to 300 for all values of sample size per group n l . It is stated in [9] that mPCA underestimates the variation due to within-group effects (i.e., at level 2 of the mPCA model). However, we find that the sum of eigenvalues at level 2 for the mPCA model averaged over all MC simulations is also (again to within expected sampling error) equal to 300 in all simulations, which seems apparently to contradict this statement. Figure 4 shows that results for the sum of eigenvalues over both levels via mPCA extrapolate to the correct value of 300 in the limit n l → ∞ . We see also from Figure 3 that the curves for the eigenvalues via single-level PCA and level 2 mPCA become flatter as we increase the sample size per group n l , which agrees with the Marchenko-Pastur theorem [21]. a = 1, the text following an equation need not be a new paragraph. Please punctuate equatio regular text. This is the example 2 of equation:  Results for component scores are also given in Figure 3. As in [8], we find that strong apparent differences seem to occur between groups occurs at level 1 of the mPCA model. We see from Figure 3 that this occurs also via single-level PCA, albeit to a lesser extent. Again, these differences are due to random sampling effects and so are spurious. (Note that group centroids of component scores at level 2 via mPCA were indeed congruent with the origin for all MC simulated datasets and in all experiments carried out here). Differences between groups in Figure 3 become less pronounced for both mPCA and single-level PCA as the sample size per group is increased. Indeed, very strong overlap occurs in components scores and spurious differences between groups are quite small for a sample size per group of n l = 300 at level 1 via mPCA, as shown by the group centroids in this figure. Experiment 1 shows that random differences between groups that are spread over all 300 variables (and therefore probably also over possible principal components via traditional single-level PCA) are now being concentrated in just two components at the level 1 of the mPCA model. Experiment 1 indicates (as a "rule of thumb" only) that the sample sizes per group should at least be of similar magnitude to the number of variables, i.e., 300, in order to obtain reasonable results. Differences between groups in Figure 3 become less pronounced for both mPCA and single-level PCA as the sample size per group is increased. Indeed, very strong overlap occurs in components scores and spurious differences between groups are quite small for a sample size per group of = 300 at level 1 via mPCA, as shown by the group centroids in this figure. Experiment 1 shows that random differences between groups that are spread over all 300 variables (and therefore probably also over possible principal components via traditional single-level PCA) are now being concentrated in just two components at the level 1 of the mPCA model. Experiment 1 indicates (as a "rule of thumb" only) that the sample sizes per group should at least be of similar magnitude to the number of variables, i.e., 300, in order to obtain reasonable results. Results for the eigenvalues from mPCA and single-level PCA for Experiment 2a are shown in Figure 5. In this case, the sample sizes per group are varied for groups 1 and 2 only, whereas group 3 has = 10 in all simulations. The magnitude of these eigenvalues at level 1 mPCA shown in Figure 5 and percentage variation shown in Table 2 reduce with increasing sample size per group, n, although they are clearly "saturating" by n = 100 for groups 1 and 2. It is clear that the covariance matrices at both level 1 and 2 via mPCA are being very strongly affected by the small sample size in group 3. For example, eigenvalues at level 2 via mPCA exhibit a strange "spike" for low values of eigenvalue number. Eigenvalues at level 1 via mPCA are higher than those in Figure 3, where sample sizes are equal across all groups. However, we again find that the sum of eigenvalues for both singlelevel PCA and mPCA at level 2 is equal to 300 (again to within expected sampling error). Table 2 shows that the percentage variance does not tend to correct value of zero percent; a result is driven by the small sample size in group 3. Figure 4 shows that results for the sum of eigenvalues at both levels via mPCA do not extrapolate to the correct value of 300 in the limit → ∞ in Experiment 2a. Results for the eigenvalues from mPCA and single-level PCA for Experiment 2a are shown in Figure 5. In this case, the sample sizes per group are varied for groups 1 and 2 only, whereas group 3 has n 3 = 10 in all simulations. The magnitude of these eigenvalues at level 1 mPCA shown in Figure 5 and percentage variation shown in Table 2 reduce with increasing sample size per group, n, although they are clearly "saturating" by n = 100 for groups 1 and 2. It is clear that the covariance matrices at both level 1 and 2 via mPCA are being very strongly affected by the small sample size in group 3. For example, eigenvalues at level 2 via mPCA exhibit a strange "spike" for low values of eigenvalue number. Eigenvalues at level 1 via mPCA are higher than those in Figure 3, where sample sizes are equal across all groups. However, we again find that the sum of eigenvalues for both single-level PCA and mPCA at level 2 is equal to 300 (again to within expected sampling error). Table 2 shows that the percentage variance does not tend to correct value of zero percent; a result is driven by the small sample size in group 3. Figure 4 shows that results for the sum of eigenvalues at both levels via mPCA do not extrapolate to the correct value of 300 in the limit n l → ∞ in Experiment 2a.  Figure 5 show that group 3 is also a clear outlier for both mPCA, level 1 and single-level PCA. (Again, group centroids for mPCA at level 2 are congruent with the origin). It is noticeable that group 3 produces an outlying result in Figure 5 even for single-level PCA for n 1,2 = 300. Spurious differences are exaggerated for mPCA compared to single-level PCA, especially between groups 1 and 2 compared to group 3 for mPCA. Reasonable results for component scores via mPCA are never achieved in Experiment 2a due to the low sample size in group 3.

Component scores in
Results for the eigenvalues from mPCA and single-level PCA for Experiment 2b are shown in Figure 6. Again, the sample sizes per group are varied for groups 1 and 2 only, whereas group 3 has n 3 = 10 in all simulations. We see that eigenvalues for level 2 mPCA now are of very similar magnitude to results of single-level PCA. Indeed, we see that problems with leading eigenvalues for both levels 1 and 2 mPCA due to imbalances in sample sizes appear to have been removed in Figure 6 by the weighted form of the covariance matrices, which is an encouraging result. Eigenvalues for level 2, mPCA are very slightly lower in magnitude in Figure 6 than single-level PCA because Equations (7) and (8) are essentially population rather than sample covariance matrices, although this effect reduces quickly with increasing sample size per group n 1,2 . Figure 4 shows that results for the sum of eigenvalues via mPCA extrapolate to the correct value of 300 in the limit n l → ∞ for Experiment 2b. Version March 4, 2022 submitted to Journal Not Specified We see from Figure 6 that problems of spurious differences between groups are not removed by the weighted forms of Equations (7) and (8). Differences between groups that are contained in all variables (and again are probably spread over all components via single-level PCA) are again being concentrated in just two components at level 1 via mPCA. These spurious differences reduce strongly between groups 1 and 2 with increasing sample sizes in these groups, although they persist between groups 1 and 2 compared to group 3 even up to n 1,2 = 300, as shown in Figure 6. Reasonable results for component scores via mPCA are never achieved in Experiment 2b due to the low sample size in group 3, even when the weighted forms of the covariance matrices are used. Figure 7 shows results for Experiment 3 in which between-group variation is added to the data, where the means of each group now follow a normal distribution with means equal to zero and a standard deviation of 0.25. Table 2 shows that mPCA is clearly tending towards the theoretical value of 5.9% with increasing sample size per group. Note that the sum of eigenvalues at level 2 via mPCA is again equal to 300 (within expected sampling error). Figure 4 demonstrates that the sum of eigenvalues over both levels via mPCA scale approximately linearly with n −1 l to a value of 318.55 in the limit n l → ∞ . This is good agreement with the aysmptotic value of 318.75, although even better correspondence would presumably also be obtained by including higher values of n l in the regression data in Figure 4. Figure 4 shows also that the values for the sum of all eigenvalues via single-level PCA is approximately flat with respect to n −1 l . Indeed, the total variation captured by single-level PCA is clearly well below the asymptotic overall total value of 318.75. Figure 7 shows that both mPCA and single-level PCA overestimate differences between groups in component scores for small sample sizes per group. However, the broad pattern in the component scores for mPCA (and single-level PCA) has largely converged for sample size per group n l = 200 (not shown here) and it has certainly converged by the time that n l = 300 is reached (shown in Figure 7). Again, Experiment 3 indicates again (as a rough-and-ready "rule of thumb") that reasonable results are obtained when the sample sizes in all groups are of similar magnitude to the number of variables, i.e., 300 here. Results for the eigenvalues from mPCA and single-level PCA for Experiments 4 and 5 are shown in Figures 8 and 9. These results for the correlated data in Experiments 4 and 5 are very similar to those earlier results from Experiments 1 to 3, which involved uncorrelated data. We see from Figure 8 and from Table 2 that variances at level 1 mPCA for Experiment 4 reduces with increasing sample size per group. Figure 10 shows that the sum of eigenvalues at both levels via mPCA scales approximately linearly with n −1 l for Experiment 4 and that this line extrapolates to a value that is very close that of single-level PCA in the limit n l → ∞ . By contrast, Figure 9 shows that eigenvalues at level 1 do not tend to zero as the sample size per group increases for Experiment 5 and we note again that between-group variation has been explicitly added to the MC data in this case. Figure 10 shows that the sum of eigenvalues at both levels via mPCA scales approximately linearly with n −1 l for Experiment 5 and that it extrapolates in the limit n l → ∞ to a value that is much larger than that from single-level PCA. Table 2 shows that the percentage of variance explained by level 1 via mPCA for Experiment 5 converges to a non-zero value probably near to about 10%.  Figures 8 and 9 for Experiments 4 and 5 again also show a very similar pattern to those results in Experiments 1 to 3. Strong initial differences between groups in component scores via mPCA (and single-level PCA to some extent also) reduce strongly as sample sizes per group increased in Experiment 4. Indeed, it is noticeable in Figure 8 that differences between groups via mPCA are fairly small for n l = 100. By contrast, differences in component scores between groups via mPCA are observed for all sample sizes per group in Experiment 5, where between-group variation has been added explicitly to the MC data, for both single-level PCA and mPCA. Indeed, Figure 9 shows that differences between groups via mPCA are fairly similar for n l = 100 compared to n l = 300. Experiments 4 and 5 indicate again (and very broadly) that reasonable results are obtained when the sample sizes in all groups are of similar magnitude to the number of variables, i.e., 63 components for Experiments 4 and 5.

9
Please punctuate equations as regular text. Theorem-type environments (inclu propositions, lemmas, corollaries etc.) can be formatted as follows: The text continues here. Proofs must be formatted as follows: Proof of Theorem 1. Text of the proof. Note that the phrase "of Theorem 1" is optio it is clear which theorem is being referred to.
The text continues here.

Discussion
Authors should discuss the results and how they can be interpreted from the pe tive of previous studies and of the working hypotheses. The findings and their implica should be discussed in the broadest context possible. Future research directions ma be highlighted.

Conclusions
This section is not mandatory, but can be added to the manuscript if the discuss unusually long or complex.

Patents
This section is not mandatory, but may be added if there are patents resulting fro work reported in this manuscript.

Discussion
An exploration of "pathologies" of mPCA was carried out here by considering a twolevel mPCA model that is equivalent to bgPCA. It was clear that spurious differences between groups due to random sampling effects contained in all variables in Experiments 1, 2, and 4 were concentrated in the (relatively few) components at level 1 for mPCA. This effect meant that mPCA therefore falsely gave an impression of strong differences in component scores where in truth there were none. As stated in [8,9,[17][18][19][20], pathologies of bgPCA (and therefore also mPCA) do exist, mostly strikingly in terms of interpretation of

Discussion
An exploration of "pathologies" of mPCA was carried out here by considering a two-level mPCA model that is equivalent to bgPCA. It was clear that spurious differences between groups due to random sampling effects contained in all variables in Experiments 1, 2, and 4 were concentrated in the (relatively few) components at level 1 for mPCA. This effect meant that mPCA therefore falsely gave an impression of strong differences in component scores where in truth there were none. As stated in [8,9,[17][18][19][20], pathologies of bgPCA (and therefore also mPCA) do exist, mostly strikingly in terms of interpretation of these component scores when sample sizes are low in any of the groups. However, these spurious differences in component scores via mPCA decreased significantly as the sample sizes per group were increased.
Note that 3D facial scanning of human subjects can be costly because it can be time consuming and/or labour intensive. Sample sizes might often measured in tens of subjects only, where such "pathologies" are likely to manifest. However, sample sizes per group can be a problem even for large-scale epidemiological studies in humans if one is interested for example in subsets of subjects with rare syndromes that can affect facial appearance and shape. Similar problems might occur in archaeology or palaeontology, where the number of samples to be scanned might naturally be constrained. Another limitation of mPCA is that the rank of covariance matrices at higher levels of the model are limited to the number of groups minus one. In practice, this places a limit on the number of non-zero eigenvalues at these levels.
Imbalances in sample sizes in different groups can be addressed by using a form weighted covariance matrices inspired by the maximum likelihood solution, previously also suggested in [9]. Our results suggested that this "weighting" had a beneficial effect on covariance matrices and eigenvalues. However, weighting did not solve all of the problems of spurious differences in component scores between groups, which were due here to a very small sample size in one group. Experiment 2 demonstrates that misleading results for component scores persist via mPCA if the sample sizes are low in any of the groups. Notably, the usefulness of such "weighting" was questioned also in [8]. However, such weighting schemes might be useful when such imbalances occur and when sample sizes per group are sufficiently large enough in all groups. This topic requires more investigation in future.
Our calculations also indicated that single-level PCA underestimated the total amount of variance when between-group variation was introduced explicitly to the data generation model in Experiment 3. For example, Figure 4 showed that mPCA results extrapolated to a value that was very close to the theoretical asymptotic value for the total variation in Experiment 3, whereas single-level PCA did not. These results were also supported by evidence in Table 2. However, this is exactly what one would expect as the models used in MC data generation and via mPCA were essentially identical. Very similar results were seen in Experiment 5 where between-group variation was introduced to correlated 3D MC data representing 3D facial shape. Traditional PCA is essentially just a single-level method and so one would not expect it to capture the effects of such multilevel structures and/or of "clustering." Interestingly, the sums of eigenvalues via mPCA scaled approximately linearly with the inverse of the sample size per group in all Experiments 1 to 5, which is another potentially important result of this research. We speculate that this might be another manifestation of the Marchenko-Pastur theorem [21].
The simulations presented here for the uncorrelated normally distributed variables in Experiments 1 to 3 are the most severe (and artificial) test of both mPCA and single-level PCA as any apparent structure to the data is due purely to random sampling effects. It was noted (e.g., in [9]) that these problems of spurious differences between group for bgPCA are reduced when the multivariate data is correlated, which generally is the case in reality, e.g., for shape data. The evidence from Experiments 4 and 5, in which correlated multivariate normally distributed data was generated, were inconclusive in relation to this claim specifically, although they do not contradict it. However, the total number of variables was much lower in Experiments 4 and 5 compared to Experiments 1 to 3, which makes it harder to compare results on an equal footing. Experiments 4 and 5 do underline that the results presented here are clearly relevant to modelling shape, such as those illustrated by Figure 1 for 3D facial shape and as described in [11][12][13][14][15][16][17].
The results of this work show broadly that reasonable results ought to be obtained when sufficiently large sample sizes per group are used in all groups. As a "rule of thumb" only, sample sizes per group in all groups should be at least equal to the number of variables. However, modes of variation from mPCA should also always be examined critically and they should be compared to known results in the literature where they are known to exist, e.g., known changes in facial shapes in humans due to sex [12]. The author of [8] presents a detailed list of recommendations about the use (or refraining from use) of bgPCA in relation to biological morphometrics. The authors of [19] propose cross-validation as a method of overcoming these problems, whereas the authors of [20] propose a mixture of permutation tests and cross-validation. The interested reader is referred to [8,19,20] for more details.
Funding: There was no external funding for this research.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Synthetic data was generated using Monte Carlo methods specified in the Section 2.1 and so this is not applicable.