Unsupervised inference approach to facial attractiveness

The perception of facial attractiveness is a complex phenomenon which depends on how the observer perceives not only individual facial features, but also their mutual influence and interplay. In the machine learning community, this problem is typically tackled as a problem of regression of the subject-averaged rating assigned to natural faces. However, it has been conjectured that this approach does not capture the complexity of the phenomenon. It has recently been shown that different human subjects can navigate the face-space and “sculpt” their preferred modification of a reference facial portrait. Here we present an unsupervised inference study of the set of sculpted facial vectors in such experiments. We first infer minimal, interpretable and accurate probabilistic models (through Maximum Entropy and artificial neural networks) of the preferred facial variations, that encode the inter-subject variance. The application of such generative models to the supervised classification of the gender of the subject that sculpted the face reveals that it may be predicted with astonishingly high accuracy. We observe that the classification accuracy improves by increasing the order of the non-linear effective interaction. This suggests that the cognitive mechanisms related to facial discrimination in the brain do not involve the positions of single facial landmarks only, but mainly the mutual influence of couples, and even triplets and quadruplets of landmarks. Furthermore, the high prediction accuracy of the subjects’ gender suggests that much relevant information regarding the subjects may influence (and be elicited from) their facial preference criteria, in agreement with the multiple motive theory of attractiveness proposed in previous works.


Introduction to the Maximum Entropy principle: Correlations vs effective interactions
Consider an $n$-dimensional space of vectors, $x = (x_i)_{i=1}^n \in \chi$, along with a set of $K$ observables, $O_k : \chi \to \mathbb{R}$, $k = 1, \dots, K$. The maximum entropy approach [1,2,3,4] provides the most probable probability distribution $P(x|\lambda)$, $x \in \chi$, which is consistent with fixed values of the observables, in the sense that their averages according to $P$ are constrained to assume fixed values:

$\langle O_k \rangle_P = o_k, \qquad k = 1, \dots, K. \qquad (0.1)$

In other words, the maximum entropy probability distribution $P_{\rm me}$ is the one exhibiting highest entropy (i.e., the most random, or least structured, distribution) subject to the constraints (0.1), and to no other constraint. It assumes the form:

$P_{\rm me}(x|\lambda) = \frac{1}{Z(\lambda)} \exp\Big( \sum_{k=1}^K \lambda_k O_k(x) \Big), \qquad (0.2)$

$Z(\lambda)$ being a normalizing constant. The maximum entropy probability distribution is, hence, formally identical to a Maxwell-Boltzmann distribution in the canonical ensemble at temperature $=1$, with effective Hamiltonian $H = -\sum_k \lambda_k O_k$. It is important to remark that no assumption at all has been made about thermal equilibrium, ergodicity, or the existence of an effective interaction in energy units: the Maxwell-Boltzmann form is a consequence of the maximum entropy assumption (reflecting, rather, the absence of hypotheses) for a probability distribution subject to constraints on the averages of some operators. The values of the Lagrange multipliers $\lambda$ in (0.2) are such that the constraints in (0.1) are satisfied.
In the context of unsupervised statistical inference, one infers from a finite number $M$ of experimental measures $x^{(m)}$, $m = 1, \dots, M$, to which correspond the observable values $o_k^{(m)} = O_k(x^{(m)})$. The maximum entropy distribution provides a generative probabilistic model for the data, which is aimed to be a faithful representation of the experimental dataset and, at the same time, a generalisation of the dataset, not too dependent on the specific realisation of the database that is being inferred. For this reason, $P$ is chosen to reproduce the experimental value of a limited set of observables, depending on the dataset:

$\langle O_k \rangle_P = \frac{1}{M} \sum_{m=1}^M o_k^{(m)}. \qquad (0.3)$

Ideally, a faithful and general model should be consistent with the minimum set of experimental averages that allows one to reproduce some essential database properties and that, at the same time, may be significantly inferred given the database finiteness. The parameters $\lambda$ are called effective interactions, in the language of statistical physics. In the case in which the observables to be reproduced by $P$ are the data correlations of order $n \le p$ (where the correlations of order $n$ are defined as $C^{(n)}_{i_1,\dots,i_n} = \langle x_{i_1} \cdots x_{i_n} \rangle$), the effective interactions assume the form of $n$-th order tensors $J^{(n)}$ coupling $n$-plets of vector coordinates, with $n = 1, \dots, p$.
A self-consistency criterion for the choice of the sufficient statistics $O_k$ is that of calculating different nontrivial observables according to $P$ (different from the sufficient statistics, i.e., observables that $P$ is not required to reproduce by construction, and that cannot be expressed in terms of the sufficient statistics), and comparing them with their experimental counterparts. In particular, a criterion is that of choosing the $n \le p$-th order correlations as sufficient statistics, with $p$ being the minimum value such that the $(p+1)$-th order experimental correlations are satisfactorily reproduced by $P_{\rm me}$ (i.e., $\langle x_{i_1} \cdots x_{i_{p+1}} \rangle \simeq \langle x_{i_1} \cdots x_{i_{p+1}} \rangle_P$), and such that all the parameters corresponding to such sufficient statistics may be significantly inferred from the data.
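This criterion can be sketched numerically. A minimal illustration on synthetic data (not from the experiment): draw samples from a zero-mean Gaussian, take the pairwise correlations as sufficient statistics, and check that a third-order correlation, which the model is not required to reproduce by construction, is nevertheless reproduced (a zero-mean Gaussian, i.e., a $p=2$ model, predicts it to vanish).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: M samples from a zero-mean Gaussian with a
# chosen correlation matrix (illustrative numbers).
M, n = 200_000, 3
C_true = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.3],
                   [0.2, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(n), C_true, size=M)

# Experimental second-order correlations: the sufficient statistics.
C_exp = X.T @ X / M

# An experimental third-order correlation, e.g. <x_0 x_1 x_2>, which the
# pairwise (Gaussian) model predicts to be exactly zero.
c3_exp = np.mean(X[:, 0] * X[:, 1] * X[:, 2])

# If the discrepancy is within sampling error, p = 2 is sufficient.
print(abs(c3_exp))  # close to 0 for Gaussian-distributed data
```

For genuinely non-Gaussian data the discrepancy would not shrink with $M$, signalling that higher-order sufficient statistics are needed.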
Correlations and effective interactions. The effective interaction tensors $J^{(n)}$ may admit, in certain circumstances, an interpretation regarding the mutual effective influence among variables, beyond the statistical correlation among them (whose experimental value is $C^{(n)}$). Correlations and effective interactions are actually different. Focusing for simplicity on $p = 2$, the pairwise correlations are but the statistical consequence of the effective interactions among couples of landmarks causing them. This is the case in the direct problem (the calculation of $C^{(2)}$ from $J^{(2)}$): here it may happen that the matrix $J^{(2)}$ is sparser than $C^{(2)}$, i.e., there are couples of variables not influencing each other that, nevertheless, result statistically correlated. In the direct problem, the Maximum Entropy method may allow for a discrimination of the spurious correlations of couples of components that are correlated although they do not influence each other (but are, instead, commonly and mutually influenced by other components). This is a frequent phenomenon in biological data [5,6], with an obvious interpretation in statistical-physics terms: in the general case, the mutual influence among a sparse set of couples of bodies propagates statistically, leading to emergent, collective phenomena. A paradigmatic and extreme case of this general phenomenology is critical behaviour [7], in which microscopic interactions lead to macroscopic correlations: long-range and high-order body correlations originate from short-range, sparse and pairwise interactions [5].
Probably the simplest illustration of the emergence of spurious statistical correlations is that of three variables ($x_1, x_2, x_3$ in fig. 0.1), of which the first is strongly interacting with the other two (in the figure $J_{12} = J_{13} = 2$), while the second and third are moderately interacting (or even negatively interacting, as in the figure: $J_{23} = -1$). Such information is not accessible from the emerging correlations $\langle x_i x_j \rangle_{P(\cdot|J)}$ (in the direct problem), which reveal a strong, positive correlation among all the variables. Conversely, in the inverse problem (i.e., when an empirical correlation matrix is given, resulting from an average over a sufficiently high number of measures), the Maximum Entropy method may provide not only a generative model, $P(\cdot|J^{(n)*})$, but also the most probable interaction matrices $J^{(n)*}$, suggesting that, indeed, the correlation among 2 and 3 is (most likely, given the data and the sufficient statistics, and within the Maximum Likelihood hypothesis) a statistical consequence of the mutual influences of 1, 2 and of 1, 3. This information is not unambiguously elicited from the data, but is the result of an inference procedure: the most probable guess given the inference model and the ambiguity induced by the data finiteness.

[Figure 0.1: the line width is proportional to the absolute value $|A_{ij}|$ of the corresponding matrix element. The dashed line in the $J$ triangle indicates that $J_{23}$ is negative (i.e., there is a tendency of 2 to decrease when 3 increases and vice-versa). Such a tendency is, however, not reflected in the correlation matrix, which presents all positive elements.]
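A toy numerical version of this three-variable example (the numbers are illustrative, not those of fig. 0.1; we adopt the Gaussian convention in which negative off-diagonal precision entries are affine couplings and the direct problem is solved by $C = J^{-1}$):

```python
import numpy as np

# Illustrative precision matrix: variables 1-2 and 1-3 are strongly,
# affinely coupled (entries -2), while 2-3 carry an antagonistic
# coupling (entry +1).
J = np.array([[ 3.0, -2.0, -2.0],
              [-2.0,  3.0,  1.0],
              [-2.0,  1.0,  3.0]])

# Direct problem: the model correlations of the Gaussian P ~ exp(-x'Jx/2).
C = np.linalg.inv(J)

print(np.round(C, 3))
# All off-diagonal correlations come out positive, including C_23: the
# 2-3 correlation is spuriously propagated by the strong 1-2 and 1-3
# couplings, despite the antagonistic direct coupling between 2 and 3.
```

Running the inverse problem on such a $C$ (i.e., inverting it back) recovers the sparse, partly antagonistic $J$, which is the disambiguation discussed in the text.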

Maximum Entropy inference from pairwise correlations with a priori constraints
In this section we solve the problem of the Maximum Entropy (MaxEnt) inference from pairwise correlations (i.e., p = 2), in the presence of linear constraints involving the coordinates.
In the absence of constraints, the Maximum Likelihood solution to the problem, equation (0.3), is analytic and straightforward. Suppose that one infers from a database composed of $S$ experimental $D$-dimensional vectors, $\{x^{(s)}\}_{s=1}^S$. The sufficient statistics to infer from is, by hypothesis, the correlation matrix (supposing null-average vectors): $C_{ij} = \langle x_i x_j \rangle$, where $\langle \cdot \rangle$ represents, as before, the experimental average: a symmetric, positive definite matrix. The MaxEnt probability distribution is, consequently, the multivariate normal distribution:

$P(x|J) = \sqrt{\frac{\det J}{(2\pi)^D}}\, \exp\Big( -\frac{1}{2} x^\dagger J x \Big). \qquad (0.4)$

The Maximum Likelihood solution for the matrix $J$, $J^*$, is the one for which the theoretical pairwise correlations $\langle x_i x_j \rangle_P = (J^{-1})_{ij}$ coincide with the experimental correlations $C_{ij}$. This is satisfied whenever $J^* = C^{-1}$.
We now consider the presence of linear constraints involving the coordinates, $x_i$. Each linear constraint may be expressed in the form $a_j^\dagger x = c_j$, where $a_j$ is a real $D$-dimensional vector and $c_j$ a real constant, for the $j$-th constraint. If all the vectors in the database $\{x^{(s)}\}_{s=1}^S$ are subject to the constraints, each constraint induces a null mode (a zero eigenvalue) in the experimental covariance matrix. In this case, the Maximum Likelihood solution to the problem, i.e., the probability distribution $P(\cdot|J)$ such that $\langle x_i x_j \rangle_P = C_{ij}$, cannot simply be $J^* = C^{-1}$, since matrix $C$ actually exhibits a vanishing determinant.
We will see that the Maximum Likelihood solution in this case is $J^* = C^{-1}$, where the $-1$ exponent now denotes the pseudo-inverse operation, a generalisation of the matrix inverse in which the null eigenvalues are discarded. We define the pseudo-inverse of the real, square matrix $A$ as:

$(A^{-1})_{ij} = \sum_{k:\, \lambda_k \neq 0} \frac{1}{\lambda_k}\, e^{(k)}_i e^{(k)}_j,$

where $\lambda_k$ and $e^{(k)}_j$ are the $k$-th eigenvalue and the $j$-th component of the $k$-th eigenvector of $A$, respectively.
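The spectral definition above takes a few lines of code; a minimal sketch for real symmetric matrices (NumPy's `np.linalg.pinv` computes the same object via an SVD, which serves as a cross-check):

```python
import numpy as np

def pseudo_inverse(A, tol=1e-10):
    """Spectral pseudo-inverse of a real symmetric matrix: invert only the
    non-null eigenvalues, discarding the null modes."""
    lam, E = np.linalg.eigh(A)            # lam[k], E[:, k]: k-th eigenpair
    keep = np.abs(lam) > tol              # drop (numerically) null modes
    return (E[:, keep] / lam[keep]) @ E[:, keep].T

# Sanity check on a singular symmetric matrix (illustrative numbers):
# the triangle graph Laplacian, whose null mode is (1, 1, 1).
A = np.array([[ 2.0, -1.0, -1.0],
              [-1.0,  2.0, -1.0],
              [-1.0, -1.0,  2.0]])
print(np.allclose(pseudo_inverse(A), np.linalg.pinv(A)))  # True
```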
We first consider the solution of the direct problem, $\langle x_i x_j \rangle_P$ from $J$, in a situation in which the interaction matrix $J$ is such that ${\rm rank}(J) = r < D$. In other words, $J$ exhibits $D - r$ null eigenvalues. Suppose that the eigenvalues $\lambda_j$ of $J$ are ordered in decreasing order, so that $\lambda_j = 0$ for $j = r+1, \dots, D$. In this case, the probability distribution $P(x)$ in equation (0.4) is, trivially, constantly zero, since the determinant of matrix $J$ vanishes. However, we can define a real function in the space of the $D$-dimensional variables $x$:

$\tilde P(x) = \frac{1}{\tilde Z} \exp\Big( -\frac{1}{2} x^\dagger J x \Big),$

where $\tilde Z$ is a normalising factor involving the non-zero eigenvalues of $J$ only:

$\tilde Z = \prod_{k=1}^{r} \sqrt{\frac{2\pi}{\lambda_k}}.$

The function $\tilde P$ may be considered a normalised probability distribution, but only over the $r$-dimensional subspace of $\mathbb{R}^D$ spanned by the first $r$ eigenvectors of $J$: $S_+ = {\rm span}\{e^{(k)}\}_{k=1}^r$, with $1 \le k \le r$. In other words, $\tilde P$ is a probability distribution on the subspace $S_+$ of $\mathbb{R}^D$, defined by the vectors that are already subject to the constraints, for any values $c_j$ of the constants associated to the constraints.
One can easily define a proper, normalised probability distribution $P$ on $\mathbb{R}^D$ by regularising the null modes associated to the constraints:

$P(x) = \tilde P(x) \prod_{j=r+1}^{D} \delta(x_j - c_{j-r}), \qquad (0.8)$

where the $(D-r)$-dimensional vector $(x_{r+1}, \dots, x_D)$ is the projection of $x$ on a basis of vectors spanning the space of the constraints (as the vectors $a_j$ defining the constraints, above: $x_{j+r} = a_j^\dagger x$). On the other hand, we define the $r$-dimensional vector $\bar x = (\bar x_1, \dots, \bar x_r)$ as the projection of $x$ over the first $r$ eigenvectors of $J$ (those associated to a non-null eigenvalue): $\bar x = E x$, where $E$ is the $r \times D$ matrix of row-disposed eigenvectors, $E_{ij} = e^{(i)}_j$.
To each of the null eigenvalues corresponding to a constraint $a \cdot x = c$ is associated an invariance of $\tilde P$ with respect to the linear operator $G(c')$ that changes the value of the constraint, i.e., such that $a \cdot (G(c')x) = c'$:

$\tilde P(G(c')x) = \tilde P(x). \qquad (0.9)$

Indeed, $G$ acts on the subspace $S_0$ only, while it is the identity on the subspace $S_+$ (where $S_0$ is defined as the complement of $S_+$, i.e., $\mathbb{R}^D = S_+ \times S_0$). In the physical language, each eigenvector corresponding to a constraint is called a null mode, and represents a symmetry, reflected in the invariance of the function $\tilde P$ with respect to it. The function $P$, in its turn, represents vectors for which the symmetry is broken, as the value of the constraint has been fixed. For example, if the vectors are constrained to have a constant sum of their components, $\sum_{i=1}^D x_i = c$, the corresponding eigenvector, or null mode, $e$, has all its components equal to $e_i = D^{-1/2}$. Consequently, the function $\tilde P$ is invariant under uniform translations of all the components (the transformations that change the value of $c$).
We are now interested in the calculation of a general $n$-th order cumulant $\langle x_{s_1} \cdots x_{s_n} \rangle_P$ according to the distribution $P$ in (0.8), with $s_j = 1, \dots, D$. The $n$-th order cumulant is related to the $n$-th order derivative of the generating function through the standard cumulant expansion equation:

$\langle x_{s_1} \cdots x_{s_n} \rangle_P = \frac{\partial^n \ln \tilde Z[h]}{\partial h_{s_1} \cdots \partial h_{s_n}}\Big|_{h=0}, \qquad (0.10)$

where the generating function $\tilde Z[h]$ has the form:

$\tilde Z[h] = \int_{S_+} {\rm d}x\; e^{-\frac{1}{2} x^\dagger J x + h^\dagger x}.$

We notice that $\tilde Z[0] = \tilde Z$. We would like an analytical expression for $\tilde Z[h]$. Using the relations $\bar x = Ex$ and $J = E^\dagger \Lambda E$, where $\Lambda$ is the $r \times r$ diagonal matrix whose diagonal is $\lambda_1, \dots, \lambda_r$, one obtains:

$\tilde Z[h] = \int {\rm d}\bar x\; e^{-\frac{1}{2} \bar x^\dagger \Lambda \bar x + \bar h^\dagger \bar x},$

where $\bar h = Eh$. Using Gaussian integration rules, one finds:

$\tilde Z[h] = \tilde Z\, \exp\Big( \frac{1}{2} \sum_{k=1}^r \frac{\bar h_k^2}{\lambda_k} \Big) = \tilde Z\, e^{\frac{1}{2} h^\dagger J^{-1} h}.$

This equation is the generalisation of the standard expression for the generating function of the multivariate normal distribution to the case in which matrix $J$ is non-invertible. As is evident from the expression of $\tilde Z[h]$ and from the cumulant expansion (0.10), and as happens with the normal distribution, the only non-zero cumulant is the second-order cumulant $\langle x_i x_j \rangle_P$, equal to the correlation in the case of null-averaged vectors. Its form is, from equation (0.10):

$\langle x_i x_j \rangle_P = (J^{-1})_{ij},$

where $J^{-1}$ is the pseudo-inverse of matrix $J$.
The theoretical correlation matrix whose elements are $\langle x_i x_j \rangle_P$ exhibits, as matrix $J$ does, rank equal to $r$. Hence, the theoretical correlation matrix $M_{ij} = \langle x_i x_j \rangle_P$ as a function of $J$ (i.e., the direct problem) is $M = J^{-1}$. Consequently, the $J^*$ that is needed to satisfy $C_{ij} = \langle x_i x_j \rangle_P$ given an experimental correlation matrix $C$ (i.e., the inverse problem) is $J^* = C^{-1}$, where the $-1$ power means the pseudo-inverse operation.
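A round-trip numerical check of this statement, on hypothetical data subject to one linear constraint (a vanishing sum of components, so the empirical correlation matrix has exactly one null mode):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database: every sample satisfies sum_i x_i = 0.
M, D = 50_000, 4
X = rng.normal(size=(M, D))
X -= X.mean(axis=1, keepdims=True)   # enforce the constraint on each sample
X -= X.mean(axis=0)                  # null-average the database
# (column-centering preserves the constraint: the column means sum to 0)

C = X.T @ X / M                      # singular: one null mode
J_star = np.linalg.pinv(C)           # inverse problem via pseudo-inverse

# Direct problem: the model correlations are the pseudo-inverse of J*,
# and they reproduce the experimental correlations.
print(np.allclose(np.linalg.pinv(J_star), C, atol=1e-8))  # True
```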

Constraints in the database of facial modifications
As we have explained in the main text, the facial modifications in the 2017 experiment are defined in terms of a set of 17 2D landmarks, which are redundant in the sense that the positions of some of them may be deduced from 10 coordinates only. The facial landmarks are, in particular, symmetric by construction, and hence the coordinates of each right-side landmark are determined given those of the left side. We have, hence, considered a subset of $n = 8$ landmarks only (see figure 0.2). Furthermore, the landmark coordinates $r_{c,i}$ (where $c = x, y$) in the database $\mathcal{S}$ are still subject to 6 constraints: indeed, $2 \times n - 6 = 10$, the number of degrees of freedom. For instance, the nose endpoint abscissa is constrained to lie in the center of the image, $\Delta_{x,9} = 0$, and the jaw landmark is defined to be at the same height as the mouth, $\Delta_{y,3} = \Delta_{y,7}$. We have intentionally kept such redundant information in the inferred training database. Indeed, the redundant information turns out to be necessary for the correct interpretation of the inference parameters, as we will see in section 0.8.
The database $\mathcal{S} = \{\Delta^{(s)}\}_{s=1}^S$ of facial displacements is, hence, highly constrained in the various facial coordinates. The following constraints hold for all the vectors in the database (see [8] for their explicit form).

[Figure 0.2: Left: the coordinates are the $x, y$ 2D coordinates $r_i$ of a subset of seven landmarks, marked with blue circles. Right: all the landmarks used for the facial deformation algorithm described in [8].]
The last three constraints ensure that the eye aspect ratio remains unchanged with respect to the average facial vector $\Delta = 0$ (otherwise, the image deformation algorithm corresponding to the landmark deformation $0 \to \Delta$ could lead to an ellipse-like shaped eye). Indeed, the constants in the right-hand side of each equation correspond to the value assumed by the left-hand side quantity in the average facial vector. Each of these constraints induces a null mode in one of the correlation matrices $C^{(xx)}$, $C^{(yy)}$, $C^{(xy)}$ (those involving only $x$ coordinates, in $C^{(xx)}$; those involving only $y$'s, in $C^{(yy)}$; those involving an $x$ and a $y$ coordinate, in $C^{(xy)}$). As we have shown before, the inverse problem in this case is solved through the matrix pseudo-inverse operation. In these circumstances, the probability distribution $L(\cdot|J,h)$ described in the main article refers to a probability distribution in the 10-dimensional sub-space of coordinates that are invariant under the symmetries associated to the constraints ($\tilde P$, in the notation of section 0.2). Strictly speaking, to become a proper probability distribution in the space of facial modification vectors $\Delta$, it has to be regularised as in section 0.2:

$P(\Delta) = \tilde P(\Delta) \prod_{\mu:\, \lambda_\mu = 0} \delta(\bar\Delta_\mu - c_\mu),$

where $\tilde P$ is the distribution that in the main article is called $L = \exp(-H)/Z$, the product is over the components of $\Delta$ along the 6 eigenvectors of the global correlation matrix with a null eigenvalue, $\lambda_\mu = 0$, and $\bar\Delta = E\Delta$, with $E$ being the matrix of column-eigenvectors of $C$ (and of $J$).

Correlation vs interaction matrices
In the particular case of our database, the main source of spurious correlations is not collective behaviour but the presence of the a priori constraints among various landmark coordinates, which are imposed in the experimental construction of the face space vectors (see sec. 0.3 and [8] for a precise description of the constraints), and which play the role of the strong interactions 1-2 and 1-3 in fig. 0.1. The MaxEnt method subtracts the effect of such constraints and provides a sparser interaction matrix. Our MaxEnt inference scheme discounts the effect of the constraints since we eliminate the $C$ eigenvectors corresponding to the constraints (through the pseudo-inverse operation $C^{-1}$); see sec. 0.8 for an in-depth discussion. In figure 0.3 we present a comparison among the matrices $C^{(xx)}$ and $J^{(xx)}$, $C^{(yy)}$ and $J^{(yy)}$, $C^{(xy)}$ and $J^{(xy)}$. As a general observation, the effective matrices are sparser than the correlation matrices, as expected. In particular, while both $C^{(xx)}_{6,3}$ and $C^{(xx)}_{7,3}$ are statistically significant, only $J^{(xx)}_{7,3}$ is (the effective interaction attributes the 6-3 correlation to the $\Delta_{7,x} - \Delta_{3,x} = 0$ constraint). The same happens, for instance, with $C^{(yy)}_{1,3}$ and $C^{(yy)}_{0,3}$, both statistically significant, while only $J^{(yy)}_{0,3}$ is (the 1-3 correlation is attributed to the 0-1 constraint).
We conclude that the effective interaction coupling matrix J provides information beyond the experimental correlations, since it disambiguates the correlations propagated by the constraints, attributing them to the effect of a reduced set of couplings. In section 0.8 we illustrate the fact that an alternative method of avoiding the constraints, consisting in fitting a dataset in which the redundant variables are eliminated (instead of keeping them and avoiding the influence of the constraint-eigenvectors), may lead to J matrices whose interpretation is misleading.

Longitudinal and Torsion interaction strengths
The $n \times n$ vertical, horizontal and oblique correlation matrices are defined as the corresponding correlations among landmark fluctuations: $C^{(xx)}_{ij} = \langle \Delta_{i,x} \Delta_{j,x} \rangle$, $C^{(yy)}_{ij} = \langle \Delta_{i,y} \Delta_{j,y} \rangle$, $C^{(xy)}_{ij} = \langle \Delta_{i,x} \Delta_{j,y} \rangle$. The whole $2n \times 2n$ correlation matrix $C$ is defined as $C_{\mu\nu} = \langle \Delta_\mu \Delta_\nu \rangle$, where the $2n$ Greek indices $\mu = (i, c_i)$ denote the $c_i = x, y$ coordinates of the $i$-th landmark. We define analogously the vertical, horizontal and oblique interaction matrices, as the corresponding blocks of the whole interaction matrix. The relation among these matrices is given by:

$\begin{pmatrix} J^{(xx)} & J^{(xy)} \\ J^{(xy)\dagger} & J^{(yy)} \end{pmatrix} = \begin{pmatrix} C^{(xx)} & C^{(xy)} \\ C^{(xy)\dagger} & C^{(yy)} \end{pmatrix}^{-1},$

where the $-1$ power means the pseudo-inverse operation.
In their turn, the longitudinal and torsion interaction matrices, $J^{\parallel}$, $J^{\perp}$, correspond to the displacements along, and normal to, the segment joining the landmarks $i$ and $j$, $\hat e_{ij} = \vec r_{ij}/r_{ij}$, where $\vec r_{ij} = \vec r_j - \vec r_i$ and $r_{ij} = |\vec r_{ij}|$. These are defined so that the matrix elements $J^{\parallel}_{ij}$, $J^{\perp}_{ij}$ are the $J^{(xx)}_{ij}$ and $J^{(yy)}_{ij}$ couplings, but in an ($ij$-dependent) rotated basis such that the $x$-axis coincides with the $i, j$ inter-landmark segment versor, $\hat e_{ij}$. Hence, the $J^{\parallel}$ and $J^{\perp}$ matrices are not obtained by a rotation of the original matrices $J^{(xx)}$ and $J^{(yy)}$. Instead, each $J^{\parallel}_{ij}$ element results from a whole inference procedure in a different basis, depending on the couple $ij$. In particular, $J(\hat e_{ij}) = C(\hat e_{ij})^{-1}$ is the inferred matrix obtained from the pseudo-inverse of the matrix $C(\hat e_{ij})$, the correlation matrix in a coordinate system in which the $x$ axis coincides with the $ij$-segment (in other words, $C(\hat e_{ij}) = R_{ij}\, C\, R_{ij}^\dagger$, where $R_{ij}$ is the 2D rotation matrix by the angle $-\alpha_{ij}$, and the matrix product is over the $x$ and $y$ blocks of matrix $C$).
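The rotated-basis procedure can be sketched for a single landmark pair; a toy implementation (the coordinate ordering $(x_i, x_j, y_i, y_j)$ and the matrix sizes are illustrative assumptions, not those of the article):

```python
import numpy as np

def rotated_couplings(C, alpha):
    """Longitudinal and torsion couplings for one landmark pair: rotate the
    correlation matrix by -alpha so the x-axis lies along the inter-landmark
    segment, solve the inverse problem by pseudo-inversion, and read off the
    (xx) and (yy) cross-couplings. Coordinates ordered as (x_i, x_j, y_i, y_j)."""
    c, s = np.cos(-alpha), np.sin(-alpha)
    # Block rotation acting jointly on the x- and y-blocks of C.
    R = np.block([[c * np.eye(2), -s * np.eye(2)],
                  [s * np.eye(2),  c * np.eye(2)]])
    J_rot = np.linalg.pinv(R @ C @ R.T)
    return J_rot[0, 1], J_rot[2, 3]     # J_parallel, J_perp

# Illustrative usage on a generic SPD matrix standing in for C:
rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
C = A @ A.T + 4 * np.eye(4)
Jp, Jt = rotated_couplings(C, 0.0)
# At alpha = 0 these coincide with the (xx) and (yy) entries of pinv(C).
```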
We remark that there is less information in $J^{\parallel}_{ij}$, $J^{\perp}_{ij}$ (for all $i, j$) than in the whole effective interaction matrix $J$ (since the oblique elements $J^{(xy)}_{ij}(\hat e_{ij})$ in the rotated bases are retained in neither $J^{\parallel}$ nor $J^{\perp}$). We now provide a clearer interpretation of the longitudinal and torsion effective interaction matrices. $J^{\parallel}_{ij}$, $J^{\perp}_{ij}$ capture the relative relevance of the fluctuations around the average distance $r_{ij}$, and of the angle fluctuations around $\alpha_{ij}$, respectively. Large values of $J^{\parallel}_{ij}$ imply that the distance between $i$ and $j$ in the direction of its average axis is highly "locked", i.e., it tends to exhibit small fluctuations, from sample to sample, around its most probable value.

[Figure 0.3: Mind that the $J$ matrix is such that negative matrix elements represent ferromagnetic couplings, or affine interactions. As a general trend, matrix $J$ is sparser than matrix $C$, as expected. The matrices $C$ and $J$ exhibit similar matrix elements, except in couples of coordinates involved in the same constraint. In the top row, the diagonal has been set to zero.]
Conversely, a fluctuation of the 6,7 segment distance $\delta^{\parallel}_{6,7}$ will give rise to a small or non-significant decrement of the probability of the resulting facial vector, since the longitudinal coupling constant $J^{\parallel}_{6,7}$ is small, a fact that highlights the prominent importance of the inter-landmark distance $r_{4,7}$ over $r_{6,7}$ in the process of facial discrimination. In the same way, fluctuations in the transversal components of both segments, $\delta^{\perp}_{4,7}$ and $\delta^{\perp}_{6,7}$ (and the consequent fluctuations of the inter-landmark segment angles around $\alpha_{4,7}$ and $\alpha_{6,7}$), have a strong impact on their probability of being sculpted (i.e., on their perceived attractiveness), since both torsion coupling constants $J^{\perp}_{4,7}$ and $J^{\perp}_{6,7}$ are large in absolute value.

Dependence of J on inter-landmark distances and angles
A different, interesting aspect of the matrix of effective interactions $J$ is the dependence of the interaction strengths among landmarks $i, j$ on their average distance $r_{ij}$ and average segment angle $\alpha_{ij}$. We stress that $r_{ij}$ and $\alpha_{ij}$ are meta-parameters, in the sense that they are not codified in the database $\mathcal{S}$ and, hence, are not inferred (the facial vectors $\Delta$ are actually fluctuations around the average single-landmark positions). From a cognitive point of view, one would expect the interaction strengths $|J^{(xx)}_{ij}|$, $|J^{(yy)}_{ij}|$, $|J^{(xy)}_{ij}|$ among couples of nearby landmarks to tend to be stronger for smaller values of $r_{ij}$ or, at most, not to present an increasing trend (which would mean that farther away landmarks influence each other more than closer landmarks). In turn, if the $ij$ coupling absolute value decreased with $\alpha_{ij}$, this would indicate the prominence of horizontal over vertical inter-landmark segments, and vice versa.
The data does not allow for sharp conclusions in this regard. However, although the absolute values of the $J$ matrix elements do not show a clear trend with $r_{ij}$ nor with $\alpha_{ij}$, some interesting information can be retrieved from this analysis. Indeed, a moderate decreasing trend is observed in $|J^{\parallel}_{ij}|$ vs. $r_{ij}$, signifying that nearer landmarks tend to influence each other more than farther away ones, but only along the inter-$ij$ landmark segment, in the sense that only the longitudinal coupling presents such a trend. Interestingly, the trend is lost when the $x$, $y$, $xy$ components of $J$ are plotted vs. $r_{ij}$. The absence of a clear trend with $\alpha_{ij}$ indicates a lack of prominent importance of horizontal versus vertical inter-landmark segments.
We show in figures 0.4, 0.5 the quantities $J^{\parallel}_{ij}$, $J^{\perp}_{ij}$ versus $r_{ij}$ and $\alpha_{ij}$, respectively (see the main article). Although no clear trend is observed, a moderate decreasing trend of $|J^{\parallel}_{ij}|$ versus $r_{ij}$ is apparent, as referred to in the main article, as well as a slight decreasing trend of $|J^{\perp}_{ij}|$ versus $\alpha_{ij}$. We notice that, in the notation of the article, negative values of $J$ indicate a tendency towards positive correlations (a ferromagnetic interaction, in the statistical-physical language).


The Harmonic inference in the limit $C^{(xy)} \ll C^{(xx)}, C^{(yy)}$

It can be shown that the solution of the inverse problem, at first order in the limit $C^{(xy)} \ll C^{(xx)}, C^{(yy)}$, is:

$J^{(xx)} \simeq (C^{(xx)})^{-1}, \qquad J^{(yy)} \simeq (C^{(yy)})^{-1},$

and that, in this limit, it is:

$J^{(xy)} \simeq -(C^{(xx)})^{-1}\, C^{(xy)}\, (C^{(yy)})^{-1}.$

The relative influence of oblique correlations may also be assessed by defining a simpler model, that we will call the null-xy model, consisting in neglecting oblique interaction terms (taking $J^{(xy)} = 0$). An even simpler model, that we will call the dot model, consists in neglecting oblique interactions and supposing that the couplings $J^{(xx)}$ and $J^{(yy)}$ are equal: $J^{(xx)} = J^{(yy)} = J$, $J^{(xy)} = 0$. In this case the probability distribution is simply:

$P(\Delta) \propto \exp\Big( -\frac{1}{2} \sum_{ij} J_{ij}\, \vec\Delta_i \cdot \vec\Delta_j \Big).$

We have assessed the efficiency of the dot and null-xy models by evaluating their performance in the classification task. As we show in section 0.16, neglecting the oblique correlations (in the null-xy model) and the anisotropy of vertical/horizontal correlations (in the dot model) leads to a poorer performance. This provides a quantitative assessment of the relative influence of these terms. We conclude that the influence of oblique correlations is crucial, and not negligible, in the facial perception process.
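The first-order inversion in the limit $C^{(xy)} \ll C^{(xx)}, C^{(yy)}$ can be checked numerically on random block matrices (illustrative numbers; the off-diagonal block prediction $J^{(xy)} \simeq -(C^{(xx)})^{-1} C^{(xy)} (C^{(yy)})^{-1}$ is the standard first-order block inverse):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random SPD diagonal blocks and a small off-diagonal block (size eps).
n, eps = 3, 1e-3
A = rng.normal(size=(n, n)); Cxx = A @ A.T + n * np.eye(n)
B = rng.normal(size=(n, n)); Cyy = B @ B.T + n * np.eye(n)
Cxy = eps * rng.normal(size=(n, n))

C = np.block([[Cxx, Cxy], [Cxy.T, Cyy]])
J = np.linalg.inv(C)                 # exact inverse problem

# First-order predictions for the three blocks of J = C^{-1}:
Jxx_1 = np.linalg.inv(Cxx)
Jyy_1 = np.linalg.inv(Cyy)
Jxy_1 = -Jxx_1 @ Cxy @ Jyy_1

# The exact blocks agree with the first-order ones up to O(eps^2) terms.
print(np.allclose(J[:n, :n], Jxx_1, atol=1e-4),
      np.allclose(J[:n, n:], Jxy_1, atol=1e-6))
```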

Two ways of inferring with constraints in the database of facial modifications
In section 0.2, we have exposed a method of MaxEnt inference (from pairwise correlations) from a database exhibiting linear constraints. Within this method, all the $D$ components of the vectors are considered, and inferred from, even though they are redundant. The resulting experimental correlation matrix $C$ is singular, as it exhibits $D - r$ null eigenvalues, each one corresponding to a constraint. However, the influence of the constraints on the inferred model is subtracted by defining a probability distribution in the subspace of the coordinates that are invariant under the linear operations associated to the constraints. Mathematically, this is done through the pseudo-inverse operation (see eq. 0.14), which discards the subspace spanned by the eigenvectors corresponding to null eigenvalues. The corresponding inferred probability distribution corresponds to a system which is invariant under changes of the constraint constants $c_j$, equation (0.9). An alternative method to infer $P$ avoiding the influence of the constraints consists in inferring only a subset of $r$ non-redundant, unconstrained variables, in terms of which the correlation matrix has rank equal to $r$. As mentioned before and in the main article, this method may lead to a matrix of effective interactions with a less clear interpretation. The $J_{ij}$ elements will in this case reflect the influence of the constraints on the considered $r$ variables. Conversely, with the null-mode subtraction method, the $J$ matrix represents a system which already satisfies the constraints (see section 0.4) and, for this reason, the $J_{ij}$ matrix elements do not reflect their influence on the data.
An illustration of these concepts is shown in the main article, where we compare matrices C and J. The null-mode subtraction method provides a matrix J which is actually sparser than matrix C. This does not occur when inferring from a reduced, non-redundant set of variables.
A further, particularly clear illustration is seen in terms of the inter-landmark distances $d = (d_i)_{i=0}^{10}$, an alternative parametrization of the facial vectors $\Delta$ (see the precise definition in [8] and in figure 0.2) in terms of 11 vertical or horizontal distances separating couples of landmarks. The function that maps a vector of inter-landmark distances $d$ into a vector of landmark coordinates $\Delta$ is one-to-one (and depends on some distances of the reference portrait). The distances $d$ are subject to a constraint reflecting the scale invariance of the problem [8]. First, we observe that the variables involved in the constraint result to be anticorrelated, $C^{(-k)}_{ij} < 0$ when both $i, j$ are in the set $\{1, 2, 3, 4\}$, a fact that may be attributed to the presence of the constraint (e.g., vectors with larger distances $d_1$ tend to exhibit lower $d_2$'s, since the constraint holds for all vectors). Consequently, an anti-ferromagnetic interaction, or a statistical tendency of variable $i$ to decrease when variable $j$ increases, is necessary in order that the theoretical distribution associated to $J^{(-k)}$ describes the statistics of the set of variables. Such a statistical tendency is on top of other statistical tendencies, of cognitive origin, not related to the constraint. In other words, the matrices $J^{(-k)}$ describe statistical tendencies in the data of two different origins: those associated to the constraint, and those of cognitive origin.
Second, in figure 0.8 we present the resulting effective interaction obtained with the null-mode subtraction method, using as $J$ the pseudo-inverse of matrix $C$. We observe that, interestingly, all the couplings $-J_{ij} > 0$ when both $i, j$ belong to the set $\{1, 2, 3, 4\}$, implying a ferromagnetic effective interaction in the physical language, or a positive tendency of $d_i$ to increase when $d_j$ increases. In the pseudo-inverse case, $J$ represents the statistical effective interaction with the influence of the constraint subtracted. We learn information of cognitive, "physical" origin from $J$ that was veiled in the matrices $J^{(-k)}$, influenced by the presence of the constraint. In particular, we learn that the experimental subjects tend to prefer higher eyes (higher $d_1$) in facial modifications with larger noses (larger $d_2$).
In conclusion, the method of null-mode removal that we have used in this work allows, in general, for a more faithful interpretation of the effective parameters with respect to the alternative method of inferring from a set of non-redundant variables. It is important to stress that the generative models obtained from $J$ and from $J^{(-k)}$ are expected to be equally faithful. In other words, the difference discussed in this section concerns only the interpretation of the $J_{ij}$ elements, and only in situations in which the parameters actually admit an interpretation, as is the case in the present problem and in other problems in biophysics and neurophysics. Indeed, the efficiency of both generative models as classifiers of the two groups of subjects (male/female, $\mathcal{S} = \mathcal{S}_A \cup \mathcal{S}_B$) turns out to be equivalent, see section 0.16.

Average proportions and pairwise correlations
Facial beauty has been related to proportions since the Renaissance [9,10], and most modern machine learning studies pose the problem in terms of proportions too [11,12,13,14,15,16]. In the main text we have explained that the dataset is faithfully described by a MaxEnt probability distribution $L(x|J,h)$, whose sufficient statistics is the matrix of pairwise correlations. We have also argued that, for a complete statistical description of the database of facial modifications, a model based on pairwise correlations is not enough. This implies that proportions, or ratios among facial distances, contain most of the information present in the database, although there is significant information, of cognitive origin, beyond proportions. We here justify this statement, noting that the information regarding facial proportions is codified in the matrix of correlations among couples of facial distances.
Consider two facial coordinates, r_α, r_β, referring to the x or y coordinates of two landmarks, say i and j. We will consider their ratio, r_α/r_β, which is the mathematical expression of a proportion. Calling r̄_α = ⟨r_α⟩ the experimental average value, one has r_α = r̄_α + Δ_α, by definition of the displacement Δ_α. The displacements around the average, Δ_α, are much smaller than the averages r̄_α, for all coordinates α (see [8]). This justifies a Taylor expansion of r_α/r_β for small Δ's. Indeed, to second order in the Δ's:

r_α/r_β = r̄_α/r̄_β + Δ_α/r̄_β − r̄_α Δ_β/r̄_β² − Δ_α Δ_β/r̄_β² + r̄_α Δ_β²/r̄_β³ + O(Δ³).

The experimental average of this expression, ⟨r_α/r_β⟩, is hence, up to an additive constant, equal to −(1/r̄_β²)⟨Δ_α Δ_β⟩ (having used that ⟨Δ_α⟩ = 0, and noting that the remaining term, (r̄_α/r̄_β³)⟨Δ_β²⟩, is itself a pairwise correlator). Hence, the average proportions are completely determined, in the case of small displacements, by the pairwise correlations.
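The second-order expansion above can be checked numerically against a Monte Carlo estimate of ⟨r_α/r_β⟩ for correlated Gaussian fluctuations around fixed means. A minimal sketch (function name ours; the specific means and covariance are illustrative):

```python
import numpy as np

def ratio_mean_expansion(mu_a, mu_b, cov):
    """Second-order estimate of <r_a / r_b> for small fluctuations:
    mu_a/mu_b - C_ab/mu_b^2 + mu_a * C_bb/mu_b^3,
    with C the covariance matrix of the displacements (D_a, D_b)."""
    return mu_a / mu_b - cov[0, 1] / mu_b**2 + mu_a * cov[1, 1] / mu_b**3
```

For small relative fluctuations, the discrepancy between this expression and the empirical mean of the ratio is of higher order in the displacement variance, which is exactly the sense in which average proportions are captured by pairwise correlations.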

Harmonic interactions and elastic constants
In the main article we have explained that the 2-MaxEnt model for vectors of facial distance displacements Δ may be interpreted as the Maxwell-Boltzmann distribution of a set of particles in 2D interacting through a set of three anisotropic, couple-dependent springs, in the canonical ensemble. We here justify this statement. We focus on a couple of landmarks, say i, j. We call x_i, x_j the components of the positions of landmarks i, j along two versors in the plane, ê^(i), ê^(j), respectively; in other words, x_i = r_i · ê^(i), and the same for j, in the notation of the main article. Given the coordinates x_i, x_j, and if ê^(i) = ê^(j), the quantity

δ_ij = (x_i − x_j) − (x̄_i − x̄_j)

is the change in the distance between i and j with respect to the average vector, along the common axis ê^(i). For example, if the versor is the vertical axis, δ_ij indicates the shift of the vertical distance between landmarks i and j with respect to their average distance. We define the elastic interaction energy as (1/2) k_ij δ_ij², which is minimum, and equal to zero, whenever the distance between i and j is unchanged with respect to the average (δ_ij = 0), regardless of the single-landmark displacements x_i, x_j. Writing δ_i = x_i − x̄_i, one has

(1/2) k_ij δ_ij² = (1/2) k_ij (δ_i − δ_j)² = −k_ij δ_i δ_j + (1/2) k_ij (δ_i² + δ_j²),

where the last term contributes single-landmark harmonic terms only. Hence the elastic interaction energy is, up to single-landmark terms, proportional to −δ_i δ_j k_ij, which is the form of the interaction energy in the pairwise Hamiltonian model, with the following relation between elastic constants and effective interaction matrix elements: k_ij = −(1/2) J_ij.
Fixing, for example, ê^(i) = ê^(j) = ê_ij, the versor joining the average positions of the two landmarks, ê_ij = (r̄_i − r̄_j)/|r̄_i − r̄_j|, we have k_ij^∥ = −(1/2) J_ij^∥, and likewise for the components of r perpendicular to ê_ij, and for the vertical, horizontal and oblique components of r.

Cognitive origin of non-linear correlations
In the experiments presented in [8], the subject sculpts her/his ideal facial modification through interaction with a software called FACEXPLORE, based on genetic and image deformation algorithms. The sculpture process consists of a sequence of multiple left/right choices among couples of facial images, eventually leading to an estimation of the subject's ideal modification. The genetic algorithm performs the recombination and mutation steps, while the single experimental subject plays the role of the selection step, through her/his choices.
The genetic algorithm used (called Differential Evolution) processes different coordinates independently (see the SI of [8]). The only correlation among coordinates is expected to be induced by the selection process, performed by the human subject. As a consequence, one would expect the only origin of correlations among coordinates in the populations sculpted by subjects (by the same or by different subjects) to be of cognitive nature.
In fact, this is not the case: part of the correlations that one observes experimentally is due to an artifact of the algorithm. In a null-model experiment with a random sequence of left/right choices, the resulting database exhibits significant non-linear correlations of order 2 and 3 among facial coordinates. The correlations of order three, ⟨Δ_i Δ_j Δ_k⟩, are statistically compatible with the third-order correlations observed in the human experiment [8].
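The third-order correlation tensor ⟨Δ_i Δ_j Δ_k⟩ referred to here is straightforward to estimate empirically. A minimal sketch (function name ours), assuming the displacement vectors are stacked in an array with one sample per row and already mean-subtracted:

```python
import numpy as np

def third_order_correlations(deltas):
    """Empirical tensor of third-order correlations <D_i D_j D_k> of
    mean-subtracted displacement vectors `deltas` of shape (S, n).
    Returns an (n, n, n) tensor, symmetric under index permutations."""
    return np.einsum('si,sj,sk->ijk', deltas, deltas, deltas) / len(deltas)
```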
The solution of this paradox is that, while the genetic algorithm does not introduce correlations in the recombination and mutation steps, it may amplify the correlations among facial coordinates which are present in the initial condition of the null-model genetic population of facial vectors. Such initial populations are trivially correlated, since some constraints were imposed in the definition of the face space: mainly Σ_{i=1}^{4} d_i = 1 and d_10 < d_5 (see section 0.8 and figure 0.2). In reference [8] we proposed a method to "subtract" the influence of the a priori, non-cognitive, artifact correlations present in the null-model experiment from the true, cognitive correlations that we observed. The method revealed that the artifact pairwise correlations did not have a significant impact on the results. We suspect that correlations of higher order may, instead, be significantly influenced by the artifact effect.
In the main article text, we have explained that non-linear inference algorithms allow for a much better classification of the database according to the gender of the experimental subject. This fact implies that, quite interestingly, the differences between facial vectors sculpted by males and by females are encoded in non-Gaussian correlations, beyond proportions (p = 2) and involving triplets and perhaps quadruplets of facial distances. In the main article, we have also explained that this may imply that such differences codified in non-Gaussian correlations are of cognitive origin, i.e., that male and female subjects do prefer facial variations differing in non-Gaussian correlations and, in particular, that humans evaluate quantities that are much more complicated than proportions when forming an impression of a face. The fact that the introduction of non-linearity helps in a gender classification task, which reflects real and well-known cognitive differences, suggests so.
An alternative explanation is that the distinguishable differences between male and female preferred facial variations are all codified in the pairwise effective interactions only (roughly speaking, that males and females differ only in the J matrix, if it could be measured without bias). The non-linear interactions would nevertheless be relevant for the classification, since the correlations propagated by the genetic algorithm are coupled to those induced by the subject: subjects differing only in J would also induce, by means of the artifact, differences in the correlations of higher order.
Further experiments are needed to clarify this issue.

Generality of the unsupervised inference models
Crucially, the two models of unsupervised inference presented in the main text exhibit a wide generality, going beyond the particular database that we infer in this work. (1) First, the inferred set of facial vectors may be composed of facial images selected according to any criterion: selected by a pool of subjects among real facial images, or by a single individual (in which case the distribution L(·|θ) would probabilistically characterise the single subject's preferred region in face space); selected according to a criterion different from attractiveness; or even not selected by subjects at all, but chosen according to some objective criterion such as age or gender (L(f|θ) would then represent the probability that a facial image characterised by the facial vector f presents the desired feature). (2) Second, they can be used to infer any other database of images characterised by the geometric positions of facial (or, in general, body) landmarks. (3) Third, these models may be immediately extended to process non-geometric degrees of freedom as well (treating the texture and geometric degrees of freedom on the same footing [17]). In the case of the 3-MaxEnt model, we have estimated the maximum-likelihood value of the parameters θ* by numerical maximisation of the joint database likelihood through deterministic gradient ascent (see also [18]).
At a given epoch t of the gradient ascent iteration, the gradient of the joint likelihood with respect to a component of the effective interaction involves a theoretical correlator (of order 1, 2 or 3) computed at the current value of the couplings. For instance:

∂_{J_αβ} ln L(S|θ)|_{θ(t)} = ⟨Δ_α Δ_β⟩_{L(·|θ(t))} − ⟨Δ_α Δ_β⟩.

Such a theoretical correlator is in its turn estimated by means of a Markov Chain Monte Carlo (MCMC) Metropolis algorithm sampling configurations from the theoretical distribution at the corresponding epoch, L(·|θ(t)). For this MCMC algorithm, we use 10^6 sweeps in each epoch. The MCMC vectors Δ are initialised as normal variables with variance equal to their empirical variance, and the Metropolis trials are chosen uniformly in the interval Δ_α ∈ [−6σ_α, 6σ_α], where σ_α is the empirical standard deviation of Δ_α. This corresponds to maximising the likelihood function

L(Δ|θ) = (1/Z(θ)) e^{−H(Δ|θ)} Θ_B(Δ),

where Θ_B is the multivariate Heaviside (indicator) function of the hyper-parallelepiped B centered at the origin Δ = 0 and with side lengths 2B_µ = 12σ_µ, and where H(·|θ) = H_2(·|θ_2) + H_3(·|Q). Finally, in order to monitor convergence to the stationary state, at every step of the gradient ascent maximisation we evaluate the difference ΔH(t) between the empirical average of the Hamiltonian and its expected value according to L(·|θ); the iteration stops when |ΔH(t)| decreases below a certain threshold [19,18]. The condition ΔH = 0 is necessary for the maximum-likelihood condition. As initial values of the learning dynamics, we choose h = 0, J = I_D, Q = 0. The value B_µ = 6σ_µ has been chosen so that all the empirical vectors lie in the parallelepiped B and, at the same time, the maximum-likelihood probability distribution exhibits a single local maximum in its support B.
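The Metropolis sampling and the ascent step on J can be sketched as follows. This is a heavily simplified illustration, not the article's production code: function names are ours, the sign and normalisation conventions of the Hamiltonian are assumed (H = H_2 + H_3, distribution ∝ e^{−H} restricted to the box), the chain is started at the origin rather than from normal variables, and the number of sweeps is tiny compared with the 10^6 quoted above.

```python
import numpy as np

def hamiltonian(d, h, J, Q):
    # H(d) = H2(d) + H3(d); sign/normalisation conventions are illustrative.
    return -h @ d + 0.5 * d @ J @ d - np.einsum('ijk,i,j,k->', Q, d, d, d)

def metropolis_samples(h, J, Q, B, n_samples, rng, thin=10):
    """Metropolis sampling of L(d) ~ exp(-H(d)) restricted to the box
    |d_mu| < B_mu.  Trial moves update one coordinate at a time, chosen
    uniformly in [-B_mu, B_mu] as described in the text; since trials
    never leave the box, the Heaviside restriction is automatic."""
    n = len(h)
    d = np.zeros(n)                      # simplified initialisation
    E = hamiltonian(d, h, J, Q)
    out = []
    for step in range(n_samples * thin):
        a = rng.integers(n)
        trial = d.copy()
        trial[a] = rng.uniform(-B[a], B[a])
        E_t = hamiltonian(trial, h, J, Q)
        if np.log(rng.random()) < E - E_t:   # accept with prob min(1, e^{E-E_t})
            d, E = trial, E_t
        if step % thin == thin - 1:
            out.append(d.copy())
    return np.array(out)

def gradient_step_J(J, data_corr, model_samples, lr):
    """One ascent step on J, using the correlator mismatch
    <D_a D_b>_model - <D_a D_b>_data quoted in the text."""
    model_corr = model_samples.T @ model_samples / len(model_samples)
    return J + lr * (model_corr - data_corr)
```

With Q = 0 and J equal to the identity, the target reduces to a (truncated) standard normal, which provides a quick sanity check of the sampler.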
Indeed, the probability distribution projected along the data principal components (Δ_µ, i.e., the projection of Δ on the µ-th eigenvector of J) is qualitatively a perturbed normal distribution with variance σ_µ² (recovered for Q = 0), with asymmetric and fatter tails (see [18]).

Learning the database with the Gaussian Restricted Boltzmann Machine
Definition of the model. The Gaussian Restricted Boltzmann Machine (GRBM) is a type of generative stochastic two-layered Artificial Neural Network [20,21,22,23]. It is a generalisation of the Restricted Boltzmann Machine (RBM) [24] that learns a probabilistic generative model for real-valued vectors: the visible neurons in the input layer, v, assume real values, while the hidden neurons are binary, h_j = 0, 1. The state of the neurons is described by an energy-based probability density

p(v, h|θ) = (1/Z_θ) e^{−E(v,h|θ)}

in terms of the parameters θ = {W, b, c, σ}, to be inferred in the learning process. W is a real N_v × N_h matrix coupling visible and hidden variables; c and σ are N_v-dimensional real vectors representing the bias over the visible neurons and their standard deviation, respectively; b is a real N_h-dimensional vector representing the bias over the hidden neurons; and Z_θ is a normalising constant depending on the parameters. The energy function E is defined so that the conditional probability distribution p(v|h, θ) results in a normal, independent distribution over the visible variables. It assumes the form

E(v, h|θ) = Σ_i (v_i − c_i)²/(2σ_i²) − Σ_j b_j h_j − Σ_{i,j} (v_i/σ_i²) W_ij h_j.

The GRBM probabilistic generative model is obtained through marginalisation of the hidden variables: p(v|θ) = Σ_h p(v, h|θ). This model (as long as the hidden neurons are binary) is known to induce non-linear interactions among the visible variables, up to order p = N_v in the most general case [25].
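With the energy written in the convention above (which is one common GRBM parameterisation; the article does not spell it out, so it is an assumption here), the two conditional distributions factorise and can be sketched directly. Function names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b, sigma):
    """P(h_j = 1 | v) for a GRBM with energy
    E = sum_i (v_i-c_i)^2/(2 s_i^2) - sum_j b_j h_j - sum_ij (v_i/s_i^2) W_ij h_j
    (an assumed, common convention)."""
    return sigmoid(b + (v / sigma**2) @ W)

def sample_v_given_h(h, W, c, sigma, rng):
    """p(v | h) is an independent normal over visible units,
    with mean c + W h and standard deviation sigma."""
    return rng.normal(c + W @ h, sigma)
```

These two conditionals are exactly what block Gibbs sampling (and hence contrastive divergence) alternates between during learning.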
Learning protocol. We have trained the model over sets of redundant or non-redundant data, obtaining equivalent results. As a learning algorithm we have used gradient ascent through persistent contrastive divergence with k = 1 Monte Carlo steps, along with mini-batch learning with batch size B [21] and an epoch-dependent learning rate η decreasing linearly with the number of epochs. We have set the values of the learning hyperparameters to: number of steps n_s = 2·10^5, batch size B = 200, momentum µ = 0, initial learning rate η_0 = 2·10^−3. The learning-rate slope is set such that η_{n_s} = 2·10^−5. The parameters W, b and c are initialized following a standard procedure [26,27]. As an equilibration test, we have verified that the test-set joint likelihood is stationary as a function of the number of epochs, within its associated standard deviation. We have also assessed the algorithm efficiency as a function of the number of hidden neurons, N_h. As shown in figure 0.9, both the test- and training-set joint likelihoods exhibit a monotonically increasing behaviour vs N_h, showing no sign of severe overfitting. The auROC score saturates at its maximum value for N_h ≳ 8N_v, with N_v = 10, confirming this picture. We have consequently considered, for the analysis performed in the main article, N_h = 100. Before GRBM learning, the data have been pre-processed by eliminating 6 redundant coordinates, subtracting the average (of the whole database, not of the {A, B}×{train, test} datasets separately) and standardising, i.e., dividing each vector component-wise by the vector of standard deviations. Afterwards, for the sake of the classification, the model has been trained over the A and B training databases separately, leading to two sets of parameters θ_A, θ_B and, consequently, to a likelihood function L_RBM(·|θ_A,B).
Afterwards, the score s(r) = ln L(r|θ_A) − ln L(r|θ_B) is defined for every standardised, non-redundant vector r of the A, B test sets. This score is used to construct the ROC curves and scores shown in the main article.
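Given the scores s(r) for the test vectors, the auROC can be computed directly from the Mann-Whitney rank statistic, without tracing the ROC curve explicitly. A minimal sketch (function name ours; tied scores are not handled):

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic:
    the probability that a randomly chosen positive example scores
    higher than a randomly chosen negative one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Here `scores` would be the per-vector values of s(r) and `labels` the group membership (1 for A, 0 for B).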

Classification with the Random Forest algorithm
In the random forest classification presented in the main article and in figure 0.10, we have used the Random Forest Classifier [28,29], with 1000 trees created from bootstrapped sub-samples and with nodes expanded until all leaves are pure. As an assessment of single-split quality we have considered the Gini function. The number of random features considered in the best-split choice is equal to int(sqrt(D)), where D = 16 is the number of features.
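The stated configuration can be written, for instance, with scikit-learn's `RandomForestClassifier` (a sketch under the assumption that this is the implementation meant by [28,29]; the wrapper function name is ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_forest():
    """Random forest configured as described in the text: 1000 trees grown
    on bootstrapped sub-samples, nodes expanded until all leaves are pure,
    Gini impurity for split quality, int(sqrt(D)) features per split."""
    return RandomForestClassifier(
        n_estimators=1000,
        criterion="gini",
        max_depth=None,        # grow until all leaves are pure
        max_features="sqrt",   # int(sqrt(D)) random features per split
        bootstrap=True,
        random_state=0,
    )
```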

Detailed comparison among several classification methods
We now present a more detailed analysis of all the classification algorithms that we have considered for the classification of the database according to the subjects' gender. In table 0.1 we present a systematic comparison of the auROC value [30], a standard estimator of classification accuracy (the area under the corresponding ROC curves in figure 0.10), associated with the classification according to the various algorithms. In particular: 2-MaxEnt approximated is the 2-MaxEnt model resulting from the approximation in equation (0.23); 2-MaxEnt null-xy is the model obtained by neglecting the oblique interactions, J^(xy) = 0; 2-MaxEnt dot is defined in equation (0.27); 1-MaxEnt dot is defined by inferring the external fields only (and taking the interaction matrix J, required for the normalisation of P, as a diagonal matrix whose diagonal is equal to the inverse variance of each variable).
The results of table 0.1 and of figure 0.10 confirm the picture presented in the main article. The values of the single facial distances (in units of the facial length) are not enough for an accurate description of the database of facial modifications. The introduction of pairwise effective interactions, which account for proportions, or ratios of facial coordinates, induces a notable improvement in the statistical description. Moreover, oblique effective interactions (coupling the x coordinate of one landmark with the y coordinate of another landmark) prove to be a fundamental ingredient. Finally, a crucial role, at least for the sake of the classification according to the subjects' gender, is played by effective interactions of higher order: p = 3 (3-MaxEnt) and p > 3 (GRBM and random forest). We conclude that the classification is a valid method for the assessment of the relative relevance of the various terms.
Remarkably, and as we anticipated in section 0.8, the algorithm used to avoid the constraints (inferring from a reduced, non-redundant set of variables, or using the null-mode subtraction method) does not change the efficiency of the classification. Indeed, the model that we call 2-MaxEnt non-redundant in table 0.1 and in figure 0.10 is identical to 2-MaxEnt, but in terms of a subset of 10 non-redundant variables. Its auROC estimator and ROC curve are statistically indistinguishable from those of 2-MaxEnt (with 16 variables and null-mode subtraction).

Inter- and intra-subject correlations and errors
The facial vectors sculpted by a single subject are not the result of independent sculpting experiments. They are, rather, correlated, insofar as they are the outcome of a genetic population of facial vectors that evolved according to a stochastic evolutionary algorithm coupled to the sequence of choices performed by the experimental subject [8]. Consequently, it is crucial to separate the effect of intra-subject (or intra-genetic-population) correlations among facial vector components from the inter-subject correlations. On the one hand, one may define the bare correlation matrix, accounting for both sources of correlation, defined by summing over both subject and population indices: C_αβ = ⟨Δ_α Δ_β⟩. On the other hand, the inter-subject correlation matrix accounts only for the inter-subject correlation, and is defined as

C̃_αβ = overline[ (1/n_s) Σ_{v'=1}^{n_s} Δ_α^(v(v'), i(v')) Δ_β^(v(v'), i(v')) ],

where v(v') and i(v') are random indices in the sets 1, ..., n_s and 1, ..., N respectively, uncorrelated among themselves and of v', and the overline denotes an average over a sufficiently high number of realisations of the set of indices v(v'), i(v') for v' = 1, ..., n_s. The statistical uncertainty associated with the inter-subject correlation, σ_{C̃_αβ}, is the standard deviation of the overlined argument under many realisations of the set of indices (in other words, a bootstrap error using only one vector per subject in each bootstrap sampling; see the SI of ref. [8]). Consequently, the error associated with the inter-subject correlation is of order ∼ n_s^{−1/2}, and not of order ∼ S^{−1/2} as that of the bare correlation matrix. Analogously, we also define inter-subject and bare 3-component correlations.
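The bootstrap procedure just described can be sketched as follows, assuming the mean-subtracted displacements are stored in an array of shape (n_s, N, n), i.e., n_s subjects with N vectors each (function name ours):

```python
import numpy as np

def inter_subject_correlation(deltas, n_boot, rng):
    """Inter-subject correlation matrix and its bootstrap error.
    Each bootstrap realisation draws n_s pairs of random indices
    (subject v(v'), vector i(v')), i.e. effectively one vector per draw,
    and averages the outer product D D^T over the n_s draws."""
    n_s, N, n = deltas.shape
    boots = np.empty((n_boot, n, n))
    for b in range(n_boot):
        v = rng.integers(n_s, size=n_s)  # random subject index v(v')
        i = rng.integers(N, size=n_s)    # random vector index i(v')
        x = deltas[v, i]                 # shape (n_s, n)
        boots[b] = x.T @ x / n_s
    # overline average, and its standard deviation as the error estimate
    return boots.mean(axis=0), boots.std(axis=0)
```

Because each realisation uses only n_s effectively independent vectors, the per-element error scales as n_s^{−1/2}, as stated above.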
If the inferred model is meant to describe the probability of a given facial vector having been selected by any subject in the database, then it should reproduce by construction the inter-subject (not the bare) correlations. Otherwise, the probabilistic generative models may simply describe the whole set of facial vectors in the experiments of [8], hence accounting also for the intra-subject correlations; the corresponding MaxEnt models would reproduce by construction the bare correlations of order 2 or 3 in this case. In our data analysis software one can specify whether the 2-, 3-MaxEnt inferred models reproduce bare or inter-subject correlations. In this article, some results (the reproduction of angle histograms and the analysis of the J matrices) correspond to the inter-subject inference models. The classification tests have, instead, been done with the bare models: for the sake of classification, we have simply tested the ability of the algorithm to capture any useful correlation, regardless of its origin, cognitive or algorithmic. The bare inference models suffer less from the curse of dimensionality, since the effective database size is S = N n_s instead of n_s.

Description of the E1 experiment
We now recall a description of the experiment E1 presented in [8].
The face-space. The face-space used in [8] is based on a separation of texture and geometric degrees of freedom [14]. The face-space facial vectors codify the geometric degrees of freedom only (corresponding to the Cartesian coordinates r of the set of landmarks), while the texture degrees of freedom correspond to a real portrait called the reference portrait. Given a vector r and the reference portrait, we construct, through image deformation algorithms (based on affine linear transformations), a novel image whose landmarks occupy the positions given by r, and whose texture is coherently and realistically deformed from that of the reference portrait. The reference portrait used in the E1 experiment, called RP1, is fixed for all the subjects.
The aim of the experiment is to provide a population of N facial vectors for each experimental subject. Such a population aims to be an empirical sample of the subject's preferred modifications of the reference portrait. This means that the subject probabilistically prefers facial images associated with vectors that are close to the vectors in the population, rather than local fluctuations away from it. The subject does not sculpt the population by successive discrimination among faces differing in a single coordinate, which turns out to be an inefficient strategy of face-space exploration, but rather through interaction with a genetic algorithm, implemented by our software FACEXPLORE. In such a (differential evolution) genetic algorithm, the individuals are the N facial vectors, their genetic code is the corresponding vector of geometric coordinates, and the selection process is performed by the subject, who iteratively selects, according to her/his personal criterion, which facial vectors will survive in the next generation.
General description. The subjects sculpt their preferred variation of a facial vector, which codifies geometric coordinates. Starting from N initial random facial vectors, the FACEXPLORE software generates pairs of facial images (composed of an original facial vector and a potential offspring generated by mutation and recombination). Each pair is presented to the subject, who selects the one that she/he finds more attractive. Based on N left/right choices, the genetic algorithm produces a successive generation of N vectors. This process is repeated T times, in a constant feedback loop of offspring generation and selection operated by the subject. These iterations lead to a sequence of T generations of facial vectors, each one more adapted than the last to the subject's selection criteria, eventually converging to a pseudo-stationary regime in which consecutive generations are similar to one another. The T-th generation is taken as the empirical sample of the subject's preferred modifications of the reference portrait.
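The generation-selection loop can be sketched as a standard differential-evolution step in which the usual fitness comparison is replaced by a `choose(parent, trial)` callback, standing in for the subject's left/right choice. This is an illustrative sketch only: the mutation factor F, the crossover rate CR and the DE variant are generic defaults, not the settings of FACEXPLORE.

```python
import numpy as np

def de_generation(pop, choose, F=0.5, CR=0.9, rng=None):
    """One generation of a differential-evolution loop with external
    selection.  For each facial vector, a trial offspring is built by
    mutation (a + F*(b - c)) and per-coordinate crossover -- note that
    coordinates are processed independently -- then choose(parent, trial)
    (the subject's choice in the experiment) decides which one survives."""
    rng = rng or np.random.default_rng()
    N, n = pop.shape
    new_pop = pop.copy()
    for k in range(N):
        a, b, c = pop[rng.choice(N, size=3, replace=False)]
        mutant = a + F * (b - c)
        cross = rng.random(n) < CR                 # per-coordinate crossover
        trial = np.where(cross, mutant, pop[k])
        new_pop[k] = trial if choose(pop[k], trial) else pop[k]
    return new_pop
```

Replacing the human choice by a deterministic criterion (e.g. proximity to a fixed target vector) turns the loop into an ordinary optimiser, which is essentially the null-model setting discussed earlier.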
Some details of the experimental setup. The experiment was performed by a pool of n_s = 95 volunteers (54 female, 39 male; age average and standard deviation: 26(12)), mainly students, researchers and professors of the Sapienza University in Rome. Each subject performed N·T = 280 choices among couples of facial images. These are uncompressed 400 × 300 pixel B/W images at 72 pix/inch resolution on a 1024 × 768 monitor. The viewing distance is 65(10) cm. The reference portrait RP1 has been taken from the Chicago face database [31]. The experiment lasted roughly 25 minutes on average.