Shape-invariant encoding of dynamic primate facial expressions in human perception

Dynamic facial expressions are crucial for communication in primates. Due to the difficulty to control shape and dynamics of facial expressions across species, it is unknown how species-specific facial expressions are perceptually encoded and interact with the representation of facial shape. While popular neural network models predict a joint encoding of facial shape and dynamics, the neuromuscular control of faces evolved more slowly than facial shape, suggesting a separate encoding. To investigate these alternative hypotheses, we developed photo-realistic human and monkey heads that were animated with motion capture data from monkeys and humans. Exact control of expression dynamics was accomplished by a Bayesian machine-learning technique. Consistent with our hypothesis, we found that human observers learned cross-species expressions very quickly, where face dynamics was represented largely independently of facial shape. This result supports the co-evolution of the visual processing and motor control of facial expressions, while it challenges appearance-based neural network theories of dynamic expression recognition.


Introduction
Facial expressions are crucial for social communication of human as well as non-human primates (Calder, 2011;Darwin, 1872;Jack and Schyns, 2017;Curio et al., 2010), and humans can learn facial expressions even of other species (Nagasawa et al., 2015). While facial expressions in everyday life are dynamic, specifically, expression recognition across different species has been studied mainly using static pictures of faces (Campbell et al., 1997;Dahl et al., 2013;Sigala et al., 2011;Guo et al., 2019;Dahl et al., 2009). A few studies have compared the perception of human and monkey expressions using movie stimuli, finding overlaps in the brain activation patterns induced by within-and cross-species expression observation in humans as well as monkeys (Zhu et al., 2013;Polosecki et al., 2013). Since natural video stimuli provide no accurate control of the dynamics and form features of facial expressions, it is unknown how expression dynamics is perceptually encoded across different primate species and how it interacts with the representation of facial shape.
In primate phylogenesis, the visual processing of dynamic facial expressions has co-evolved with the neuromuscular control of faces (Schmidt and Cohn, 2001). Remarkably, the structure and arrangement of facial muscles is highly similar across different primate species (Vick et al., 2007;Parr et al., 2010), while face shapes differ considerably, for example, between humans, apes, and monkeys. This motivates the following two hypotheses: (1) The phylogenetic continuity in motor control should facilitate fast learning of dynamic expressions across primate species and (2) the different speeds of the phylogenetic development of the facial shape and its motor control should potentially imply a separate visual encoding of expression dynamics and basic face shape. The second hypothesis seems consistent with a variety of data in functional imaging, which suggests a partial separation of the anatomical structures processing changeable and non-changeable aspects of faces (Haxby et al., 2000;Bernstein and Yovel, 2015).
We investigated these hypotheses, exploiting advanced methods from computer animation and machine learning, combined with motion capture in monkeys and humans. We designed highly realistic three-dimensional (3D) human and monkey avatar heads by combining structural information derived from 3D scans, multi-layer texture models for the reflectance properties of the skin, and hair animation. Expression dynamics was derived from motion capture recordings on monkeys and humans, exploiting a hierarchical generative Bayesian model to generate a continuous motion style space. This space includes continuous interpolations between two expression types ('anger' vs. 'fear'), and human-and monkey-specific motions. Human observers categorized these dynamic expressions, presented on the human or the monkey head model, in terms of the perceived expression type and species-specificity of the motion (human vs. monkey expression).
Consistent with our hypotheses, we found very fast cross-species learning of expression dynamics with a typically narrower tuning for human-compared to monkey-specific expressions. Most importantly, the perceptual categorization of expression dynamics was largely independent of the facial shape (human vs. monkey). In particular, the accuracy of the categorization of species-specific dynamic facial expressions did not show a dependence on whether the species-specific expressive motion and the avatar species were matching (e.g., monkey expressions being recognized more accurately on a monkey avatar). Our results were highly robust against substantial variations in the expressive stimulus features. They specify fundamental constraints for the computational neural mechanisms of dynamic face processing and challenge popular neural network models, accounting for expression recognition by the learning of sequences of key shapes (e.g. Curio et al., 2010).

Results
In this section, we briefly sketch the methodology of our experiments; whereas many other important details can be found in 'Materials and methods' section and 'Appendix 1'. Then, we describe in detail the results of the three main experiments, which we realized (further control experiments are described in 'Appendix 1').
Our studies investigated the perceptual representations of dynamic human and monkey facial expressions in human observers, exploiting photo-realistic human and monkey face avatars ( Figure 1A). The motion of the avatars was generated exploiting motion capture data of both primate species ( Figure 1B), which were used to compute the corresponding deformation of the surface mesh of the face, exploiting a model based on elastic ribbon structures that were modeled after the main facial muscles of humans and monkeys ( Figure 1C and Appendix 1).
In order to realize a full parametric control of motion style, we exploited a Bayesian motion morphing technique ('Materials and methods') to create a continuous expression space that smoothly interpolates between human and monkey expressions. We used two human expressions and two monkey expressions as basic patterns, which represented corresponding emotional states ('fear' and 'anger/threat'). Interpolating between these four prototypical motions in five equidistant steps, we generated a set of 25 facial movements that vary in five steps along two dimensions, the expression type, and the species, as illustrated in Figure 1D. Each generated motion pattern can be parameterized by a two-dimensional style vector (e, s), where the first component e specifies the expression type (e ¼ 0: expression 1 ('fear') and e ¼ 1: expression 2 ('anger/threat')), and where the second variable s defines the species-specificity of the motion (s ¼ 0: monkey and s ¼ 1: human). The dynamic expressions were used to animate a highly realistic monkey as well as a human avatar model (generation; 'Materials and methods'). In order to vary the two-dimensional stimulus features, we rendered the avatars from two different view angles: from the front view and from the view that was rotated by 30 degrees about the vertical axis. This rotated view maximized the differences of the two-dimensional appearance relative to the front view, while avoiding strong salient changes, such as occlusions of face parts. The following sections describe the results of the three main experiments of our study.

Dynamic expression perception is largely independent of facial shape
In our first experiment, we used the original dynamic expressions of humans and monkeys as prototypes and presented morphs between them, separately, on the human and the monkey avatar faces, with two different view angles (0 and 30 degrees rotation about the vertical axis). Facial movements of humans and monkeys are quite different (Vick et al., 2007), so that our participants, all of whom had no prior experience with macaque monkeys, needed to be familiarized briefly with the monkey  Regularized face mesh, whose deformation is controlled by an embedded elastic ribbon-like control structure that is optimized for animation. (D) Stimulus consisting of 25 motion patterns, spanning up a two-dimensional style space with the dimensions 'expression' and 'species', generated by interpolation between two expressions ('anger/threat' and 'fear') and the two species ('monkey' and 'human'). Each motion pattern was used to animate a monkey and a human avatar model. expressions prior to the main experiment. During the familiarization, participants learned to recognize the four prototypical expressions perfectly, always with maximally four stimulus repetitions. During the main experiment, motions were presented in a block-randomized order, and in separate blocks for the two avatars and for the two tested views. The expression movies with a duration of 5 s showed the face going from a neutral expression to the extreme expression and back to neutral ( Figure 1A). Participants observed 10 repetitions of each stimulus. They had to decide whether the observed stimulus was looking more like a human or a monkey expression (independent of the avatar shape and view), and whether the expression was rather 'anger/threat' or 'fear'. The resulting two binary responses in each trial can be interpreted as assignment of one out of four classes to the perceived expression of the stimulus, independent of avatar type and view (1: human-angry, 2: human-fear, 3: monkey-threat, and 4: monkey-fear). Figure 2A shows the raw classification data as histograms of the relative frequencies of the four classesĈ i e; s ð Þ, as a function of the style parameters e and s for the four tested classes. The class probabilities P i e; s ð Þ were modeled by a logistic multinomial regression model ('Materials and methods'), resulting in the fitted discriminant functions that are shown in Figure 2B for the different classes. Comparing regression models with different sets of predictor variables, we found that in almost all cases, a model of the form that contains the two style variables for expression (e) and the species (s) as predictors (in addition to a constant predictor) was the simplest model that provided good fits of the data. Figure 2C shows the prediction accuracy of regression models 0 0 with different sets of predictors for the monkey avatar stimulus (data from the other conditions are presented in Appendix 1). The different models were compared quantitatively using prediction accuracy and the Bayesian Information Criterion (BIC). Specifically, a model that also included the product eÁs did not provide significantly better prediction results, except for a very small improvement of the prediction accuracy for the rotated view conditions. Models only including one of the predictors, e or s, provided significantly worse fits. Likewise, models that contained the average amount of optic flow as the additional predictor did not result in higher prediction accuracy (see Appendix 1 for details.). This implies that simple motion features, such as the amount of optic flow, do not provide a trivial explanation of our results. Summarizing, both style variables e and s are necessary as predictors, and there is no strong interaction between them. This motivated us to use the model with these two predictor variables for our further analyses. Figure 3A shows a comparison of all fitted discriminant functions, shown separately for the two avatar types and for the two tested view conditions. These functions show the predicted class probabilities as functions of the two style parameters e and s. The form of these discriminant functions is highly similar between the two avatar types and also between the view conditions. This is confirmed by the fact that the fraction of the variance that is different between these functions divided by the one that is shared between them does not exceed 3% (q ¼ 2:75%; see 'Materials and methods'). The same conclusion is also supported by a comparison of the multinomially distributed classification responses using a contingency table analysis (see 'Materials and methods'), across the four conditions (avatar types and views), separately for the different points in morphing space and across participants. This analysis revealed that only for three stimuli (12%) of the style space, the classification responses were significantly different (p ¼ 0:02, Bonferroni-corrected   Figure 1D (i = 1: human-angry, 2: human-fear, 3: monkey-threat, 4: monkey-fear). (A) Results for the stimuli generated using original motion-captured expressions of humans and monkeys as prototypes, for presentation on a monkey and a human avatar. (B) Significance levels (Bonferroni-corrected) of the differences between the multinomially distributed classification responses for the 25 motion patterns, presented on the monkey and human avatar.
with high perceptual ambiguity ( Figure 3B). This result implies that primate facial expressions are perceptually encoded largely independently of the head shape (human vs. monkey) and of the stimulus view. Especially, this implies substantial independence of this encoding of the two-dimensional image features, which vary substantially between the view conditions, and even more between the human and the monkey avatar model. This observed independence might also explain why many of our subjects were able to recognize human facial expressions on the monkey avatar face, even without any familiarization. This matches the common experience that humans can recognize dynamic facial expressions spontaneously even from non-human comic figures, which often are highly unnatural.
Tuning is narrower for human-specific than for monkey-specific dynamic expressions A biologically important question is whether expressions of one's own species are processed differently from those of other primate species, potentially supporting an own-species advantage in the processing of dynamic facial expressions (Dahl et al., 2014). In order to characterize the tuning of the perceptual representation for monkey vs. human expressions, we computed tuning functions, by marginalizing the discriminant functions belonging to the same species category (P 1 and P 2 belonging to the human, and P 3 and P 4 to the monkey expressions) over the expression dimension e (see 'Materials and methods' for details). Figure 4A shows the resulting two species-tuning functions D H (s) and D M (s), revealing a smaller tuning width for the human than for the monkey expressions for all stimulus types, except for the 30 degrees rotated human condition. The fitted threshold values are given by the conditions D M s th ð Þ; D H s th ð Þ ¼ 0:5 and are shown in Figure 4B for the monkey-and the human-specific motion (solid vs. dashed lines). This observation is confirmed by computing the threshold values of the tuning functions by fitting them with a sigmoidal function (see 'Materials and methods'). Comparing the threshold values by running separate ANOVAs for the four stimulus types (monkey and human front view, and monkey and human rotated view), we found significantly narrower tuning for the human than for the monkey expression for all tested conditions, except for the human avatar in the 30 degrees condition. These two-way mixedmodel ANOVAs include the expression type (human vs. monkey motion) as within-subject factor and the stimulus type (original motion, stimuli with occluded ears, or animated with equilibrated motion; see below) as between-subject factor. The ANOVAs reveal a strong effect of the expression type (F . Summarizing, there is a strong tendency of the species-specific expression tuning to be narrower for the human 'own-species' expressions, while this tendency is not as prominent in rotated views.

Robustness of results against variations of species-specific features
One may ask whether the previous observations are robust with respect to variations of the chosen stimuli. For example, monkey facial movements include species-specific features, such as ear motion, that are not present in human expressions. Do the observed differences between the recognition of human and monkey expressions depend on these features? We investigated this question by repeating the original experiment, presenting only the front view, with a new set of participants, using stimuli for which the ear region was occluded. Figure 5A depicts the corresponding fitted discriminant functions, which are quite similar to the ones without occlusion, characterized again by a high similarity in shape between the human and monkey avatar (ratio of different vs. shared variance: q ¼ 1:44%; only 12% of the categorization responses over the 25 points in morphing space were significantly different between the two avatar types; p ¼ 0:02; Figure 5B). Figure 4A also shows that the corresponding tuning functions D M and D H are very similar to the ones for the non-occluded stimuli, and the associated threshold values ( Figure 4B) are not significantly different from the one for non-occluded stimuli (see 'ANOVA analysis'). Summarizing, the elimination of ear motion as a monkeyspecific feature did not have a major influence on the main results of the original experiment.

Robustness against variations of expressivity
A further possible concern might be that the chosen prototypical expressions might specify different amounts of salient low-level features, for example, due to species differences in the motion, or because of differences between the anatomies of the human and the monkey face. In order to control for the influence of such expressive low-level information, we re-ran the main conditions of the   experiment with stimuli that were equilibrated (balanced) for the amount of such expressive information.
Our equilibration procedure was based on a pilot experiment that compared equilibration methods based on different types of measures for low-level information. This included the total amount of optic flow (OF), the maximum deformation of the polygon mesh during the expression (DF), and the total motion flow of the polygon mesh during the expression (MF) (see 'Materials and methods' for details). In the control experiment, nine participants rated these equilibrated stimulus sets in terms of the perceived expressivity of their motion (independent of avatar type). Perceived expressivity was assessed by ratings using a nine-point Likert scale (1: non-expressive, 9: very expressive), presenting each stimulus in a block-randomized manner for four times.
The averages of these ratings, comparing the different low-level measures, are shown in Figure 6A. In addition, this figure also shows the ratings for the neutral expression, which are very low, and the ratings for the original non-equilibrated expressions. It turns out that balancing the amount of polygon motion (MF) resulted in the lowest standard deviation of the expressivity ratings after equilibration (except for the neutral condition 1:479; p<0:021).
More specifically, perceived expressivity showed smaller variance for the MF condition than for the DF conditions for the human avatar (F 1; 142 ð Þ ¼ 1:479; p<0:021). Also, for the monkey avatar, this variance was smaller than for all other conditions (F>1:403; p<0:045), except for the DF condition (F 1; 142 ð Þ ¼ 0:869; p>0:407). Moreover, the difference of the perceived expressiveness between the two avatars was non-significant (t 283 ð Þ ¼ 0:937; p>0:349) for equilibration with the DF measure. For these reasons, and also because it resulted in the equilibrated stimuli with the highest expressivity, we decided to use MF as a measure of the equilibration of the prototype motion in our main experiment (a more extensive analysis of these data and additional tested measures for lowlevel expressive information are discussed in Appendix 1).
Equilibration was based on creating morphs between the original motion-captured expressions and a neutral expression, varying the morphing weight of the neutral expression in a way that resulted in a matching of the amount of motion flow (see 'Materials and methods'). Equilibration was realized separately for the two avatars and also for the different view conditions. Figure 6B shows an example of the effect of equilibration on the extreme frames of a monkey-threat expression. The equilibration also reduces the very salient mouth opening motion of the monkey, which, due to anatomical differences, cannot be realized by a real human face. The efficiency of the procedure in balancing the amount of motion information is illustrated in Figure 6C. It illustrates the motion flow before and after equilibration for the different points of our motion style space for the front view. 0 degree Monkey av.
Human av.
P 2 (e,s) P 1 (e,s) P 3 (e,s) P 4 (e,s) P 2 (e,s) P 1 (e,s) P 3 (e,s) P 4 (e,s)  Figure 1D (i = 1: human-angry, 2: human-fear, 3: monkey-threat, 4: monkey-fear). (A) Results for the stimuli generated using original motion-captured expressions of humans and monkeys as prototypes but with occluded ears, for presentation on a monkey and a human avatar (only using the front view). (B) Significance levels (Bonferroni-corrected) of the differences between the multinomially distributed classification responses for the 25 motion patterns, presented on the monkey and human avatar.
The standard deviation of the motion flow across the 25 conditions in style space is reduced by 83% for the monkey avatar and by 54% for the human avatar by the equilibration. Constraining the flow analysis to the mouth region, we found that the standard deviation of the corresponding motion flow across conditions was reduced by 79% for the monkey avatar and by 59% for the human avatar (results for the other view conditions are similar).
The fitted discriminant functions for the data from the repetition of the experiment with equilibrated stimuli are shown in Figure 7A. These functions are more symmetrical along the axes of the morphing space than for the original stimuli (for example, this reduces the amount of confusions of human anger and monkey fear expressions that occurs for intermediate levels of the style parameters, especially for the human avatar, potentially due to the subtlety of the monkey fear expression). This is corroborated by the fact that an asymmetry index (AI) that measures the deviation from a perfect symmetry with respect to the e and s axes (see Appendix 1) is significantly reduced for the data from the experiment with equilibrated stimuli compared to the data from the experiment using the original motion prototypes (AI original ¼ 0:624 vs: AI equilibrated ¼ 0:504), the difference being significant according to the Wilcoxon signed-rank test (Z ¼ 2:49; p<0:013). Compared to the original stimuli, we found an even higher similarity of the discriminant functions between the two avatar types and the different view conditions. This is corroborated by the small ratios of different vs. shared variance between the conditions (q ¼ 4:01%), where only 4% of the categorization responses across the 25 points in morphing space were significantly different between the avatar types and view conditions, according to a contingency table analysis ( Figure 7B).
Most importantly, also for these equilibrated stimulus sets, we found a narrower tuning for the human than for the monkey dynamic expressions ( Figure 4A). This is confirmed by the results of the ANOVA for the threshold points of the tuning functions D M (s) and D H (s) ( Figure 4B), which failed to  show a significant influence of the factor stimulus type (original vs. occluded vs. equilibrated stimuli) (see above). An analysis of the steepness of the fitted threshold functions is shown in Figure 4C. This analysis shows that the equilibration procedure effectively balances the steepness of the tuning functions between the human and the monkey expressions, which is apparent in the non-equilibrated stimuli. This observation is confirmed by two-way ANOVAs for the original motion stimuli and the ones with occluded ears, which show significant influences of the factor avatar type/view (F 3; 83 Summarizing, these results show that the high similarity of the classification data of the stimuli between the two different avatar types, and between the different view conditions, was not fundamentally changing if the expressiveness of the stimuli was controlled. Also, the tendency for a narrower tuning for human own-species expressions was robust against this manipulation. However, balancing expressiveness leveled out the differences in the steepness of the computed species-tuning functions. This rules out the objection that the observed effects are just an implication of differences in the amount of low-level salient features of the chosen prototypical motion patterns.   Figure 7. Fitted discriminant functions P i (e,s) for the experiment with equilibration of expressive information. Classes correspond to the four prototype motions, as specified in Figure 1D (i = 1: human-angry, 2: human-fear, 3: monkey-threat, 4: monkey-fear). (A) Results for the stimuli set derived from prototype motions that were equilibrated with respect to the amount of local motion/deformation information, for presentation on a monkey and a human avatar. (B) Significance levels (Bonferroni-corrected) of the differences between the multinomially distributed classification responses for the 25 motion patterns, presented on the monkey and human avatar.

Discussion
Due to the technical difficulties of an exact control of dynamics of facial expressions (Knappmeyer et al., 2003;Hill et al., 2005), in particular of animals, the computational principles of the perceptual representation of dynamic facial expressions remain largely unknown. Exploiting advanced methods from computer animation with motion capture across species and machine-learning methods for motion interpolation, our study reveals fundamental insights about the perceptual encoding of dynamic facial expressions across primate species. At the same time, the developed technology lays the ground for physiological studies with highly controlled stimuli on the neural encoding of such dynamic patterns (Polosecki et al., 2013;Chandrasekaran et al., 2013;Barraclough et al., 2005;Furl et al., 2012).
Our first key observation was that facial expressions of macaque monkeys were learned very quickly by human observers. This was the case even though monkey expressions are quite different from human expressions, so that naive observers could not interpret them spontaneously. This fast learning might be a consequence of the high similarity of the neuromuscular control of facial movements in humans and macaques (Parr et al., 2010), resulting in a high similarity of the structural properties of the expression dynamics that can be exploited by the visual system for fast learning.
Secondly, and unexpectedly from shape-based accounts for dynamic expression recognition, we found that the categorization of dynamic facial expressions was astonishingly independent of the type of primate face (human vs. monkey) and of the stimulus view (0 vs. 30 degrees of rotation of the head about the vertical axis). Clearly, this shows a substantial degree of invariance against changes of the two-dimensional image features. More specifically, we neither found strong differences between categorization responses dependent of these parameters, nor did we find a better perceptual representation of species-specific dynamic expressions that match the species of the avatar (e.g., a more accurate representation of human expressions on the human avatar or of monkey expressions on the monkey avatar). Facial expression dynamics seems thus represented independently of the detailed shape features of the primate head and of the stimulus view.
Yet, we found a clear and highly robust own-species advantage (Scott and Fava, 2013;Pascalis et al., 2005) in terms of the accuracy of the tuning for expression dynamics: the tuning along the species axis of our motion style space was narrower for human than for monkey expressions. This remained true even for stimuli that eliminated species-specific features, such as ear motion, or which were carefully equilibrated in terms of the amount of low-level information.
Both key results support our initial hypotheses: perception can exploit the similarity of the structure of dynamic expressions across different primate species for fast learning. At the same time, and consistent with a co-evolution of the visual processing of dynamic facial expressions with their motor control, we found a largely independent encoding of facial expression dynamics from a basic facial shape in primate expressions. This observed independence has to be further confirmed in more extended experiments, including a bigger spectrum of facial shapes and, probably, even faces from non-primate species. In fact, the observation that humans observe facial expressions readily from comic characters, which do not even correspond to existing species, suggests that the observed invariance goes far beyond primate faces. However, further experiments including a much wider spectrum of facial shapes will be required to confirm this more general hypothesis.
The observed independence of basic facial shape and expression encoding seems in-line with results from functional imaging studies that suggest a modular representation of different aspects of faces, such as changeable and non-changeable ones (Haxby et al., 2000;Bernstein and Yovel, 2015;Dobs et al., 2019). At the same time, it is difficult to reconcile our experiments with several popular (recurrent) neural network models that represent facial expressions in terms of sequences of learned key shapes (Curio et al., 2010;Li and Deng, 2020). Since the shape differences between human and monkey faces are much larger than the ones between the keyframes from the same expression, the observed spontaneous generalization of dynamic expressions to faces from a different primate species seems difficult to account for by such models.
Concrete circuits for a shape-independent encoding of expression dynamics still have to be discovered. One possibility is that they might exploit optic-flow analysis (Giese and Poggio, 2003;Jhuang et al., 2007), since the optic flow of the expressions on different head models might be similar. Another possibility would be mechanisms that are based on 'vectorized encoding', where the face shape in individual stimulus frames is encoded in terms of their differences in feature space from a 'reference' or 'norm face' (Giese, 2016;Leopold et al., 2006;Rhodes and Jeffery, 2006;Beymer and Poggio, 1996). We have demonstrated elsewhere that a very robust recognition of dynamic facial expressions can be accomplished by a neural recognition model that is based on this encoding principle (Stettler, 2020), where norm-referenced encoding had been shown to account for the identity tuning of face-selective neurons in area IT (Leopold et al., 2006;Giese and Leopold, 2005). The presented novel technology for the generation of highly realistic dynamic face avatars of humans and monkeys enables electrophysiological studies that clarify the exact underlying neural mechanisms. A similar methodological approach was quite successful for discovering of the neural mechanisms of the identity of static faces (e.g., Leopold et al., 2006;Murphy and Leopold, 2019;Chang and Tsao, 2017).

Materials and methods
Key resources

Human participants
In total, 78 human participants (42 females) participated in the psychophysical studies. The age range was 21-53 years (mean 26.2, standard deviation 4.71). All participants had no prior experience with macaque monkeys and normal or to-normal corrected vision. Participants gave written informed consent and were reimbursed with 10 EUR per hour for the experiment. In total, 31 participants (16 females) were taking part in the first experiment using stimuli based on the original motion capture data and the experiment with occlusion of the ears. 22 participants (13 females) took part in the experiment with equilibrated motion of the prototypes. In addition, 16 participants (eight females) took part in a Turing test control experiment (see below), and nine (five females) participants took part in a control experiment to identify features that influence perceived expressiveness of the stimuli. All psychophysical experiments were approved by the Ethics Board of the University Clinic Tü bingen and were consistent with the rules of the Declaration of Helsinki.

Stimulus presentation
Subjects were presented the stimuli watching a computer screen at a distance of 70 cm in a dark room (view angle about 12 degrees), with a resolution of 720 Â 720 pixels using MATLAB and the Psychotoolbox (3.0.15) library for stimulus presentation. Each stimulus was repeated for a maximum of three times before asking for the responses, but participants could skip after the first presentation if they were certain about their responses. Participants were first asked whether the perceived expression was rather from a human or a monkey, and whether it was rather the first or the second expression. Responses were given by key presses. Stimuli for the two different avatar types were presented in different blocks, with 10 repeated blocks per avatar type.

Dynamic monkey and human head model
For our experiments, we exploited a monkey and a human dynamic face avatar with a very high degree of realism. The monkey head model was derived from a structural magnetic resonance scan of a rhesus monkey (9 years old, male). The surface of the face was modeled by an elastic mesh structure ( Figure 1C) that imitates the deformations induced by the major face muscles of macaque monkeys (Parr et al., 2010). To accomplish a highly realistic appearance, special methods for hair animation and an appropriate modeling of skin reflectance were applied ( Figure 1A). The human head was based on a scan-based commercial avatar with blend-shape animation, exploiting a multichannel texture simulation software. Mesh deformations compatible with the human face muscle structure were computed from motion capture data in the same way as for the monkey face. Further technical details about the creation of these head models are described in Appendix 1. The used dynamic head models achieve state-of-the-art degree of realism for the human head, and to our knowledge, we present the only highly realistic monkey avatar that is animated with motion capture data from real animals used in physiology so far. It has been demonstrated by a recent study of our lab that our dynamic monkey avatar induces behavioral reactions of macaque monkeys that are very similar to ones elicited by real movies, reaching the 'good side' of the uncanny valley (Siebert et al., 2020), contrasting with previous studies using avatars in experiments with monkeys (Chandrasekaran et al., 2013;Campbell et al., 2009;Steckenfinger and Ghazanfar, 2009). A related result has been obtained recently for static pictures of monkeys, demonstrating comparable looking times for the avatar and real pictures of monkey expressions, but without expressive motion of the face (Bilder and Lauhin, 2014).

Motion generation and style space
The animation of our avatars was based on motion capture data of two real monkey and human expressions. For motion capture, we used a VICON motion capture system with a marker set of 43 markers that were placed on the face of a monkey and a human participant. Facial expressions were elicited by instructions, or by having the animal interact with an experimenter, respectively. For this study, we exploited multiple repetitions of two human and two monkey expressions (anger/threat and fear), and additional trials with neutral expressions. Further details about motion capture and the transfer of the motion to the head models are given in Appendix 1.
In order to control the information content and the expressivity of the dynamic face stimuli, we created motion morphs between these prototypical expressions. For this purpose, we exploited a method that is based on a generative Bayesian model of the trajectories of the control points of the face. This algorithm allows to create linear combinations in space-time between these prototypical motions, controlling smoothly the expressiveness and the style of the created facial motion. We verified in an additional control experiment (Turing test) that animations created with the original motion capture data were indistinguishable from the ones generated with motion trajectories generated with this Bayesian algorithm (reproducing the prototypes by the generative Bayesian model) (see Appendix 1 about details concerning this algorithm and the Turing test experiment).

Modeling of the classification responses
Using a multinomial logistic regression analysis, the relative frequencies of the four classesĈ j e; s ð Þ were approximated by class probabilities P j e; s ð Þ for the four classes that were modeled by a generalized linear model (GLM) of the form P i e; s ð Þ ¼ e yi P 4 j 0 ¼1 e y j 0 (1) The variables y j were given by linear combinations of predictor variables X i in the form We compared a multitude of models, including different sets of predictors. The most compact model was linear in the style space variables e and s and was given by the equation We also tested variants of linear models that included the predictor variable e Ã s and a predictor variable that is proportional to the total amount of optical flow, computed using a Horn-Schunck algorithm (CV Toolbox) from the stimulus movies. The different versions of the model were compared exploiting their prediction accuracy and the BIC. We discarded the models if, after addition of a new predictor, either their accuracy was decreasing or the BIC showed a decrease. Further details of the model fitting procedure are described in Appendix 1.

Computation of the tuning functions
The species-tuning functions were computed by marginalization of the discriminant functions belonging to the same species category along the variable e. The tuning function to monkey expressions as a function of the species parameter s was defined as D M s ð Þ ¼ R 1 0 P 1 e; s ð Þ þ P 2 e; s ð Þ ð Þde. Similarly, the tuning function for human expressions was given by D H s ð Þ ¼ Þde. For this function, the direction of the s-axis was flipped, so that the category center also appears for s = 0, just as for the function D M (s).

Equilibration of stimuli for amount of motion/deformation
In order to control the amount of expressive low-level information, that is, the total amount of motion or shape deformation, we generated sets of equilibrated stimuli. For this purpose, we first defined different measures for the low-level information content and balanced the stimuli by equilibrating these measures. Tested measures included ( Figure 5A) optic flow (computed with an optic flow algorithm) (OF), the maximum amount of deformation (projected to the plane) of the polygon mesh relative to the neutral pose (DF), and the (two-dimensional) motion flow of the polygon mesh integrated over time (MF). To control the information content of the stimuli, we generated morphs between the original motion and the trajectories of a neutral expression using our motion morphing technique. In these morphs, the original expression was weighted with the morph level l and the neutral expression with the weight 1 À l ð Þ. The parameter l was chosen to equate the low-level measures of all four prototypical stimuli, separately for the two avatar models (for the front view). For this purpose, we fitted the relationship between the individual measures M for the low-level information and the morphing parameter l by a logistic function of the form (a i signifying constants) The inverse of this function was used to determine the values of the morph parameter l that matched the value M of the most expressive prototype motion. The MF measure resulted in the least variability of the perceived expressiveness of the equilibrated stimuli (see 'Results'), and thus was used to equilibrate the stimuli for all experimental conditions.

Statistical analysis
Statistical analyses were implemented using MATLAB and RStudio (3.6.2), using R and the package lme4 for the mixed models of ANOVA. We used G*Power 1.3 software to compute a prior rough estimate of the minimum required number of participants for medium effect size.
Different GLMs for the modeling of the categorization data were fitted using the MATLAB Statistics Toolbox. Models for the discriminant functions, including different sets of predictors, were compared using a step-wise regression approach. Models of different complexity were compared based on the prediction accuracy and by exploiting the BIC.
Two statistical measures were applied in order to compare the similarity of the categorization responses for the two avatar types. First, we computed the ratio of the different vs. shared variance between the fitted discriminant functions. For this purpose, we first computed the average discriminant function across both the avatar types and the two view conditions, and separately for the different classes (the index k running over the avatar types and view conditions, and j indicating the class number): The ratio of the variance that is different and shared between the four conditions (avatars and views) is then given by the expression This ratio is zero if the discriminant functions across all four conditions are identical.
As second statistical analysis, we compared the multinomially distributed four-class classification responses across the participants for the individual points in morphing space using a contingency table analysis that tested for the independence of the class probabilities from the avatar types and the two view conditions. Statistical differences were evaluated using a 2 -test and, for cases for which predicted frequencies were lower than 5, we exploited a bootstrapping approach (Wilson et al., 2020).
The species-tuning functions, D H (s) and D M (s), were fitted by the sigmoidal function , with the parameter determining the threshold and !, the steepness. Differences of the tuning parameters were tested using two-factor mixed-model ANOVAs (species-specific of motion (monkey vs. human) as the within-subject factor and experiment (original motion, occlusion of the ears, and equilibrated motion) as the between-subject factor). Differences of the steepness parameters ! were tested using within-subject two-factor ANOVAs.

Acknowledgements
Special thanks to Tjeerd Dijkstra for the help with advanced statistical analysis techniques. We thank H and I Bü lthoff for helpful comments. This work was supported by HFSP RGP0036/2016 and EC CogIMon H2020 ICT-23-2014/644727,and European Research Council ERC 2019-SYG under EU Horizon 2020,. MG was also supported by BMBF FKZ 01GQ1704 and BMG: SSTeP-KiZ (grant no. ZMWI1-2520DAT700). RS, SS, PD, and PT were supported by a grant from the DFG (TH 425/12-2). Further support by NVIDIA Corp. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics
Human subjects: All participants participated voluntarily in our study and gave written informed consent about this, and about the use of their data in a scientific study. Data was used in a fully anonymised fashion, nowhere we published data from individual participants. All psychophysical experiments were approved by the Ethics Board of the University Clinic Tü bingen (886/2018BO2) and were conducted consistently with the rules of the Declaration of Helsinki. Animal experimentation: All animal procedures were approved by the local animal care committee (Regierungsprä sidium Tü bingen, Abteilung Tierschutz, permit number: N4/14) and fully complied with German law and the National Institutes of Health's Guide for the Care and Use of Laboratory Animals. A particular challenge was the development of a realistic skin model that specifies believable color and reflectance properties. Skin-surface textures were generated using photographs of a real monkey as the reference and by painting layer-wise color variations of the skin in order to approximate maximally its realistic appearance. Specifically, we used multiple layers of diffusive texture to model the translucent behavior of the skin, separately for the deep layer, subdermal layer, and epidermal layers (Appendix 1-figure 1D). For the deep layer, we hue-shifted the diffuse texture map toward red colors in order to model the deep vascularization, while the color of the subdermal layers was shifted more toward yellow in order to simulate the fatty parts of the skin. In order to mimic the very thin superficial layer of dead skin, we desaturated the diffuse texture for the epidermal layer. For realistic appearance, it was also important to model the specularity of monkey skin, reproducing how light is reflected from the skin within different facial regions. For this purpose, we created two specular maps for the monkey's face, one simulating the basic specularity and one describing the oiliness vs. wetness of the skin (Appendix 1-figure 1E). Both material channels have an Index of Refraction (IOR) of 1.375, corresponding to the IOR of water.

Decision letter and Author response
A final element that was essential for a realistic appearance was the realistic modeling of the fur. For this purpose, we exploited the built-in XGen Interactive Groom feature for hair creation of the animation software Maya. The overall appearance of the hair was controlled exploiting three control levels. The first level models the base of the fur, defining the direction of the hairs by control splines and adding some noise to model texture fluctuations and the matting of the fur. The second level models structures consisting of long hair, including the whiskers and the brows, using a smaller number of thick hairs. Believability and realism were increased further by adding a third layer of hair, also known as Peach Fuzz or Vellus, that consists of tiny hairs that are distributed within the face area. The final result is shown in Appendix 1-figure 1F.

Human avatar
The human avatar was based on a female face scan provided by the company EISKO ( Appendix 1- figure 2A). The commercial package also includes all main textures (diffuse map, specular map, base displacement map, etc; Appendix 1-figure 2B and C) for the neutral pose, as well as the corresponding textures for face compression and stretching (Appendix 1-figure 2B and C).
We applied just small color adjustments to the diffuse and specular maps, similar to texture creation of the monkey head model. The EISKO model package also included a whole face rig with 154 blend shapes, suitable for changing the face shape by blending (interpolation), resulting in naturally looking shape variations (Appendix 1- figure 2A). The interpolation was driven by control points equivalent to the ones in the monkey model, defined by the motion-captured markers exploiting a ribbon-like structure that was inspired by the human muscle anatomy (Appendix 1-figure 2F). Using a tension map algorithm, we determined the local deformations of the texture from the mesh deformations relative to the neutral pose. For the generation of high-frequency details, contrasting with the approach for the monkey avatar, we employed a multichannel texture package from TexturingXYZ. This package provides diffuse maps and high-frequency details as displacement maps (pores, wrinkles, etc) derived from a scanned real face. Exploiting the programs R3dS Wrap (trial version) and xNormal, we transferred shape details similar to the ones of the monkey face to the human face model (Appendix 1-figure 2G). Hair animation used the same tools as for the monkey face. The final result is shown in Appendix 1- figure 2H.

Motion capture
Motion capture was realized with a VICON FX20 motion capture system with six cameras (focal length, 24 mm) using a camera setting that was optimized for face capturing. We used 43 reflecting markers (2 mm) that were placed in the face, using a marker set that we developed ourselves ( Figure 1B and S1B). Recording frequency was 120 Hz. Trajectories were preprocessed using Nexus 1.85 software by VICON, smoothed and segmented by an expert into individual facial expressions with a duration between 3 and 5 s. The trajectories were resampled with 150-time steps and 30 fps.
The monkey expressions were recorded from a 9-year-old male animal. The expressions were elicited by showing the animal different objects, including a screw driver, a mirror, and an unknown male human individual. The animal was head-fixed and observed the stimuli at a distance of 200 cm in front of the camera set-up. The recorded trajectories were segmented by a monkey expert, who had extensive experience with macaque monkeys on a daily basis.
The human marker set was corresponding to the one of the monkey, except that it lacked markers on the ears (Appendix 1-figure 2D). The human actor was instructed to show two facial expressions 'anger' and 'fear'. Processing was identical to the marker trajectories of the monkey expressions.

Motion morphing algorithm
In order to create continuous parameterized spaces of facial movements, we exploited a motion morphing method that is based on a hierarchical Gaussian process model. The method is in principle real-time capable, thus allowing for instantaneous changes of motion style modulations based on the on-line user input. This functionality was not critical for the experiments presented in this paper, but it is used in ongoing experiments that build on the presented results.
Our motion morphing algorithm is based on a hierarchical probabilistic generative model that is learned from facial movement data. The architecture (Appendix 1- figure 3A) comprises three layers. The lower two layers are formed by Gaussian process latent variable models (GP-LVMs) (Lawrence and Moore, 2007), and the highest layer is formed by a Gaussian process dynamical model (GPDM) (Wang et al., 2008). They are binomially distributed. Plate notation indicates the replication of model components for the encoding of the temporal sequence, and the different styles. Nonlinear functions are realized as samples from Gaussian processes with appropriately chosen kernels (for details, see text). (B) Results from Turing test experiment. Accuracy for the distinction between animations with original motion capture data and trajectories generated by our motion morphing algorithm is close to chance level (dashed line), opposed to the accuracy for the detection of small motion artifacts in control stimuli, which was almost one for both avatar types.
The facial motion was given by the M-dimensional trajectories of the control points, which were parameterized by an N x D time series matrix Y = [y 1 , . . .,y N ] T with N = 600 (two expressions) or N = 900 (neutral expression included for equilibration) and D = 208 dimensions. The two GP-LVM layers reduce the dimensionality of the patterns in the high-dimensional trajectory space in a nonlinear way. For this purpose, the first layer represents the trajectory points as nonlinear functions of a lower-dimensional hidden state variable, specifying the N x M matrix H = [h 1 , . . .,h N ] T . In our case, the dimensionality M of the hidden variables h n was six. Signifying (y d ) T , the row vectors of the matrix Y, the trajectory components are modeled in the form where the variables " m specify independent Gaussian noise vectors, and with the function f 1 being drawn from a Gaussian process f 1~G P 0; k 1 h; h 0 ð Þ ð Þ, that is, all vectors of the T are distributed according to the Gaussian distribution N fj0; K ð Þ with the covariance matrix K, whose elements are specified by a kernel function k 1 in the form K nn 0 ¼ k 1 h n ; h n 0 ð Þ þ g 1 d nn 0 . The kernel function is given by a linear combination of two types of kernels, a radial basis function kernel and a linear kernel The RBF (Radial Basis Function) part allows to capture nonlinear structures in the data, while the linear kernel supports smooth and linear inter-and extrapolation in the pattern space. In addition, we found that the Kronecker delta part of the Kernel matrix is critical for the smoothness of the learned trajectories in the latent space. The parameter b 1 specifies the inverse width of the Gaussian radial basis functions.
The second layer of the model is defined exactly as the first layer. Here, the dimensionality of the variables h n is further reduced by generating a nonlinear mapping from the hidden state variables x n with Q = 2 dimensions, which defines the matrix X = [x 1 , . . .,x N ] T . Like in the first layer, the nonlinear mappings between the components of the variables x and h are defined by functions drawn from a Gaussian process f 2~G P 0; k 2 x; s; e; x 0 ; s 0 ; e 0 À Á À Á , where the hyper-parameters of the kernel function differ from the ones of the kernel k 1 . In addition, the kernel of this layer depends on the style vector variables e and s. These variables enter the kernel of the Gaussian process as multiplicative linear kernel terms The random variables e and s encode the motion style using one-out-of-two encoding, and they were estimated from the training data together with the state variables x n using a maximum-likelihood approach. This parametrization turns out to be favorable to separate the different style components and the motion content in the latent space, similar to multi-factor models (Wang and Fleet, 2007). We constrained the style vectors for all trials of the training data that represented the same motion style (e.g., 'expression 1', 'human motion') to be equal. In this way, the training data specify estimatesê 1 andê 2 that correspond to averages of the expression types 1 and 2, and estimatesŝ M andŝ H that correspond to the average monkey and the human expressions. In order to generate new intermediate motion styles, we ran the learned Gaussian model in a generative mode, fixing the values of these style vectors to blends between these estimates. The style vectors for the motion morphs as functions of the style parameters e and s, as discussed in the main part of the paper, were given by the relationships e ¼ eê 1 þ 1 À e ð Þê 2 and s ¼ sŝ M þ 1 À s ð Þŝ H , respectively. The highest level of the probabilistic model approximates the dynamics of the trajectories of the hidden state variables x n using a nonlinear extension of an auto-regressive model, which is known as GPDM. For this purpose, the state dynamics is modeled as a function of the two-dimensional hidden state variable x n that obeys the nonlinear dynamics where n is isotropic white Gaussian noise and where the nonlinear function f 3 is again specified by a Gaussian process. The hidden state dynamics can again be learned using a GP-LVM framework (Wang et al., 2008), where we used a kernel function of the form k 3 x nÀ1 ; x nÀ2 ; x n 0 À1 ; x n 0 À2 À Á ¼ g 6 exp Àb 3 x nÀ1 À x n 0 À1 2 Àb 4 x nÀ2 À x n 0 À2 2 þ g 7 x T nÀ1 x n 0 À1 þ g 8 x T nÀ2 x n 0 À2 : To determine the parameters of the GP-LVMs, we maximized the logarithm of their marginal likelihood and fitted all hyper-parameters using a scaled conjugate gradient algorithm (Møller, 1993). Since the evaluation of the marginal requires the inversion of a kernel matrix with a dimensionality that is given by all pairs of latent points, its direct implementation is computationally infeasible for large data sets. To render this inversion feasible, we applied a sparse approximation method that approximates the marginal distribution based on a low number of inducing points in the hidden spaces (Lawrence, 2007). The model was trained using six motion-captured example trajectories for each of the two basic human and monkey expressions, sampled with 150-time steps. Training using an AMD Ryzen Threadripper 1950X CPU with 32 cores with a clock frequency of 3.4 GHz took about 1.5 hr. The most important parameters of the algorithm are summarized in Appendix 1-table 2.

Turing test experiment
The described motion morphing algorithm interpolates between the original motion-captured movements in space-time. It was critical to verify that the morphing algorithm does not destroy the naturalness of the facial movements, at least for the prototypical expressions between which we blended. In order to verify this question, we realized a Turing test experiment that included 16 new participants. They had to discriminate between animations with original motion capture data ('original trajectories') and ones generated with movements that were generated by the morphing algorithm ('algorithm-generated trajectories'). The movements generated with the algorithm approximated the prototype movements (the style variables e and s being 0 or 1). In order to induce some variability, we used three different motion capture trials of each of the original human and monkey expressions, and their approximations based on the morphing model. The compared stimulus pairs were presented sequentially, and motions were presented in a block-randomized order 20 times, in separate blocks for the two avatar types. To verify that participants can pick up artifacts in the animations, we added a further condition in which instead of movements generated by the morphing algorithm we used control movements, which were generated by reversing the temporal order of short four-frame segments in the original motion-captured movements. Animations with these control movements also had to be distinguished from ones with the original motion capture data.
The results of this control experiments are shown in Appendix 1- figure 3B. The accuracy of the detection of original motion capture data, as opposed to the generated one, was 40.6% for the monkey avatar and 47.5% for the human avatar. Compared to the chance probability 0.5, both values were significantly lower ( 2 1; 16 ð Þ ¼ 18:18; p<0:001 for the monkey and 2 1; 16 ð Þ ¼ 11:43; p<0:001 for the human avatar). This implies that the animations using motion capture data were judged even less frequently as 'original trajectories' than the animations generated with our motion synthesis algorithm. The morphing algorithm thus does not degrade the perceived naturalness of the motion. The even higher perceived naturalness of the algorithm-generated motion likely is a consequence of the motion being slightly more smooth, due to the smoothing properties of Gaussian process models. The artificial control movements were detected with very high reliability, as indicated by the high accuracies 96.88%, for the monkey, and 96.56%, for the human avatars, which are highly significantly different from chance level ( 2 ð1; 16Þ ¼ 35:85; p<0:001 vs. 2 1; 16 ð Þ ¼ 34:93; p<0:001).

Comparison of different classification models
Different multinomial regression models were compared in order to find the most compact model that explains our classification data. The models differed in terms of the predictor variables of the linear model for the approximation of the variables y j . The six compared models were defined as . Model 1: ;j e Á s, and . Model 6: Apart from the style variables e and s, the variable OF signifies the optic flow computed from the image sequence with an optic flow algorithm. Models were compared based on two criteria. First, we required that the introduction of additional predictors did not result in a significantly higher prediction accuracy. According to this criterion, for almost all stimulus types, model four was the most compact model for the front view stimuli (Appendix 1-table 1). Only for the rotated views of the avatars, however, we found a slight significant increase of the prediction accuracy (by less than 1.57%). For this reason, we decided to use model four as the basis for our further analyses of all classification data in the main experiment.

Testing of low-level information that predicts expressivity
Since we found for natural dynamic expressions that a larger part of the tested perceptual space was classified as monkey than as human expressions ( Figure 3C), we suspected this result to be a potential consequence of monkey expressions specifying more salient low-level features, such as local motion or geometrical deformations. In order to control for this variable, we created a second stimulus set for which the amount of low-level information was balanced. Since it was a priori unknown which type of low-level information drives the expressivity of facial expressions, we tested nine possible measures, quantifying the amount of low-level features in a separate psychophysical experiment with nine participants. These measures were two-dimensional optic flow computed from the movies with a Horn-Schunck algorithm (MATLAB implementation), the absolute spatial deformation relative to the neutral frame, and the motion flow computed either from the control point trajectories or from the regularized mesh points, either in three dimensions or after projection to the twodimensional image plane.
The spatial deformation relative to the neutral frame was quantified using the measures where X t signifies a vector that contains the relevant control point or (two-or three-dimensional) mesh point coordinates and where N is the number of stimulus frames. Likewise, the motion flow was defined by the quantity For the true optic flow, the motion measure was computed by summing up the absolute values of all estimated local motion vectors across the image. The stimuli for this experiment were motion morphs between each of the four prototype expressions (two human and two monkey expressions) and a neutral expression. The original expression entered the motion morph with a weight of l, and the neutral expression with a weight of 1 À l ð ), where the morphing weight was adjusted to obtain the same amount of low-level information in all adjusted prototypes.
In order to cut down the number of measures for the amount of low-level information in the first place, we generated a set of face motion stimuli with reduced and exaggerated expressivity, separately for the two face avatars, by choosing six different values for the morphing weight l (values 0-25-50-75-100-125% for the monkey expressions and the values 0-37.5-75-112.5-150% for the human expressions). For all rendered movies, we computed the nine different measures for the lowlevel feature content and analyzed their dependence on the morphing weight l and their similarities. We found that the measures DF and MF, computed from the two-dimensional and three-dimensional mesh coordinates, and the control points were very highly correlated (r > 86.24, r average = 98.74; p < 0.0403). The mesh point-based measures were monotonically increasing functions of the morphing level l. This was not the case for the quantities computed from the control point trajectories, due to which we discarded the measures derived from the control point trajectories from the balancing of the stimuli. Because of the high correlation between the measures computed from the twoand three-dimensional mesh-point trajectories, and the higher similarity of the two-dimensional trajectories with image motion, we kept only the measures computed from the two-dimensional meshpoint trajectories for the further analysis. In addition, we tested the optic flow computed by the optic flow algorithm from pixel images as a third possible predictor of the low-level information. For each of these three predictors, we constructed a balanced stimulus set by adjusting the morph levels of all prototypes, except for the one with the lowest low-level feature content, in order to match their low-level information contents. As a result, we obtained three balanced sets of stimuli, each with four dynamic expressions, separately for each avatar type.
All stimuli were shown in a block-wise randomized order to the participants who had to rate their expressivity on a nine-point Likert scale. For the human avatar, the stimuli, the expressivity of which was balanced using the motion flow measure MF, showed the smallest variability across participants and the largest expressivity. For the monkey avatar, the expressivity was rated similarly for stimuli balanced using the measures MF and DF, while it was significantly lower for stimuli balanced using the optic flow (t(275) = 2.8; p = 0.0054 and t(269) = 3.95; p < 0.001). A step-wise regression analysis, in which we predicted the expressivity ratings from the remaining measures (MF and DF computed from the two-dimensional mesh motion), showed that the motion flow MF is sufficient, while the other predictor DF did not add significant additional information. Using a model comparison analysis exploiting the Bayesian Information Criterion (BIC), we found no significant difference in the explanatory values of the models including the predictor MF, and the predictors MF and DF ( 2 1; 284 ð Þ ¼ 3:49; p ¼ 0:062). Appendix 1-table 3. Detailed results of the two-way ANOVAs. ANOVA for the threshold: two-way mixed model with expression type as within-subject factor and the stimulus type as between-subject factor for both the monkey and the human avatar. Steepness: two-way ANOVA with avatar type and expression factor for each stimulus motion type (original, occluded, and equilibrated). The mean square is defined as Mean Square = Sum of Square/df; df = degree of freedom.