Categorical emotion recognition from voice improves during childhood and adolescence

Converging evidence demonstrates that emotion processing from facial expressions continues to improve throughout childhood and part of adolescence. Here we investigated whether this is also the case for emotions conveyed by non-linguistic vocal expressions, another key aspect of social interactions. We tested 225 children and adolescents (age 5–17) and 30 adults in a forced-choice labeling task using vocal bursts expressing four basic emotions (anger, fear, happiness and sadness). Mixed-model logistic regressions revealed a small but highly significant change with age, mainly driven by changes in the ability to identify anger and fear. Adult levels of performance were reached between 14 and 15 years of age. Across ages, female participants obtained better scores than male participants, with no significant interaction between age and sex effects. These results expand the findings showing that affective prosody understanding improves during childhood; they document, for the first time, continued improvement in vocal affect recognition from early childhood to mid-adolescence, a pivotal period for social maturation.

childhood. Further, we hypothesized that this improvement continues during adolescence. This would be in accordance with functionalist views of adolescent development, which posit an adaptive refinement of social skills during this period of life 54, and with accruing data showing protracted maturation of the brain circuits involved in the perception of social and emotional signals (reviewed in 36). In addition, in line with what has been described for faces 3,8 and for prosody in children 27,29,34,45, we expected different trajectories for the different emotions: happiness is likely to be recognized with adult-level accuracy at a younger age and sadness at a later age. Lastly, using a large enough sample, we wanted to test whether the female advantage reported by Allgood and Heaton 48 could be replicated.
Effect of testing procedure. In order to pool all participants into a single analysis, we first verified that there was no effect of testing procedure and no interaction with other factors. We estimated a mixed model in the children-adolescents sample, including the continuous factor age and the categorical factors sex (male/female) and testing procedure (computer/paper), with a random intercept for subjects. There was no effect of testing procedure (Z = 0.23, p = 0.81) and no interaction with the other factors (with age: Z = −0.179, p = 0.85; with sex: Z = −0.19, p = 0.85). Removing this factor from the model did not alter the fit (Chi-square(4) = 0.75, p = 0.9447). In addition, when we modeled the data from the computer-tested and paper-tested samples separately, with age and sex as factors, we obtained similar parameter estimates in the two analyses and in the full model. Thus, the conclusions of our study hold whether we consider the full sample or only subsamples of participants tested with one or the other method. For simplicity, we therefore present below analyses of the full sample without the factor "testing procedure".
Effect of age and sex. The effect of age was highly significant (Z = 5.52, p = 3.4 × 10⁻⁸). Adding a quadratic term (age²) improved the model fit (Chi-square(1) = 5.14, p = 0.023), indicating a non-linear developmental trajectory with faster improvement during childhood, followed by slower changes and a plateau in late adolescence. The model with a cubic term did not converge. Based on the values predicted by the model at different ages (converted from logit to percent correct responses), the adults' mean level of performance is reached between ages 14 and 15.
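The conversion from the model's logit scale to percent correct responses can be illustrated with a short sketch. This is not the authors' analysis code, and the coefficient values below are hypothetical placeholders rather than the fitted estimates from the paper; only the inverse-logit transformation and the centering of age on the sample mean follow the procedure described here.

```python
import math

def inv_logit(x):
    """Convert a log-odds value to a probability (proportion correct)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical fixed-effect estimates (intercept, linear and quadratic age
# terms); NOT the values reported in the paper.
b0, b_age, b_age2 = 1.5, 0.12, -0.008
sample_mean_age = 11.9  # age is centered on the sample mean, as in the model

def predicted_percent_correct(age):
    """Predicted percent correct at a given age, inverting the logit link."""
    a = age - sample_mean_age
    logit = b0 + b_age * a + b_age2 * a ** 2
    return 100.0 * inv_logit(logit)
```

With a negative quadratic coefficient, the predicted curve rises quickly at younger ages and flattens toward a plateau, matching the shape of the trajectory described above.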
The effect of sex was also significant in the full model (Z = −2.017, p = 0.043), with females scoring significantly higher than males (83.2 ± 6.2% vs 80.3 ± 6.7%; post-hoc t(223) = 3.27, CI = [1.1, 4.5], p = 0.0012). There was no interaction between sex and age (comparison of models with and without the interaction term: Chi-square(1) = 0.021, p = 0.88), indicating similar age-related trends in boys and girls. Model predictions computed separately for boys and girls showed that the adults' level of performance (mean for female adults: 88.6 ± 9.9% correct; mean for male adults: 85 ± 8.7%) is reached between 14 and 15 years of age in both sexes (Fig. 1).

Effect of emotion.
Including the factor emotion in the mixed-model logistic regression revealed a highly significant effect of emotion (Z = 12.3, p < 10⁻¹⁶), in addition to the effects of age (Z = 6.6, p < 10⁻⁴), age-squared (Z = −2.36, p = 0.018) and sex (Z = −2.025, p = 0.043). There was no interaction between emotion and sex (Z = 0.014, p = 0.98). The interaction between emotion and the linear effect of age was not significant (comparison of models with and without the interaction term: Chi-square(1) = 2.41, p = 0.12), but the interaction between emotion and the quadratic term was (Chi-square(1) = 6.73, p = 0.00094). To explore this interaction, we fitted a logistic linear model on the percent correct recognition scores for each emotion separately (Fig. 2 and Table 1). This revealed significant age-related effects for all emotions, with the quadratic term significant only for sadness.
Across the whole age range, happiness was recognized with the highest accuracy, followed by sadness and fear at around the same level, while anger was the least well recognized (Fig. 3 and Table 3). Improvement across the age span was smallest for happiness and greatest for anger and fear. To characterize this effect, we compared the four emotions pairwise in five different age groups (children, pre-adolescents, early adolescents, late adolescents and adults; see Methods). The results are displayed in Table 2 and Fig. 3. Anger differed significantly from all other emotions in all age groups except in adults, where it did not differ from fear. Happiness was recognized significantly better than all other emotions in pre- and early adolescents, but the difference between happiness and fear was not significant in adults, and the difference between happiness and sadness was not significant in adults and children. Performance for fear and sadness stimuli did not differ in the youngest age groups, but fear was better recognized than sadness in late adolescents, and the reverse was true in adults.
Second, we examined how each emotion was miscategorized as each of the others in the incorrect trials. Figure 4 and Table 3 show the confusion matrices in the four age groups and in adults. The patterns of confusion were relatively similar across development. Anger was miscategorized most often as fear by children aged 5-8, and as sadness by adolescents. Fearful vocalizations were also sometimes miscategorized as angry by younger children as well as by adults, while this confusion was less apparent in adolescents. Instead, adolescents showed a tendency to categorize sad vocalizations as angry.
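A confusion matrix of this kind simply tabulates, for each portrayed emotion, how often each response label was chosen. The following is an illustrative sketch, not the authors' analysis code; the toy trial data are invented for demonstration.

```python
from collections import Counter

EMOTIONS = ["anger", "fear", "happiness", "sadness"]

def confusion_matrix(trials):
    """trials: iterable of (portrayed_emotion, chosen_label) pairs.
    Returns row-normalised proportions: for each portrayed emotion,
    the fraction of trials on which each label was chosen."""
    counts = {e: Counter() for e in EMOTIONS}
    for portrayed, chosen in trials:
        counts[portrayed][chosen] += 1
    matrix = {}
    for e in EMOTIONS:
        total = sum(counts[e].values())
        matrix[e] = {c: (counts[e][c] / total if total else 0.0)
                     for c in EMOTIONS}
    return matrix

# Toy data: anger mostly recognized correctly, sometimes confused with
# sadness (as reported for adolescents) or fear (as reported for children).
trials = ([("anger", "anger")] * 7
          + [("anger", "sadness")] * 2
          + [("anger", "fear")] * 1)
cm = confusion_matrix(trials)
```

Each row of the resulting matrix sums to 1 (when that emotion was presented), so the off-diagonal cells directly give the miscategorization proportions discussed above.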

Discussion
We report for the first time the developmental trajectory of emotion perception from the voice during childhood and adolescence. In a sample of 225 participants spanning ages 5 to 17, we show that the ability to identify emotion from short affective bursts, in a forced-choice task, improves slowly but significantly with age, reaching adults' level around the age of 15. The developmental curve is better characterized by a non-linear function, with faster improvement during childhood followed by slower changes and a plateau during adolescence. Across ages, female participants recognized emotions slightly more accurately than male participants, with no interaction between age and sex effects. Whether participants were tested on individual computers or in small groups with responses given on paper affected neither performance nor the observed developmental trajectory. We discuss these findings and their implications in relation to the existing literature on the maturation of non-verbal signal perception.
First, our findings significantly expand the literature on affective burst recognition during childhood. With a different population and material, they corroborate the results published by Matsumoto et al. 47, Chronaki et al. 27 and Allgood and Heaton 48, who reported an improvement in vocal affect recognition between 5 and 10 years of age. In contrast, Sauter et al. 29 did not observe any developmental differences in this age range, except for surprise, when participants had to categorize short vocalizations such as screams or laughs. In that study, which had a much smaller sample size, the young participants displayed very high accuracy, leaving little room for improvement. The stimuli employed may have appeared exaggerated and may therefore have been recognized better by younger children; the same participants had more difficulty identifying emotion from emotionally inflected single words (digits), and in this case a difference between younger and older participants was apparent. Together, these findings establish that maturation of affective voice processing takes place during childhood, although further investigation is needed to characterize the acoustical parameters that facilitate emotion recognition by young children. It must be noted that the stimuli used in our study, taken from the Montreal Affective Voices database, have been validated for their ecological value, their reliability and for yielding intensity judgments similar to those obtained with emotional stimuli from standard databases of facial and bodily expressions 55.
In addition, we show here that the improvement in vocal affect recognition continues beyond childhood into adolescence, and we are able to model this trajectory. Such protracted development is unlikely to be linked to age-related changes in the processing of the low-level acoustic cues that differ across emotions. Indeed, the ability to process frequency variations, or pitch, which differentiate these emotions 49,56, has been reported to be adult-like in children aged about 5 57. Nonetheless, late development has also been observed in other aspects of voice processing: Mann et al. 58 reported that the ability to recognize familiar voices and to form new memories of voices improves until 14 years of age. Some studies investigating sentence prosody recognition also suggest an improvement during adolescence: pooling data from several studies using one of the most standardized measures of receptive nonverbal communication 34 and its version using children's voices as stimuli (DANVA2-CP) points towards a linear improvement in affective prosody identification between 4 and 19 years (Table 2 in ref. 59). This is in agreement with a recent study showing that 13-15-year-olds were outperformed by adults in the recognition of affective sentence prosody, even more so in the more difficult condition of decoding emotion from sentences spoken by children 60. It is also consistent with a study using sentences composed of non-words, i.e. mimicking the phonological and morphosyntactic properties of language without distinguishable semantic information, showing that 11-13-year-old children performed much worse than adults at recognizing emotions from stimuli derived from their native language (English) as well as from foreign languages 46. In that study, however, the 11-13-year-olds were not better than the 8-10-year-olds, suggesting no development during late childhood.
The small sample size may have prevented the observation of small differences in performance between 10 and 13 years of age; indeed, our data show that the maturation curve is less steep here than in early childhood. In addition, it is possible that being exposed to pseudo-sentences from different languages adds another level of complexity, which might have a larger impact on pre- and early adolescents, who may automatically search for meaning and thereby be distracted from the task. Indeed, it has been shown that when other contextual information is present, young children, but also 13-year-olds, tend to rely less on prosody and more on semantic information than adults to infer the speaker's emotion 32. This might explain differences between studies, but also reflect the fact that vocal emotion discrimination might still be difficult at the beginning of adolescence. The choice of stimulus material and confounds related to different linguistic materials might also explain why some individual studies using sentences as stimuli have reported no change in prosody processing beyond late childhood 4,5,28.
Our data using non-linguistic vocalizations clearly show that vocal affect categorization progresses throughout childhood and well into adolescence. This developmental timeline is similar to that reported by many studies of the recognition of affective facial 4,7-9 or bodily 61 expressions. Such changes in social perception abilities are in line with the social challenges and refinements that occur during this period of life: as social relationships become more complex, individuals need to detect and categorize subtle social cues more accurately. Improvement in reading social signals also coincides in time with structural and functional changes in brain areas involved in processing emotional signals, such as the amygdala 62 or the ventral prefrontal cortex 63-66, as well as in posterior cortical areas involved in social cognition, including voice processing 38,42,43. Additional studies are needed to relate regional changes in brain activity and structure to behavioral changes in emotion processing. Nonetheless, the protracted development of social abilities, at the cerebral and behavioural levels, is in line with evolutionary theories proposing that the adaptive value of adolescence, as a distinct developmental stage, is to enable the maturation of refined social perception and social skills 67,68.
The observed female advantage for emotion categorization echoes numerous studies in adults and children focusing on different channels of emotion communication, including faces 7,52,53,69-71, body movements 72,73, and voice and prosody 49,74-76. Although robust in meta-analyses 20,21, this effect is small and not always apparent in individual studies. Its expression and amplitude may depend upon factors intrinsic to the stimuli or contexts 77. In particular, Sauter et al. 29 and Chronaki et al. 27, who also used non-linguistic affective vocalisations, did not report any statistically significant sex difference in affective voice processing in children, in contrast to our results and those of Allgood and Heaton 48. The larger interindividual variability in children, compared to adolescents and adults, may also have masked small differences. In addition, Chronaki et al. 27 used only three emotions (happiness, sadness and anger); in our dataset, the sex effect was largest for fear recognition in the youngest participants, which contributed to the overall statistically significant effect. Interestingly, this larger sex difference for fear recognition, as compared to other emotions, mirrors what has been reported in developmental studies of facial affect recognition 10. Crucially, we observed no interaction between sex and age effects: that is, the difference between boys and girls did not change with age. This developmental stability might reflect evolved differences between males and females present, at least partly, very early in ontogeny. It is in line with earlier meta-analyses of facial or multimodal affective expression recognition 53 and thus compatible with integrative models suggesting that scaffolding and/or socialisation factors contribute to maintaining gender differences in non-verbal signal processing that appear during infancy 20.
Moreover, the observation of the same developmental trajectory in boys and girls indicates that the improvement in recognizing emotions from vocalizations is independent of pubertal and hormonal changes, which occur earlier in girls.
Not all basic emotions were identified with equal accuracy. In fact, age-related changes, although significant for all emotions, varied across emotions. Across age groups, happiness was the easiest emotion to recognize. This mimics what has been described in adults 49 and children 27,47, and is also consistent with the superior recognition of happiness from facial expressions by adults and children alike 6,10,78. This high score for happiness might be explained, in part, by the fact that it was the only positive emotion and thus could potentially be categorized using valence information alone. The high accuracy of sadness recognition contrasts with previous reports of vocal affect recognition showing that young children had difficulty recognizing sadness 27,47 and reporting a slow development of the decoding of this emotion. This discrepancy could be related to the nature of the stimuli: in our study, sadness stimuli consisted of relatively salient cries lasting about 2 seconds, which were also recognized with relatively high accuracy by adults 49. In contrast, the stimuli in Chronaki's study 27 were shorter (700 ms), which may have made them more ambiguous and difficult to decipher. In accordance with this hypothesis, developmental studies of affective prosody decoding from sentences report high accuracy for sadness in children, pre-adolescents 29,46 and adolescents 60.
The two emotions that showed the greatest improvement with age were anger and fear, two negative-valence, threat-related emotions. Anger also proved to be the most difficult emotion to recognize, with the oldest adolescents still scoring lower than adults and often miscategorizing angry expressions as sad. This is consistent with some developmental studies of emotion recognition from vocal 47 or facial expressions 10,78. However, it contrasts with other studies in adults or younger children showing superior recognition of angry voices compared to other emotions such as sadness or fear 5,27,29,49. From a behavioural point of view, one could argue in theory that a reduced sensitivity to threat signals could promote social exploration and the expansion of interpersonal relationships, and thus be beneficial in late childhood and adolescence. The poor performance of children in fear recognition and the steeper development for this emotion are striking. They are in line with a few developmental studies on processing affective prosody from sentences in children and adolescents 5,46,60. In addition, meta-analyses indicate that the association between psychopathy and poor fear processing from prosody is stronger in children and adolescents than in adults 18, yet these meta-analyses are based on very few studies and lack comparison data. Our results could serve as normative data to assess vocal emotion recognition from non-linguistic material in pathological populations and eventually identify vulnerability periods and/or prognostic markers, for example for psychopathy 18 or schizophrenia 79. Likewise, the present study could serve as a starting point to investigate how perceptual (e.g. hearing threshold, or sensitivity to individual acoustic cues), personal or behavioural (e.g. IQ, social anxiety or attention) characteristics affect vocal emotion recognition across development.
Lastly, we observed the same range of performance as well as the same developmental trajectory in the two groups of participants, namely those tested with sounds presented over individual headphones and responses given on a keyboard vs. those tested with sounds presented over loudspeakers and responses given by circling labels on a pre-formatted paper sheet. Firstly, this indicates that children and adolescents were not influenced by the mode of sound presentation and response while expressing their judgments. Secondly, it shows that data from developmental studies conducted with different procedures, computerized or not, can be directly compared. The same conclusion was reached in a large-scale (n > 1500) study of facial expression recognition, where strikingly similar developmental trajectories were observed, for all emotions investigated, in sub-samples tested in school on a paper version, in the lab on a computerized version, or online 7. The same conclusion was also apparent in a meta-analysis of developmental studies of facial affect recognition 21. Thus, the choice of computerized or other conventional laboratory methods should not matter in the design of new developmental studies, facilitating the collection of larger datasets. In summary, our results demonstrate, for the first time, that the ability to identify basic emotions from non-linguistic elementary affect bursts continues to mature from childhood until mid-adolescence. This is akin to what has been described for the recognition of emotion through other channels, such as faces and bodies.

Table 1. Results from the mixed-model logistic regression analyses for each emotion separately. The linear and quadratic effects of age, the effect of sex and the interaction between age and sex effects were evaluated by comparing the fit of the logistic model with and without these factors 38.
Insight into this protracted development has implications for characterizing adolescents' interpersonal relationships and for shedding light on the development of more complex social competences, such as learning from others or social feedback processing 80. Charting the normal developmental path of vocal emotion perception also has critical implications for enhancing our understanding of neurodevelopmental disorders, such as schizophrenia, psychopathy or autism, for which abnormal voice processing has been proposed as an underlying factor of social anomalies 18,81.

Methods
Participants and procedure. Participants were recruited from several after-school clubs, a private school and a college in the greater Glasgow area (Scotland). Permission was obtained from the relevant authorities. Parents as well as children were informed of the purpose of the study and signed a consent form. The study was approved by the University of Glasgow College of Science and Engineering Ethics committee and all experimenters had appropriate disclosure from the Scottish government to work with children. All research was performed in accordance with relevant guidelines from the Declaration of Helsinki, recommendations from the Economic and Social Research council of the United Kingdom and local regulations in particular with regards to data anonymization and storage.
We collected complete datasets from 225 participants (104 males) aged 5-17 (mean 11.87; sd 3.34). Based on the existing literature on vocal and facial emotion recognition, this sample is adequate to detect linear age-related changes as well as gender effects 48,82 with small to medium effect sizes. One hundred and thirty-three participants were tested in small groups (three to ten participants), with the sounds presented through loudspeakers and responses reported on a preformatted sheet of paper with one line per trial; they circled or ticked, as they preferred, the box corresponding to their response. The other 92 participants were tested on individual laptops, with the sounds presented through headphones and responses made through keyboard presses. In both cases, responses were made by selecting one of four boxes labeled with the emotions happy, fearful, angry or sad and displayed as emoticons on a horizontal line. All participants were familiar with the matching between the emoticons (i.e. facial expression cartoons) and the individual emotions. Four different spatial orders were used and counterbalanced across participants to avoid response biases due to the location of the response boxes. All participants completed five practice trials before the experiment. For the youngest participants (<10 years old), the experimenters took extra care to ensure that the task was well understood and asked them to report a situation in which they would experience each of the emotions, to confirm that they knew the meaning of each label. The experimenter stayed close to the children in all cases, monitoring that they had no difficulty selecting the labels (with pencil or keypress) and that they moved smoothly from one trial to the next. As a control group, we tested 30 adults (age 20-47, 16 males), recruited through our local subject pool and tested on individual laptops.
Stimuli and design. Stimuli were taken from the Montreal Affective Voices database (MAV 49). They consist of nonverbal affective bursts uttered by five male and five female speakers, selected out of 22 actors to yield maximum recognition agreement among young adult listeners. We thus had ten exemplars of each emotion (anger, fear, happiness and sadness), and each participant was randomly presented with five stimuli for each emotion. In a four-alternative forced-choice task, they were asked to select the emotion they thought the speaker was expressing.

Analysis.
We assessed the effect of age on the proportion of correct labeling choices with mixed-model logistic regression 83,84. We used the lme4 package (version 1.0-5) in R 85. We entered each response as a binary variable (i.e. correct or incorrect) into the regression model, with the random effect "subject" and the fixed effects "age" (centered on the sample mean), "sex" and "testing procedure" (computer or paper), as well as the interaction terms. We tested both linear and non-linear effects of age by adding polynomial terms to the regression model. The significance of main effects and interactions is reported in terms of the Wald statistic. Fixed effects were also further evaluated using chi-square tests on the log-likelihood values, comparing models with and without the effect of interest. From the model yielding the best fit, we computed the predicted percent correct responses at each age by inverting the logit link, using the "fitted" function in R.
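The model comparisons via chi-square tests on log-likelihoods reported in the Results correspond to standard likelihood-ratio tests. As an illustrative sketch (not the authors' R code), for two nested models differing by a single parameter the chi-square(1) p-value can be computed from the complementary error function, since P(χ²₁ > s) = erfc(√(s/2)):

```python
import math

def lrt_pvalue_df1(loglik_reduced, loglik_full):
    """Likelihood-ratio test for two nested models differing by one
    parameter. The statistic 2*(ll_full - ll_reduced) is chi-square(1)
    distributed under the null; its upper-tail probability equals
    erfc(sqrt(stat / 2))."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p
```

For example, a log-likelihood gain of 2.57 over the reduced model gives the statistic 5.14 and p ≈ 0.023, matching the test reported for the quadratic age term. The log-likelihood values themselves would come from the fitted models (e.g. lme4's logLik in the actual analysis).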
For additional description of the data, we grouped the participants into four age groups, which we also compared to the adult group: children, 5-8 years old, n = 41 (25 females); early adolescents, 9-11 years old, n = 67 (28 females); mid-adolescents, 12-14 years old, n = 54 (30 females); late adolescents, 15-17 years old, n = 63 (38 females). There was no significant difference in gender balance between the groups (Chi-square = 5.88, df = 3, p = 0.12). For all comparisons between groups, we used the Holm-Bonferroni method to account for multiple comparisons. We also computed confusion matrices in these age groups, reflecting the patterns of misinterpretation among the different emotions: for each emotion portrayed, we computed the number of times it was misclassified as each of the other emotions.
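The Holm-Bonferroni step-down procedure used for the group comparisons can be sketched as follows (an illustration, not the authors' code; in the actual R analysis this is available via p.adjust with method = "holm"):

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down procedure over a family of m tests.
    P-values are examined from smallest to largest; the k-th smallest is
    compared to alpha / (m - k + 1), and testing stops at the first
    failure. Returns a list of booleans (reject H0?) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject
```

Compared with a plain Bonferroni correction, the step-down scheme is uniformly more powerful while still controlling the family-wise error rate.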
The datasets generated during the current study are available from the corresponding author upon request.