Congenital amusia is a disorder primarily of pitch perception and production that has a profound impact on musical processing, but only minor effects on speech processing (Ayotte, Peretz, & Hyde, 2002; Liu, Patel, Fourcin, & Stewart, 2010; Patel, 2008; Peretz, Ayotte, Zatorre, Mehler, Ahad, Penhune, & Jutras, 2002; Thompson, Marin, & Stewart, 2012). Recent research has suggested that the apparent domain specificity of congenital amusia can be explained partly by the following observations: First, individuals with congenital amusia only demonstrate reduced performance in speech processing when the pitch contrasts involved are relatively small (Hutchins, Gosselin, & Peretz, 2010; Jiang, Hamm, Lim, Kirk, & Yang, 2010; Liu et al., 2010; Liu, Jiang, Thompson, Xu, Yang, & Stewart, 2012; Nan, Sun, & Peretz, 2010; Patel, Wong, Foxton, Lochy, & Peretz, 2008); second, linguistic contexts and acoustic features other than pitch (e.g., duration, intensity) may provide additional cues for speech communication (Liu, Jiang, et al., 2012; Patel, Foxton, & Griffiths, 2005); and finally, the pitch-processing deficits in individuals with congenital amusia are more pronounced with discrete musical pitches than with gliding pitches in speech (Foxton, Dean, Gee, Peretz, & Griffiths, 2004; Liu, Xu, Patel, Francart, & Jiang, 2012). However, evidence is missing with regard to how the different functions of language and music may impact the domain specificity of congenital amusia. Pitch patterns in speech do not need to match a specified standard, but instead merely need to convey contrastive functional information (Xu 2005). By contrast, musical pitch must conform to specific conventions that apply to individual pitches as well as pitch patterns. In other words, the “form” taken by pitch patterns acts as a means of communication in speech, but is the intended end product for music (Patel, 2008). 
Understanding how musical versus linguistic pitch processing in congenital amusia is affected by the nature of music and language is useful for formulating a model of pitch processing in music and language that takes into account how impairments compromise auditory-processing skills in either domain. Considering four theoretical perspectives, in the present study we examined the characteristics of pitch and rhythm processing in speech versus song imitation in individuals with congenital amusia who speak a tone language, Mandarin.

The relationship between music and language

Much recent research has pointed to shared mechanisms between music and speech processing for individuals of different language and musical backgrounds (Bidelman, Gandour, & Krishnan, 2011; Hutchins, Gosselin, & Peretz, 2010; Jiang et al., 2010, 2012; Liu, Jiang, et al., 2012; Liu et al., 2010; Liu, Xu, et al., 2012; Mantell & Pfordresher, 2013; Nan et al., 2010; Patel, 2008, 2011, 2012a, 2012b; Pfordresher & Brown, 2009; Tillmann, Burnham, Nguyen, Grimault, Gosselin, & Peretz, 2011a; Tillmann, Rusconi, Traube, Butterworth, Umiltà, & Peretz, 2011b). However, the case study of a Polish-speaking poor-pitch singer (without pitch perceptual problems) in Dalla Bella, Berkowska, and Sowinski (2011) demonstrated domain-specific performance on pitch imitation in speech (intact) versus song (impaired). Although studies on a larger sample of English-speaking poor-pitch singers suggest that this dissociation may not generalize to the broader population of poor-pitch singers (Mantell & Pfordresher, 2013), it is unclear whether individuals with congenital amusia (with pitch perceptual problems) would demonstrate music-specific pitch production deficits, owing to the apparent domain specificity of congenital amusia (severely impaired musical perception; relatively spared speech perception), or whether domain-general pitch production deficits are associated with this disorder.

In the present study, we examined pitch production in congenital amusia through speech and song imitation among speakers of the tone language Mandarin, given that the Mandarin tonal system provides an ideal platform to match speech and song stimuli closely. Mandarin has four lexical tones and a neutral tone (Chen & Xu, 2006; Duanmu, 2007). In order to make speech stimuli most comparable to song stimuli (Liu, Xu, et al., 2012), only phonologically discrete Mandarin tones were used in the speech materials (i.e., high, Tone 1; low, Tone 3; and mid, the neutral tone). In order to account for the possible effects of interval size and rhythm pattern commonly used in speech versus music (Dowling & Harwood, 1986; Peretz & Hyde, 2003), we included two sets of song stimuli, one with pitch and rhythm patterns similar to those in speech (language-song hereafter), and the other closely resembling Western music (music-song hereafter). We predicted that individuals with congenital amusia would perform worse than controls on both speech and song imitation and that both groups would perform better on song imitation than speech imitation, due to the greater demand on pitch precision imposed by music than speech (Patel, 2008, 2011, 2012b) and because of the fact that when imitating speech materials, individuals tend to imitate the functional goal (e.g., statement/question in English) rather than the form of the utterances (Liu et al., 2010; Over & Gattis, 2010) unless instructed to focus on pitch (Mantell & Pfordresher, 2013).

Pitch/interval/contour processing in music and speech

Absolute pitch, interval (pitch distance between notes), and contour (pitch direction, up vs. down, between notes) play different roles in long-/short-term memory of melodies (Dowling & Bartlett, 1981; Dowling & Fujitani, 1971; Dowling & Harwood, 1986). Previous findings suggest that individuals with congenital amusia tended to produce pitches lower than the targets when imitating single pitches (Hutchins, Zarate, Zatorre, & Peretz, 2010), and they also showed pitch interval and contour errors in singing and pitch matching tasks (Dalla Bella, Giguère, & Peretz, 2009; Loui, Guenther, Mathys, & Schlaug, 2008; Wise, 2009). Although these individuals showed great difficulty recognizing, memorizing, and producing melodies without lyrics (Ayotte et al., 2002; Dalla Bella et al., 2009), it remains unclear which aspects of melodic processing underlie such difficulty: pitch, interval, and/or contour processing? The present study examined pitch/interval/contour processing in speech versus song imitation among Mandarin-speaking individuals with congenital amusia through detailed acoustic analyses. Given that imitation facilitates amusic singing (Tremblay-Champoux, Dalla Bella, Phillips-Silver, Lebrun, & Peretz, 2010) and automatic pitch processing is involved during imitation (Hutchins & Peretz, 2012a; Liu et al., 2010; Loui et al., 2008), we expected these individuals to perform at the normal level on contour processing, but to show reduced performance on pitch/interval processing in speech/song imitation.

Rhythm processing in music and speech

Previous studies have indicated that only around half of individuals with congenital amusia have impaired rhythm processing in music, in terms of their singing of a familiar song (Dalla Bella et al., 2009) and performance on the rhythm subtest of the Montreal Battery of Evaluation of Amusia (MBEA; Peretz, Champod, & Hyde, 2003), which consists of six subtests on scale, contour, interval, rhythm, meter, and memory processing of musical melodies. Other findings suggest that rhythm deficits in individuals with congenital amusia could only be revealed when test materials were musical (vs. noise bursts; Dalla Bella & Peretz, 2003) and when pitch variations were also involved (Foxton, Nandy, & Griffiths, 2006). The present study explored whether individuals with congenital amusia would show rhythm processing deficits in song and/or speech imitation. Given that rhythmic patterns in speech often resemble those in music (Patel, 2008; Patel, Iversen, & Rosenberg, 2006), on the basis of the findings in Foxton et al. (2006), we expected individuals with congenital amusia to demonstrate domain-general rhythm processing difficulties in speech and song imitation.

The relationship between perception and production

Congenital amusia presents a mixed picture in the relationship between musical perception and production: An association was found in some cases and dissociation in others. For example, when investigating singing in congenital amusia, Dalla Bella et al. (2009) found that for some individuals with congenital amusia, singing performance can be predicted by sensitivity to pitch changes: The poorer the pitch change detection, the worse the singing. However, in exceptional cases, proficient singing was also associated with severe perceptual deficits, and very poor singing with only mild perceptual deficits. An “action–perception mismatch” in congenital amusia was also observed in Loui et al. (2008), in which individuals with congenital amusia were able to imitate the correct direction of a heard pitch interval (intact production), despite their inability to report its direction (impaired perception). Nevertheless, a larger cohort of individuals with congenital amusia demonstrated mixed results in this regard (Williamson, Liu, Peryer, Grierson, & Stewart, 2012). Evidence of perception–production dissociation has also been seen in the speech domain, in which English- or French-speaking individuals with congenital amusia showed better performance on imitation than on identification/discrimination of speech intonation (Hutchins & Peretz, 2012a; Liu et al., 2010), and Mandarin-speaking individuals with congenital amusia demonstrated spared lexical tone production but impaired identification/discrimination (Nan et al., 2010). No research has yet examined whether the pitch perception deficit in congenital amusia would have similar effects on speech and music production when the linguistic and musical materials are closely matched.

In the present investigation, we compared the speech/song imitation abilities of individuals with congenital amusia with their scores on the Montreal Battery of Evaluation of Amusia (MBEA; Peretz et al., 2003) and their psychophysical perceptual thresholds for pitch change detection and pitch direction discrimination (see Table 1; see Liu, Jiang, et al., 2012, and Liu et al., 2010, for detailed task descriptions). Given that the deficits seen in the singing of individuals with congenital amusia cannot be solely attributable to their low-level pitch perception deficits (Dalla Bella et al., 2009), and given the demonstrated unconscious pitch processing during imitation as compared with the identification/discrimination of the same pitch events (Hutchins & Peretz, 2012a; Liu et al., 2010; Loui et al., 2008), we predicted that speech/song imitation in congenital amusia would be accounted for by both psychophysical pitch thresholds and melodic perception abilities.

Table 1 Characteristics of the amusic (n = 13) and control (n = 13) groups

Method

Participants

A group of 13 Mandarin-speaking individuals with congenital amusia and 13 matched controls were recruited via advertisements in the bulletin board systems in Beijing, China. The Montreal Battery of Evaluation of Amusia (MBEA) was used for diagnosis of congenital amusia in these individuals (Henry & McAuley, 2010; Peretz et al., 2003), with those having a total score of 65 or under (out of 90 trials) on the three pitch-based subtests (scale, contour, and interval) identified as having this musical disorder (Liu, Jiang, et al., 2012; Liu et al., 2010; Liu, Xu, et al., 2012; Thompson et al., 2012; Williamson & Stewart, 2010; Williamson et al., 2012). At the time of testing, all participants were enrolled as undergraduate or Master’s students at universities in Beijing with Mandarin Chinese as their native language. In the questionnaire regarding their music, language, and medical/biological background, none of the participants reported having learning or memory problems with their studies, or any neurological/psychiatric disorders or speech/hearing difficulties. None had received formal extracurricular musical training. Table 1 shows the characteristics of these participants, as well as their scores on the MBEA subtests (Peretz et al., 2003) and their thresholds for pitch change detection and pitch direction discrimination (Liu, Jiang, et al., 2012; Liu et al., 2010). As can be seen, the two groups were comparable on all background measures, but individuals with congenital amusia performed worse on the MBEA and obtained higher pitch thresholds than did controls.

Stimuli

As in the procedure followed by Mantell and Pfordresher (2013), the stimuli here were constructed by creating sung tones with pitch patterns and text settings derived from naturally produced speech. Natural speech and song stimuli were recorded in two separate sessions in a soundproof booth at Goldsmiths, University of London, by a 27-year-old Mandarin-speaking female student (the target speaker, hereafter). Born and raised in Beijing, the target speaker was an amateur singer/songwriter with 16 years of musical training. In total, 20 Mandarin sentences were used as the speech stimuli, each containing two to six syllables (Table 2).

Table 2 Stimuli used in the experiment

Acoustic analyses of the speech stimuli were done using Praat (Boersma & Weenink, 2011), with the F0 (fundamental frequency) and duration of each syllable being extracted using the ProsodyPro script (Xu, 2005–2012). Discrete complex-tone (F0 plus seven odd harmonics) analogues of these stimuli were then created with a custom-written Praat script, using the technique described in Patel, Peretz, Tramo, and Labreque (1998), to serve as the targets (to be imitated by the target speaker) of the language-song stimuli. These complex-tone sequences followed the rhythmic patterns of the speech stimuli but contained discrete pitches in the Western pitch class that were closest to the median F0s of the speech syllables.
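As an illustration only, the pitch-snapping step described above can be sketched in Python. The original stimuli were generated with a custom Praat script that is not reproduced here; the function name `nearest_chromatic_hz` and the A4 = 440 Hz tuning reference are assumptions made for this sketch.

```python
import math

# Assumed tuning reference; the source does not state which one was used.
A4_HZ = 440.0


def nearest_chromatic_hz(f0_hz: float, ref_hz: float = A4_HZ) -> float:
    """Return the equal-tempered chromatic pitch (in Hz) closest to f0_hz.

    Closeness is measured in semitones, i.e., on a logarithmic scale.
    """
    semitones_from_ref = 12 * math.log2(f0_hz / ref_hz)
    return ref_hz * 2 ** (round(semitones_from_ref) / 12)


# A syllable with a median F0 of 260 Hz would be mapped to roughly 261.6 Hz (C4).
snapped = nearest_chromatic_hz(260.0)
```

Rounding is done in semitones rather than in hertz because equal-tempered pitches are evenly spaced on a logarithmic frequency scale.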

In the follow-up recording session for the song stimuli, the target speaker was presented with the auditory stimuli of the complex-tone analogues and the written scripts of the speech materials (in Chinese) and was instructed to reproduce/sing the pitches of the complex tones on the speech syllables while her voice was recorded. During recording, the target speaker made spontaneous adjustments to the rhythmic patterns of the complex-tone analogues in order to make the production more song-like. This set of recordings led to language-song stimuli, which featured musical pitches in the chromatic scale that were nearest to the median F0s of the speech syllables. Consequently, these songs were atonal and contained larger pitch intervals than are common in Western music. In order to create tonal melodies that adhered more closely to the Western diatonic scale, the target speaker was requested to improvise and record another set of songs (music-song stimuli) that approximated the global melodic contours of the stimuli in speech and language song. Figure 1 shows the F0 (in semitones, or st) contours of a set of speech/song examples (accessible at https://sites.google.com/site/fangliuproject/sound-examples), in which black dots represent the target productions, and green diamonds and red squares represent imitations by a control (C08) and an amusic (A08) participant, respectively. As can be seen, whereas the rhythms in the language-song stimuli mirrored the speech rhythms but were proportionally slower, the music-song stimuli were mostly isochronous. It is worth noting that neither the language-song nor the music-song stimuli contained vibrato.

Fig. 1

F0 contours [in semitones, or st; st = 12 × log2(Hz), Hz = 2^(st/12), with 1 Hz as the reference frequency] of a set of speech/song examples produced by the target speaker (black dots) and participants C08 (green diamonds) and A08 (red squares): (a) speech, (b) language-song, and (c) music-song. The Mandarin sentence is 冬天的风? (“Dong1tian1 de0 feng1?”; “The wind in the winter?”), where 1 denotes the Mandarin high tone and 0 the neutral tone. Note that although the tones all had level pitches in the phonological/underlying forms, the surface representations may not be flat, because of various articulatory constraints (Xu & Wang, 2001)
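The semitone conversion given in the caption can be written out as a short Python sketch. The function names are hypothetical; the 1-Hz reference frequency comes from the caption.

```python
import math


def hz_to_st(hz: float, ref_hz: float = 1.0) -> float:
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12 * math.log2(hz / ref_hz)


def st_to_hz(st: float, ref_hz: float = 1.0) -> float:
    """Convert semitones relative to ref_hz back to a frequency in Hz."""
    return ref_hz * 2 ** (st / 12)


# Halving a frequency lowers it by 12 st (one octave).
octave_gap = hz_to_st(440.0) - hz_to_st(220.0)
```

On this scale, an octave shift corresponds to a constant offset of 12 st regardless of the starting frequency.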

In order for participants of different genders to imitate target stimuli of the same gender, the three sets of recorded speech/song stimuli were synthesized into natural-sounding female (preserving the absolute pitches and formant frequencies of the original recordings) and male (changing the original pitches to one octave lower and shifting the frequencies of the original formants by a factor of .78 so as to achieve male voice characteristics) voices, using the “change gender” command in Praat. None of the participants commented that either the female or the male voice sounded unnatural, and no significant differences were found in imitation performance between the participants of different genders for either the amusic or the control group. Therefore, the syntheses of the female/male target stimuli were unlikely to have caused any adverse effects on imitation performance.

Table 3 displays the acoustic characteristics of the three types of stimuli in the female target (the values in the male target were 12 semitones lower). Paired t tests (shown in Table 3) indicated that the speech and language-song stimuli did not differ significantly in absolute pitch (measured as the median F0 of each syllable rhyme, in semitones) and pitch interval (the absolute difference in median F0s between two consecutive syllable rhymes, in semitones), whereas the music-song stimuli generally had higher absolute pitches and smaller pitch intervals than did the speech and language-song stimuli. Whereas the speech stimuli on average had the shortest syllable durations (length of each syllable rhyme, in milliseconds) and interonset intervals (interval between the onsets of two consecutive syllable rhymes, in milliseconds; IOIs, hereafter), the music-song stimuli featured the longest syllable durations and IOIs.

Table 3 Acoustic characteristics of the three types of stimuli in the female target production

Procedure

The experiments were conducted in a quiet room at the Institute of Psychology, Chinese Academy of Sciences in Beijing, China. Ethical approval was granted by both Chinese Academy of Sciences and Goldsmiths, University of London. Written informed consent forms were obtained from all participants. In previous studies, English-speaking individuals with congenital amusia had shown normal laryngeal control (the contact phase regularity of vocal fold vibration) and pitch production (overall pitch range and pitch regularity) when reading a story and producing three sustained vowels (Liu et al., 2010), and lexical tone production of Mandarin-speaking individuals with congenital amusia achieved near-perfect recognition rates by native listeners (Nan et al., 2010). On the other hand, French- and English-speaking individuals with congenital amusia have been shown to have problems with singing (Ayotte et al., 2002; Dalla Bella et al., 2009; Tremblay-Champoux et al., 2010). On the basis of the expected level of difficulty of the tasks, the three imitation tasks were administered to the participants in the order of (1) speech imitation, (2) language-song imitation, and (3) music-song imitation, with the easiest task presented first. These tasks were separated by approximately 15-min gaps, during which the participants carried out the tone/intonation perception tasks as described in Liu, Jiang, et al. (2012). The fixed order of presentation of the speech/song imitation tasks was unlikely to have an impact on the present results for the following reasons. First, no consistent stimulus type effect was observed on imitation performance across the different measures reported in the Results section. 
Second, experiments that have been set up to look at change in performance across repetitions of the same trial have shown no effect of simple repetition (i.e., no improvement) on pitch matching (Hutchins & Peretz, 2012b) or speech/song imitation (Wisniewski, Mantell, & Pfordresher, in press).

The presentation of the target stimuli and the recording of the imitations were both done using Praat. Four practice trials, with items different from those in experimental trials, were given before the speech imitation task to familiarize the participants with the tasks and procedure. Speech/song stimuli were presented one at a time in pseudorandom order (the same across the three tasks) to the participants, who were required to imitate the pitch and time patterns of the utterances/melodies as closely as possible while their voice was recorded. The participants were encouraged to imitate each stimulus immediately following its presentation, although they could request that the experimenter (author F.L.) replay the stimulus if it was unclear, or repeat the imitation if there was disfluency (both of which rarely happened).

Data analysis

All acoustic analyses were conducted on the imitation data using the Praat script ProsodyPro (Xu, 2005–2012). Given that musical beats in singing are usually synchronized with the vowel onsets, rather than the initial consonants, of the sung notes (Sundberg & Bauer-Huppmann, 2007), syllable/note duration was calculated as the length of the syllable rhyme, and the onset of syllable rhyme was defined as the syllable/note onset time. The median F0s of the syllable rhymes were extracted in order to indicate pitch heights. Adapting the acoustic measurements in previous singing or pitch-matching studies (Dalla Bella et al., 2011; Dalla Bella, Giguère, & Peretz, 2007, 2009; Pfordresher & Brown, 2007; Pfordresher, Brown, Meier, Belyk, & Liotti, 2010; Ward & Burns, 1978), the following pitch and time variables were calculated so as to examine imitation accuracy of the participants.

The absolute pitch deviation (in semitones) of the imitated syllable/note from the target was the absolute difference in median F0 between the two. Each participant had 20 values for each stimulus type, averaged across two to six syllables/notes in the 20 utterances/melodies. The bigger the value, the less accurate the imitation in terms of absolute pitch matching.
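A minimal Python sketch of this measure for a single utterance/melody follows. It assumes that the median F0 of each syllable rhyme has already been extracted and converted to semitones; the function name and the example values are hypothetical, and the original analyses used ProsodyPro and R rather than this code.

```python
def absolute_pitch_deviation(imitated_st, target_st):
    """Mean absolute difference in median F0 (st) across syllables/notes.

    Both arguments are sequences with one value per syllable/note, in
    the same order. The result is one value per utterance/melody.
    """
    if len(imitated_st) != len(target_st) or not target_st:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = [abs(i - t) for i, t in zip(imitated_st, target_st)]
    return sum(diffs) / len(diffs)


# Hypothetical three-syllable utterance: deviations of 1, 0, and 2 st.
deviation = absolute_pitch_deviation([90, 92, 88], [91, 92, 90])
```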

The pitch interval deviation (in semitones) was the absolute difference between the imitated pitch interval (difference in median F0 between two consecutive syllables/notes; in absolute value) and the target pitch interval. Each participant had 20 values for each stimulus type, averaged across one to five intervals in the 20 utterances/melodies. The bigger the value, the less accurate the imitation in terms of relative pitch matching.

The signed interval deviation (in semitones) was the signed difference between each imitated interval (in absolute value) and the corresponding target interval (in absolute value). Each participant had 60 values for each stimulus type, as the 20 utterances/melodies contained 60 intervals in total. Negative deviations indicate interval compressions, and positive deviations suggest interval expansions. This measure was used for examining patterns of interval compressions/expansions across the three stimulus types for the amusic and control groups, and not for measuring imitation accuracy per se.
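The two interval-based deviation measures can be sketched together in Python, since they differ only in whether the sign of the discrepancy is kept. Function names and example values are hypothetical, and the input is assumed to be per-syllable median F0 in semitones.

```python
def unsigned_intervals(pitches_st):
    """Absolute pitch distance (st) between consecutive syllables/notes."""
    return [abs(b - a) for a, b in zip(pitches_st, pitches_st[1:])]


def signed_interval_deviations(imitated_st, target_st):
    """Imitated minus target interval size, one value per interval.

    Negative values indicate interval compression, positive values
    indicate interval expansion.
    """
    imitated = unsigned_intervals(imitated_st)
    target = unsigned_intervals(target_st)
    return [i - t for i, t in zip(imitated, target)]


def pitch_interval_deviation(imitated_st, target_st):
    """Mean absolute interval discrepancy for one utterance/melody."""
    signed = signed_interval_deviations(imitated_st, target_st)
    return sum(abs(d) for d in signed) / len(signed)


# Target intervals of 5 and 4 st imitated as 3 and 2 st: both compressed.
signed = signed_interval_deviations([90, 93, 91], [90, 95, 91])
unsigned = pitch_interval_deviation([90, 93, 91], [90, 95, 91])
```

Because interval sizes are compared in absolute value, the direction of each pitch change plays no role in either measure.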

The number of contour errors (out of the total 60 contours for each stimulus type) was the number of imitated pitch intervals that constituted different pitch directions (up, down, or level) than the corresponding target pitch intervals. Pitch direction was considered to be up or down if the difference in median F0 between two consecutive syllables/notes was higher or lower by one semitone or more. If the difference in median F0 between two consecutive syllables/notes was within one semitone, the two syllables/notes were considered to form a level/flat pitch direction.
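The direction rule above can be expressed as a short Python sketch. The function names and example values are hypothetical; the one-semitone criterion is taken from the definition in the text.

```python
def pitch_direction(prev_st, next_st, threshold_st=1.0):
    """Classify a pitch change as 'up', 'down', or 'level'.

    A change counts as up/down only if it is one semitone or more.
    """
    diff = next_st - prev_st
    if diff >= threshold_st:
        return "up"
    if diff <= -threshold_st:
        return "down"
    return "level"


def count_contour_errors(imitated_st, target_st):
    """Number of intervals whose direction differs from the target's."""
    imitated = [pitch_direction(a, b) for a, b in zip(imitated_st, imitated_st[1:])]
    target = [pitch_direction(a, b) for a, b in zip(target_st, target_st[1:])]
    return sum(i != t for i, t in zip(imitated, target))


# Target contour: up, level, down. Imitated contour: level, up, down.
errors = count_contour_errors([90, 90.5, 92, 90], [90, 92, 92, 89])
```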

The number of pitch interval errors (out of the total 60 intervals for each stimulus type) was the number of imitated pitch intervals that were larger or smaller than the corresponding target pitch intervals by one semitone. As in Dalla Bella et al. (2007, 2009, 2011), pitch interval errors were counted without considering whether pitch direction errors also occurred. Namely, imitated and target pitch intervals were compared using absolute values, which ignored co-occurring contour errors (if any).
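A hedged Python sketch of this count is given below. The text says "by one semitone"; the sketch assumes that this means a discrepancy of one semitone or more, and exposes the tolerance as a parameter because the exact boundary is not stated in the source. Names and example values are hypothetical.

```python
def count_interval_errors(imitated_st, target_st, tolerance_st=1.0):
    """Count intervals whose size differs from the target by >= tolerance_st.

    Interval sizes are compared in absolute value, so an interval that
    goes in the wrong direction but has the right size is not counted.
    The >= boundary is an assumption of this sketch.
    """
    imitated = [abs(b - a) for a, b in zip(imitated_st, imitated_st[1:])]
    target = [abs(b - a) for a, b in zip(target_st, target_st[1:])]
    return sum(abs(i - t) >= tolerance_st for i, t in zip(imitated, target))


# A 2-st rise imitated as a 2-st fall: a contour error, but no interval error.
wrong_direction_only = count_interval_errors([92, 90], [90, 92])
```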

The duration difference (in milliseconds) between the imitated syllable/note and the target was the absolute difference in rhyme length between the two. Each participant had 20 values for each stimulus type, averaged across two to six syllables/notes in the 20 utterances/melodies. The bigger the value, the less accurate the imitation in terms of absolute time matching.

The IOI difference (in milliseconds) was the absolute difference in interonset interval between two consecutive syllables/notes of the imitated and target productions. Each participant had 20 values for each stimulus type, averaged across one to five IOIs in the 20 utterances/melodies. The bigger the value, the less accurate the imitation in terms of relative time matching.
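The two timing measures can be sketched in Python as follows, assuming that rhyme durations and rhyme onset times (in milliseconds) are available for each syllable/note. Function names and example values are hypothetical.

```python
def duration_difference(imitated_ms, target_ms):
    """Mean absolute difference in rhyme duration for one utterance/melody."""
    diffs = [abs(i - t) for i, t in zip(imitated_ms, target_ms)]
    return sum(diffs) / len(diffs)


def interonset_intervals(onsets_ms):
    """Time between the onsets of consecutive syllable rhymes."""
    return [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]


def ioi_difference(imitated_onsets_ms, target_onsets_ms):
    """Mean absolute difference in IOI for one utterance/melody."""
    imitated = interonset_intervals(imitated_onsets_ms)
    target = interonset_intervals(target_onsets_ms)
    diffs = [abs(i - t) for i, t in zip(imitated, target)]
    return sum(diffs) / len(diffs)


# Target IOIs of 300 and 400 ms imitated as 350 and 350 ms.
ioi_dev = ioi_difference([0, 350, 700], [0, 300, 700])
```

Working from onset times rather than durations means that the IOI measure also reflects any pauses between syllables/notes.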

The number of time errors (out of the total 80 syllables/notes for each stimulus type) was the number of imitated syllables/notes that were at least 25% longer or shorter than the corresponding target syllables/notes (Dalla Bella et al., 2007, 2009).
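As a sketch, the 25% criterion can be applied to rhyme durations as follows. The function name and example values are hypothetical, and the proportion is assumed to be computed relative to the target duration.

```python
def count_time_errors(imitated_ms, target_ms, tolerance=0.25):
    """Count syllables/notes at least `tolerance` longer or shorter than target.

    The relative discrepancy is computed against the target duration,
    which is an assumption of this sketch.
    """
    return sum(
        abs(i - t) / t >= tolerance
        for i, t in zip(imitated_ms, target_ms)
    )


# Relative discrepancies of +25%, +20%, and -27.5%: two time errors.
time_errors = count_time_errors([250, 240, 290], [200, 200, 400])
```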

It is worth noting that, as compared with the absolute pitch deviation, pitch interval deviation, signed interval deviation, and duration/IOI differences, the numbers of contour/pitch interval/time errors are relative measures that are not necessarily affected by the characteristics of the target stimuli.

All statistical analyses were conducted using R (R Development Core Team, 2012). Two-way repeated measures analyses of variance (ANOVAs) were conducted to assess the main effects of group (amusic, control) and stimulus type (speech, language-song, music-song) on imitation accuracy, as well as their interaction. Effect sizes were calculated as generalized eta-squared (Bakeman, 2005). The “glht” function (“Simultaneous Tests for General Linear Hypotheses”) in the R package “multcomp” was used for post-hoc analyses, using Tukey contrasts for multiple comparisons of the means (Hothorn, Bretz, & Westfall, 2008). Kendall’s rank correlation tau (τ; two-sided) was used for the correlation analyses. In order to examine whether, and to what extent, target pitches/intervals (durations/IOIs for time variables) affected each group’s pitch/time matching accuracy, linear mixed-effects models were fit on the individual syllables/notes (which comprised 80 pitches/durations and 60 intervals/IOIs in total for each stimulus type) using the lme4 package for R (Baayen, Davidson, & Bates, 2008), with group (amusic, control) and target pitches/intervals/durations/IOIs as fixed effects, and individual participants and items as random effects. It is worth mentioning that it is inappropriate to use analysis of covariance to examine the effect of stimulus type with target pitches/intervals/durations/IOIs as covariates, since the two are not independent, and the assumption of the homogeneity of regression slopes for amusic and control groups was not met (Miller & Chapman, 2001).

Results

Absolute pitch deviation

Figure 2 shows box plots of the absolute pitch deviations (in semitones) of the two groups in the three imitation tasks, in which each participant had 20 values for each stimulus type, averaged across two to six syllables/notes in the utterances/melodies. The two-way repeated measures ANOVA revealed a significant main effect of group [F(1, 24) = 8.03, p = .009, ηp² = .24]. Individuals with congenital amusia produced significantly larger absolute pitch deviations than did controls across all three tasks (post-hoc comparisons between groups: speech, z = 2.47, p = .01; language-song, z = 2.93, p = .003; music-song, z = 2.87, p = .004). The main effect of stimulus type was also significant [F(2, 48) = 26.28, p < .001, ηp² = .07], with both groups showing better absolute pitch matching for language-song and music-song stimuli than for the speech stimuli (post-hoc analyses for individuals with congenital amusia: language-song vs. speech, z = −3.78, p < .001; music-song vs. speech, z = −3.81, p < .001; controls: language-song vs. speech, z = −6.91, p < .001; music-song vs. speech, z = −10.44, p < .001). Controls also showed better absolute pitch matching in music-song than in language-song imitation (z = −3.54, p = .001). No significant Group × Stimulus Type interaction was found for absolute pitch deviations.

Fig. 2

Absolute pitch deviations (in semitones) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Each participant had 20 values for each stimulus type, averaged across two to six syllables/notes in the 20 utterances/melodies. These box plots show the extreme of the lower whisker, the lower hinge of the box, the median, the upper hinge, and the extreme of the upper whisker. The two hinges are the first and third quartiles, and the whiskers extend to the most extreme data points, which are no more than 1.5 times the interquartile range from the box. The data points that lie beyond the extremes of the whiskers are outliers, denoted by small open circles
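The box-plot conventions described in the caption can be illustrated with a short Python sketch. It assumes that the hinges are computed as ordinary first and third quartiles with linear interpolation; the plotting software used for the figure may compute hinges slightly differently for some sample sizes. The function name and example data are hypothetical.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Return hinges, median, whisker ends, and outliers for one sample."""
    data = sorted(values)
    # Quartiles by linear interpolation (an assumption of this sketch).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "lower_whisker": inside[0],   # most extreme point within the fence
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical sample in which 30 lies beyond 1.5 times the interquartile range.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
```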

Linear mixed-effects models of the absolute pitch deviations (with group and target pitches as fixed effects and individual participants and items as random effects) indicated that target pitch heights were negatively associated with the absolute pitch deviations in all three tasks (speech, t = −6.00, p < .001; language-song, t = −2.72, p = .007; music-song, t = −2.45, p = .01; the higher the target pitch, the smaller the absolute pitch deviation). A significant Group × Target Pitch interaction was found for speech imitation (t = 2.13, p = .03; the negative effect of target pitch on absolute pitch deviation was stronger for individuals with congenital amusia than controls), but not for language-song or music-song imitation.

Pitch interval deviation

Figure 3 shows box plots of the pitch interval deviations (in semitones) of the two groups in the three tasks. The two-way repeated measures ANOVA revealed significant main effects of group [F(1, 24) = 12.76, p = .002, ηp² = .27] and stimulus type [F(2, 48) = 33.00, p < .001, ηp² = .30], but no Group × Stimulus Type interaction on the pitch interval deviations. Post-hoc analyses suggested that individuals with congenital amusia had significantly larger pitch interval deviations than did the controls across all three tasks (speech, z = 1.99, p = .047; language-song, z = 3.30, p < .001; music-song, z = 4.14, p < .001). Both groups showed better relative pitch matching for music-song than speech or language-song stimuli (individuals with congenital amusia: music-song vs. speech, z = −6.86, p < .001; music-song vs. language-song, z = −4.78, p < .001; controls: music-song vs. speech, z = −10.54, p < .001; music-song vs. language-song, z = −4.64, p < .001). Although controls also showed better relative pitch matching in language-song than speech imitation (z = −5.90, p < .001), this difference was only marginally significant for individuals with congenital amusia (z = −2.09, p = .09).

Fig. 3

Pitch interval deviations (in semitones) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Each participant had 20 values for each stimulus type, averaged across one to five intervals in the 20 utterances/melodies

Linear mixed-effects models of pitch interval deviations on group and target interval revealed a significant main effect of target interval (speech, t = 15.10, p < .001; language-song, t = 10.88, p < .001; music-song, t = 8.87, p < .001) and a significant Group × Target Interval interaction (speech, t = −3.96, p < .001; language-song, t = −5.61, p < .001; music-song, t = −3.60, p < .001) in all three tasks, as we observed a positive association between target interval and interval deviation (the larger the target interval, the greater the interval deviation), and this association was stronger in individuals with congenital amusia than in controls.

Signed interval deviation

Figure 4 shows the mean signed interval deviations (and standard errors) of the two groups against the target intervals (rounded to integers) for speech, language-song, and music-song imitation. Repeated measures ANOVAs were conducted on the signed interval deviations for each stimulus type, with Target Interval as the within-subjects factor and Group as the between-subjects factor. Given that, in theory, only interval expansions are possible for a target interval of 0 semitones (Dalla Bella et al., 2009), this interval size was not included in the ANOVA models. A significant group difference was found for language-song imitation [individuals with congenital amusia, mean (SD) = −.97 (1.76); controls, mean (SD) = −.46 (1.25); F(1, 24) = 5.29, p = .03, ηp² = .09], but not for speech or music-song imitation. We found a significant main effect of target interval for all three types of stimuli [speech, F(10, 240) = 26.07, p < .001, ηp² = .32; language-song, F(12, 288) = 48.15, p < .001, ηp² = .58; music-song, F(6, 144) = 70.67, p < .001, ηp² = .71], with larger target intervals generally leading to greater interval compressions in imitation. The Target Interval × Group interaction was also significant for all three types of stimuli [speech, F(10, 240) = 1.92, p = .04, ηp² = .03; language-song, F(12, 288) = 4.25, p < .001, ηp² = .06; music-song, F(6, 144) = 4.45, p < .001, ηp² = .11], as the two groups demonstrated different degrees of interval expansion/compression across the range of target interval sizes.

Fig. 4 Mean signed interval deviations (in semitones; with standard errors) of individuals with congenital amusia (red dashed lines) and controls (black straight lines) against target intervals (rounded to integers, in semitones) in speech (a), language-song (b), and music-song (c) imitation. Each participant had 60 values for each stimulus type, as the 20 utterances/melodies contained 60 intervals in total. Negative deviations indicate interval compressions, and positive deviations suggest interval expansions
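As an illustration of the measure, the following minimal sketch computes a signed interval deviation from fundamental-frequency values. It rests on stated assumptions rather than on the analysis scripts used in the study: intervals are converted to semitones as 12 × log2 of the frequency ratio, and the signed deviation is taken to be the imitated interval magnitude minus the target interval magnitude, so that negative values correspond to compression and positive values to expansion. The function names and the frequency values are hypothetical.

```python
import math


def interval_in_semitones(f_from_hz: float, f_to_hz: float) -> float:
    """Signed interval between two F0 values, in semitones (12 per octave)."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


def signed_interval_deviation(target_st: float, imitated_st: float) -> float:
    """Assumed definition: imitated magnitude minus target magnitude.

    Negative values indicate compression, positive values expansion.
    """
    return abs(imitated_st) - abs(target_st)


# Hypothetical example: the target rises one octave (220 Hz to 440 Hz),
# whereas the imitation rises only from 220 Hz to 392 Hz (roughly 10 semitones).
target_st = interval_in_semitones(220.0, 440.0)
imitated_st = interval_in_semitones(220.0, 392.0)
deviation_st = signed_interval_deviation(target_st, imitated_st)

# With a target interval of 0 semitones the deviation can never be negative,
# which is why that interval size cannot show compression.
unison_deviation_st = signed_interval_deviation(0.0, imitated_st)
```

Under this assumed definition, the example yields a negative deviation of about two semitones (a compression), whereas any imitation of a 0-semitone target yields a deviation of zero or above.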

Number of pitch interval errors

Figure 5 shows the numbers of pitch interval errors made by the two groups out of the total of 60 pitch intervals in each stimulus type. The two-way repeated measures ANOVA revealed significant main effects of group [F(1, 24) = 13.75, p = .001, ηp² = .28] and stimulus type [F(2, 48) = 26.63, p < .001, ηp² = .27], and a significant Group × Stimulus Type interaction [F(2, 48) = 4.11, p = .02, ηp² = .05], on numbers of pitch interval errors. Post-hoc analyses suggested that the group effect was significant for language-song (z = −3.14, p = .002) and music-song (z = −4.52, p < .001) imitation, but not for speech imitation. In addition, individuals with congenital amusia made fewer interval errors in music-song than in speech imitation (z = −3.04, p = .007), and controls’ pitch interval errors were significantly different across the three tasks (speech > language-song, z = 4.15, p < .001; speech > music-song, z = 7.51, p < .001; language-song > music-song, z = 3.36, p = .002).

Fig. 5 Numbers of pitch interval errors (out of the total number of 60 pitch intervals in the speech/song stimuli) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Individual participants (13 in each group) are represented by black dots, with those at the same horizontal level having identical values, and those lying beyond the whiskers being outliers (which are further indicated by open circles along the midline)

Generalized linear mixed-effects models revealed a positive association between number of pitch interval errors and target interval in all three tasks (speech, z = 9.27, p < .001; language-song, z = 7.23, p < .001; music-song, z = 5.00, p < .001), as both groups made more pitch interval errors when the target intervals were relatively large.

Number of contour errors

Figure 6 shows numbers of contour errors made by the two groups out of the total 60 contours in each stimulus type. No significant group effect or Group × Stimulus Type interaction was observed. The main effect of stimulus type was significant [F(2, 48) = 54.98, p < .001, ηp² = .48], as both groups made significantly fewer contour errors with music-song than with speech/language-song stimuli (individuals with congenital amusia: speech > music-song, z = 6.17, p < .001; language-song > music-song, z = 4.92, p < .001; controls: speech > music-song, z = 8.03, p < .001; language-song > music-song, z = 6.98, p < .001).

Fig. 6 Numbers of contour errors (out of the total number of 60 contours in the speech/song stimuli) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Individual participants (13 in each group) are represented by black dots, with those at the same horizontal level having identical values, and those lying beyond the whiskers being outliers (which are further indicated by open circles along the midline)

Generalized linear mixed-effects models revealed a negative association between numbers of contour errors and target intervals for both speech (z = −6.47, p < .001) and language-song (z = −6.03, p < .001) imitation, but not for music-song imitation. That is, for both groups, smaller pitch intervals in the target production were more likely to lead to contour errors in the imitation than were larger pitch intervals in both speech and language-song imitation, but not in music-song imitation.
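The dependence of contour errors on target interval size can be made concrete with a short sketch. It assumes, for illustration only, that a contour error is scored whenever the direction (rising, falling, or level) of the imitated interval differs from that of the target interval. The function names, the optional tolerance parameter, and the interval values are hypothetical and are not taken from the scoring procedure of the present study.

```python
def direction(interval_st: float, tolerance_st: float = 0.0) -> int:
    """Return +1 for a rising interval, -1 for a falling one, 0 for level."""
    if interval_st > tolerance_st:
        return 1
    if interval_st < -tolerance_st:
        return -1
    return 0


def count_contour_errors(targets_st, imitations_st, tolerance_st: float = 0.0) -> int:
    """Count intervals whose imitated direction differs from the target direction."""
    return sum(
        direction(t, tolerance_st) != direction(i, tolerance_st)
        for t, i in zip(targets_st, imitations_st)
    )


# Hypothetical signed intervals in semitones: two small and two large targets.
targets_st = [0.5, -0.8, 5.0, -7.0]
# The imitations all miss by roughly one semitone; only the small targets flip sign.
imitations_st = [-0.3, 0.4, 3.9, -5.5]

n_contour_errors = count_contour_errors(targets_st, imitations_st)
```

In this toy example, imitation errors of about one semitone reverse the direction of the two small target intervals but leave the direction of the two large ones intact, which is the pattern described above for speech and language-song imitation.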

Duration difference

Figure 7 illustrates the duration differences (in milliseconds) between target and imitation by the two groups in the three tasks. The two-way repeated measures ANOVA revealed a significant main effect of stimulus type [F(2, 48) = 137.07, p < .001, ηp² = .73] and a significant Group × Stimulus Type interaction [F(2, 48) = 3.79, p = .03, ηp² = .07] on duration differences. The main effect of group was only marginally significant [F(1, 24) = 4.22, p = .051, ηp² = .08]. Post-hoc analyses suggested that individuals with congenital amusia showed significantly larger duration differences than did controls in speech (z = 2.14, p = .03) and music-song (z = 2.15, p = .03) imitation, but not in language-song imitation. Both groups showed the smallest duration differences in speech imitation and the biggest duration differences in music-song imitation (individuals with congenital amusia: speech < language-song, z = −11.61, p < .001; speech < music-song, z = −32.77, p < .001; language-song < music-song, z = −21.16, p < .001; controls: speech < language-song, z = −13.83, p < .001; speech < music-song, z = −30.22, p < .001; language-song < music-song, z = −16.39, p < .001).

Fig. 7 Duration differences (in milliseconds) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Each participant had 20 values for each stimulus type, averaged across two to six syllables/notes in the 20 utterances/melodies

Linear mixed-effects models revealed a positive association between target duration and duration difference in all three tasks (speech, t = 7.97, p < .001; language-song, t = 15.31, p < .001; music-song, t = 15.53, p < .001), but this association was weaker for controls than for individuals with congenital amusia in music-song imitation (Group × Target Duration, t = −3.06, p = .002).

IOI difference

Figure 8 illustrates the IOI differences (in milliseconds) between target and imitation by the two groups in the three tasks. The two-way repeated measures ANOVA revealed significant main effects of group [F(1, 24) = 13.34, p = .001, ηp² = .15] and stimulus type [F(2, 48) = 52.96, p < .001, ηp² = .60] on the IOI differences, but no Group × Stimulus Type interaction. Post-hoc analyses suggested that individuals with congenital amusia showed significantly larger IOI differences than did controls in language-song (z = 2.27, p = .02) and music-song (z = 2.43, p = .02) imitation, but the difference was only marginally significant in speech imitation (z = 1.78, p = .08). Both groups showed the smallest IOI differences in speech imitation and the biggest IOI differences in music-song imitation (individuals with congenital amusia: speech < language-song, z = −3.96, p < .001; speech < music-song, z = −17.32, p < .001; language-song < music-song, z = −13.36, p < .001; controls: speech < language-song, z = −4.21, p < .001; speech < music-song, z = −15.63, p < .001; language-song < music-song, z = −11.42, p < .001).

Fig. 8 Interonset interval (IOI) differences (in milliseconds) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Each participant had 20 values for each stimulus type, averaged across one to five IOIs in the 20 utterances/melodies

Linear mixed-effects models revealed a positive association between target IOI and IOI difference in all three tasks (speech, t = 2.76, p = .006; language-song, t = 6.49, p < .001; music-song, t = 6.52, p < .001), and this association was weaker for controls than for individuals with congenital amusia in language-song imitation (Group × Target IOI, t = −2.34, p = .02).

Number of time errors

Figure 9 shows the numbers of time errors (out of 80) made by the two groups during speech/song imitation. The main effect of group was significant [F(1, 24) = 8.65, p = .007, ηp² = .17], as controls made fewer time errors than did individuals with congenital amusia across the three tasks (speech, z = −2.36, p = .02; language-song, z = −1.75, p = .08; music-song, z = −2.50, p = .01). No significant effect of stimulus type or Group × Stimulus Type interaction was observed.

Fig. 9 Numbers of time errors (out of 80) of individuals with congenital amusia and controls in speech (a), language-song (b), and music-song (c) imitation. Individual participants (13 in each group) are represented by black dots, with those at the same horizontal level having identical values, and those lying beyond the whiskers being outliers (which are further indicated by open circles along the midline)

Generalized linear mixed-effects models revealed a negative association between number of time errors and target duration for speech imitation (z = −2.94, p = .003; the shorter the target duration, the greater the number of time errors), but a positive association between number of time errors and target duration for music-song imitation (z = 6.70, p < .001; the longer the target duration, the greater the number of time errors). No significant effect of target duration on number of time errors was observed for language-song imitation.

Correlations between imitation performance, MBEA scores, and pitch thresholds in individuals with congenital amusia

To investigate the relationship between production and perception in congenital amusia, correlation analyses were conducted between imitation performance and MBEA scores in individuals with congenital amusia (controls’ data are omitted in the interest of space). Negative correlations indicate that better scores on the MBEA (i.e., number of correct responses out of 30) were associated with better speech/song imitation performance (i.e., smaller values of pitch/time variables). First, MBEA scale scores were negatively correlated with pitch interval deviations (τ = −.45, p = .04), number of pitch interval errors (τ = −.59, p = .006), and number of contour errors (τ = −.54, p = .02) in speech imitation. Second, negative correlations were observed between MBEA interval scores and duration difference (τ = −.50, p = .02) and number of time errors (τ = −.57, p = .01) in speech imitation. Third, MBEA rhythm scores were negatively associated with duration difference (τ = −.48, p = .03) and number of time errors (τ = −.49, p = .03) in speech imitation. Fourth, negative associations were observed between MBEA meter scores and pitch interval deviations (τ = −.48, p = .02) and number of pitch interval errors (τ = −.62, p = .004) in speech imitation, and between MBEA meter scores and duration difference (τ = −.56, p = .008), IOI difference (τ = −.48, p = .02), and number of time errors (τ = −.50, p = .02) in music-song imitation. Finally, MBEA memory scores were negatively correlated with absolute pitch deviations in speech (τ = −.43, p = .04) and music-song (τ = −.43, p = .04) imitation, with numbers of contour errors in music-song imitation (τ = −.58, p = .01), and with IOI difference in language-song imitation (τ = −.51, p = .02).

Correlation analyses between pitch thresholds and imitation performance in individuals with congenital amusia revealed a positive correlation between pitch direction discrimination thresholds and IOI difference in language-song imitation (τ = .61, p = .005): The higher (worse) the thresholds, the worse the relative time matching.
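For readers unfamiliar with rank correlations, the sketch below shows how a coefficient of this kind can be computed, assuming that τ denotes Kendall's rank correlation (here the tie-corrected tau-b variant, which is an assumption, as the variant is not specified above). The function name and the scores are hypothetical and serve only to illustrate the sign convention; they are not data from the present study.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y) -> float:
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: contributes to neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denominator if denominator else float("nan")


# Hypothetical scores for five participants: a perception score (higher is
# better) and a pitch interval deviation in semitones (smaller is better).
perception_scores = [15, 17, 18, 20, 22]
interval_deviations_st = [2.4, 2.1, 2.2, 1.5, 1.1]

tau = kendall_tau_b(perception_scores, interval_deviations_st)
```

With these toy values, nine of the ten participant pairs are discordant and one is concordant, giving a strongly negative τ: higher perception scores go with smaller deviations, which is the direction of association reported for the MBEA scores.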

Discussion

The relationship between music and language

In the present study, we investigated pitch and rhythm processing in speech versus song imitation among Mandarin-speaking individuals with and without congenital amusia in order to examine whether the functional differences between music and language would have any impact on pitch/rhythm processing in either domain. The finding of reduced speech and song imitation abilities in individuals with congenital amusia provides further evidence for shared mechanisms between music and language processing (Liu et al., 2010; Patel, 2008, 2012a). Given the important role that imitation plays in phonological development (Plaut & Kello, 1999), the observed speech imitation impairment in congenital amusia seems rather surprising, since it might potentially hinder the language (in this case, Mandarin) acquisition of these individuals. Nevertheless, individuals with congenital amusia rarely report language problems in everyday life (in English, Liu et al., 2010; in Mandarin, Jiang et al., 2010). The apparent paradox may be explained by the different natures of speech and music: Speech is function-driven, and music is form-driven (Patel, 2008). In particular, pitch patterns in speech are used for representing functional contrasts (e.g., lexical tone/stress, focus, sentence modality, etc.), and as such their execution only needs to satisfy contrastive adequacy (Xu, 2005). In music, by contrast, understanding and communication rely on pitch accuracy and aesthetics, which are obvious aspects to be perfected in performance (Patel, 2008, 2011, 2012b). Indeed, research has demonstrated that the exact control of F0 is “unnecessary” in speaking but “preferable” in singing (Natke, Donath, & Kalveram, 2003; Patel, 2012b).
The present results are consistent with such claims, in that imitation of song was generally more accurate than imitation of speech with respect to pitch and time for individuals with or without congenital amusia (see also the similar results of Mantell & Pfordresher, 2013, for English-speaking individuals who do not have congenital amusia). That is, although both groups had mean absolute pitch deviations and pitch interval deviations above one semitone in speech imitation, controls’ absolute/relative pitch deviations in song imitation were on average below one semitone, whereas those of individuals with congenital amusia were close to or above one semitone. Therefore, it seems that although neither group was very accurate in speech imitation, controls achieved increased accuracy for song imitation, whereas individuals with congenital amusia were unable to do so, as evidenced by the significant Group × Stimulus Type interaction on number of pitch interval errors in speech/song imitation (the group effect was only significant for language-song and music-song imitation, but not for speech imitation).

Pitch/interval/contour processing in music and speech

The individuals with congenital amusia in the present study showed reduced performance relative to controls on both absolute (absolute pitch deviation for all three tasks) and relative (pitch interval deviation for all three tasks, number of pitch interval errors for language-song and music-song) pitch matching in speech/song imitation. Acoustic analyses revealed a positive association between target interval and interval deviation (the larger the target interval, the greater the interval deviation), and this association was stronger in individuals with congenital amusia than in controls. This indicates that these individuals were more likely than controls to compress large pitch intervals in both speech and song imitation (as evidenced by the results on signed interval deviations in the two groups; Fig. 4).

The present findings also suggest that the reduced performance on speech/song imitation in individuals with congenital amusia was due mostly to inaccurate pitch and interval processing, but not to inaccurate contour processing. This is consistent with the results of Loui et al. (2008), but not with Dalla Bella et al. (2009) and Wise (2009). Note that Pfordresher and Brown (2007) also found no differences between good- and poor-pitch singers with respect to contour errors. This discrepancy may be due to task (imitation vs. singing from memory) or stimulus (lyrics vs. tones) differences among these studies.

Rhythm processing in music and speech

When singing a familiar song from memory, individuals with congenital amusia have been shown to perform similarly to controls in terms of tempo, number of time errors, and rubato consistency, although they showed greater temporal variability than did controls (Dalla Bella et al., 2009). In the present study of speech/song imitation, the group effect was found to be significant for both IOI difference (significant for language-song and music-song, and marginally significant for speech) and number of time errors (significant across the three tasks). Furthermore, the significant Group × Stimulus Type interaction on duration difference (the main effect of group was marginally significant) reflects a significant group effect on duration differences in speech and music-song imitation, but not in language-song imitation. It is worth noting that IOI differences in the present study measured localized relative time matching between individual imitated IOIs and target IOIs, a measure that is equivalent to neither the tempo measure (mean IOI of the quarter note) nor the temporal variability measure (coefficient of variation of quarter-note IOIs) used by Dalla Bella et al. (2009). The discrepancy between the present findings and those in Dalla Bella et al. (2009) concerning the number of time errors made by individuals with congenital amusia may result from the familiarity of the song materials used in the two studies. Namely, it may be that reduced time-matching abilities in individuals with congenital amusia are more likely to be revealed when singing or imitating unfamiliar speech/song materials (the present study) than when singing or imitating familiar ones (Dalla Bella et al., 2009).

It is worth noting that although both groups in the present study showed greater duration and IOI differences in song than in speech imitation (music-song > language-song > speech), this finding may not necessarily imply that absolute and relative time matching during speech imitation were superior to song imitation for both groups. Two of our results motivate this interpretation. First, no significant main effect of stimulus type was observed on number of time errors (a relative measure of time matching) for either group. Second, we found a positive association between the target duration/IOI and duration/IOI differences in all three tasks. Given that music-song stimuli contained the longest target durations/IOIs among the three stimulus types (Table 3), it is likely that the largest duration/IOI differences in music-song imitation were caused by the positive associations between target durations/IOIs and duration/IOI differences. This effect simply replicates the well-known association between target duration and timing variability in production (e.g., Wing & Kristofferson, 1973). Interestingly, the association between target duration and duration difference was weaker for controls than it was for individuals with congenital amusia within the music-song imitation condition. Similarly, the association between target IOI and IOI difference was weaker for controls than for individuals with congenital amusia within the language-song imitation condition. The fact that control participants were less strongly affected by target duration may partly explain their superior performance on time matching in speech/song imitation.

The relationship between perception and production

The extent to which singing and pitch-matching abilities can be predicted by pitch perception thresholds is a debated issue (Amir, Amir, & Kishon-Rabin, 2003; Bradshaw & McHenry, 2005; Dalla Bella et al., 2007, 2009; Hutchins & Peretz, 2011, 2012b; Nikjeh, Lister, & Frisch, 2009; Pfordresher & Brown, 2007). Upon observing the complex pitch production and perception associations and dissociations in congenital amusia, Dalla Bella et al. (2009) concluded that amusic singing could not be accounted for by a fine-grained pitch discrimination deficit alone. The results from the present study further support this conclusion, as the speech/song imitation performance of individuals with congenital amusia was largely associated with their scores on the MBEA melodic perception tests, but rarely with their pitch change detection or pitch direction discrimination thresholds (see the details in the Results section). Acoustic analyses of the speech/song imitation data suggest that, although like controls, individuals with congenital amusia were more likely to make contour errors on smaller target intervals than on larger ones (especially in speech and language-song imitation), both groups made more pitch interval errors when the target intervals were relatively large (across the three tasks). Furthermore, individuals with congenital amusia showed a stronger positive association between target interval and pitch interval deviation than did controls: The larger the target interval, the greater the pitch interval deviation (mostly due to interval compression, as in Dalla Bella et al., 2009; Pfordresher & Brown, 2007). These findings indicate that the pitch imitation deficits of individuals with congenital amusia cannot be explained solely by their impaired abilities to discriminate fine-grained pitch changes, and thus the core deficit of amusia may go beyond low-level pitch-processing impairments (Patel, 2008; Patel et al., 2005).

Finally, although the present study did not measure cognitive abilities such as working memory capacity, the findings are unlikely to have resulted from the possible differences in cognitive ability between the two groups for the following reasons. First, individuals with congenital amusia demonstrated working memory capacities comparable to those of controls (Williamson & Stewart, 2010). Second, ranging from two to six syllables/notes, our speech/song stimuli were relatively short verbal sound sequences, for which individuals with congenital amusia show normal short-term memory (Tillmann, Schulze, & Foxton, 2009). However, future studies will be required in order to explore whether individual differences in cognitive abilities such as working memory are associated with musical and speech-processing abilities in congenital amusia.

Conclusion

The present study is the first to report the reduced speech and song imitation abilities of individuals with congenital amusia, despite the fact that these individuals are proficient speakers of a tone language, Mandarin. The domain-general pitch/time production deficits in congenital amusia provide a new line of evidence for the shared mechanisms in pitch/time processing between language and music. However, similar to controls, individuals with congenital amusia demonstrated better pitch matching in song than in speech imitation, suggesting that the apparent domain specificity of congenital amusia may partly be due to the different functions that music and language serve in everyday life. That is, pitch patterns in speech are used for representing functional contrasts. For music, pitch accuracy is a crucial requirement for musical communication. Therefore, although individuals with congenital amusia are able to imitate pitches more accurately in singing than in speaking, the degree of precision is still not enough for music processing, but is already sufficient for speech processing.