Defining “happy” in happy birthday: Singing accuracy a construct based on range and intervals

Singing in tune, or singing accuracy, is a construct dependent on genre, key selection, singers’ ranges, and listener expectations. But across genres, singers are expected to sing in tune with themselves and with others. Because singing a familiar song may be the most common communal singing situation in the lives of non-professional singers and students, we chose to examine previously reported data on the Happy Birthday song as a typical example of a globally popular song that has been used in previous research. The participants were adult non-professional singers. We designed a predictive model of performance on the large ascending octave based on initial interval performance and explored variables such as range, voice register, and key modulation, providing implications for the music education of children and adults. We conclude that while initial performance may predict subsequent performance, features such as singer’s range and starting pitch may also account for variability in individual performance.

scales for singing accuracy, which emphasize degree of tunefulness and ability to follow a melodic contour (Wise & Sloboda, 2008). Researchers using acoustic measures must choose thresholds of cent deviation to categorize in-tune singing, and have chosen demarcations of 25 cents (e.g., Nichols & Wang, 2016) or even 50 cents (e.g., Greenspon et al., 2017), though musicians may vary in tuning expectations according to levels of performance (youths vs. professionals) and cultures (tuning systems).
Previous research introduced two specific aspects of poor pitch singing: singing accuracy and singing precision. Singing accuracy referred to the average difference between sung and target pitches. Singing precision, by contrast, referred to the consistency of repeated attempts to produce a pitch. Singing accuracy and precision measures were found to be correlated, and analysis suggested that accuracy predicted precision more so than the converse. Results from one study indicated that singers were more accurate than they were precise, and most of the participants were categorized as imprecise singers (Pfordresher et al., 2010). Some frequently applied techniques for assessing singing proficiency included different tasks (e.g., singing from memory vs. imitation), thresholds for categorization including 25 cents, 50 cents, or 100 cents, or variable criteria, and measures (accuracy vs. precision), and imprecise singing was not always associated with inaccurate singing. In any study, the number of poor-pitch singers could vary as a function of the measure of singing proficiency as well as the threshold employed for defining poor-pitch singers (e.g., Dalla Bella, 2015).
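The distinction between accuracy and precision can be illustrated with a minimal sketch. It assumes one common operationalization (accuracy as the mean signed error, precision as the sample standard deviation of repeated attempts); the function name and the attempt values are hypothetical and are not taken from the cited studies.

```python
import statistics

def accuracy_and_precision(sung_cents, target_cents):
    """Return (accuracy, precision) for repeated attempts at one target pitch.

    Assumed definitions (illustrative only):
      accuracy  = mean signed error in cents (0 means on target on average)
      precision = sample standard deviation of the errors (0 means consistent)
    """
    errors = [s - target_cents for s in sung_cents]
    accuracy = statistics.mean(errors)
    precision = statistics.stdev(errors)
    return accuracy, precision

# Hypothetical attempts, in cents relative to the target pitch:
# close to the target on average, but widely scattered.
attempts = [-40.0, 35.0, -30.0, 45.0]
acc, prec = accuracy_and_precision(attempts, 0.0)
print(f"accuracy = {acc:.1f} cents, precision = {prec:.1f} cents")
```

In this invented case the singer would count as accurate on average but imprecise, which is the pattern described above for most participants in Pfordresher et al. (2010).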
Correlates of singing accuracy, such as age or gender, are presumed to exist in the singing development of children. Previous research focusing on the pitch-matching ability of students in Grades 1 through 6 in response to adult female, adult male, and child vocal models showed that the first-grade and sixth-grade children sang the fewest "correct" responses for all three models. When echoing an adult female model, both girls and boys sang more correct responses; also, more correct responses to the child model suggested children tended to match a voice similar to their own (Green, 1990). Longitudinal data on children's singing ability at ages 5, 6, and 7 years (Welch et al., 1997) indicated that a gender difference appeared at age 7, in which girls performed slightly better than boys. Overall, such differences are dependent on task type in regard to pitch matching vs. song singing. Most importantly, specific intervals have been shown to evince varying levels of difficulty (Wolf, 2005), though it is challenging to predict song difficulty based on individual intervals.
For adults, self-reports of singing performance may be misleading: many people who claimed to have difficulties in singing and who were uncomfortable when they had to sing at public gatherings such as birthday parties or holiday celebrations may actually sing accurately (Pfordresher & Brown, 2007). Others described a similar discrepancy: in spite of increasing evidence that singing is natural, widespread, and probably deeply rooted in human biology, adults tended to underestimate their ability to sing in tune (Dalla Bella, 2015). Children's modulations during the singing of the Happy Birthday song were more a result of vocal discomfort than an inability to sing with a tonal center (Price, 2000), suggesting that modulations and occasional mistunings may be the result of song features more often than inaccurate singing: performance ability is variably indicated by the intervals or song chosen and the tonal center chosen relative to an individual's voice range. Indeed, performance is highly dependent on the task presented (Nichols, 2016), and a voice range appropriate to the key selection and song range is required.
Several explanations exist to describe why individuals may not sing desirably in tune. Individuals may mis-learn pitch sequences, though many songs exist for ceremonial, patriotic and religious purposes that are heard across the lifespan. Individuals may learn inaccurate information as quickly as they learn accurate information (as discussed in Yarbrough et al., 1991). Presumably, such familiar songs are heard correctly, though among these, some songs are more frequently heard by accurate soloists, such as the national anthem at a ballgame. Researchers have suggested that it was plausible that poor-pitch singers mis-produce the intervals between pitches in actual songs in addition to difficulty in vocally matching individual pitches; furthermore, poor-pitch adults also appeared to be hindered rather than helped by singing with correct accompaniment (Pfordresher & Brown, 2007). Finally, children have sung more accurately doubled by another voice than solo (Nichols, 2016) and doubled by another voice than by a piano timbre (Nichols, 2020). In sum, individuals acquire song material in different ways and are helped or hindered by various contexts.
The assumption that singing a familiar song would evince more accurate singing than an unfamiliar song is faulty if the familiar song was not learned accurately. Indeed, there were no significant differences found when students were compared singing a long-familiar song with singing a newly learned song when taught accurately (Guerrini, 2006). Furthermore, children did not seem to improve their performance when singing a familiar song over time, though other evidence supports the general improvement of pitch matching accuracy in young children over time (Demorest et al., 2018). Additionally, tonal patterns presented in a melodic or harmonic context were performed more accurately than those presented in isolation (Wolf, 2005). Based on those results, major and minor modality, melodic contour, and harmonic function (tonic and dominant) did not determine the difficulty of a tonal performance pattern. The difficulty was determined by intervallic relationships, length of patterns, and range. Specifically, children's performance was more accurate on easy tonal performance patterns which included low-range, two-notes, and thirds intervallic relationship. It is unclear how and why popular, familiar tunes are not performed well, though the specific features of a song offer implications for its difficulty level.
For singing any criterion song, the selection of a key center, which determines the placement of a song's range in the singer's own range, represents a confound for singing accuracy. Without a reference pitch, participants preferred to sing familiar songs near the lower end of their vocal range, regardless of the original key in which they learned the songs (Moore, 1991). When tessituras were compared to vocal ranges, the top quarter of the singing range tended to be rarely used, if at all. Data indicated that all subjects from Moore's research, regardless of age, might unwittingly sing familiar songs in the bottom part of their potential singing range, that is, 5 semitones above their lowest vocal limit and 10 semitones below their highest vocal limit, unless they were reminded to lift their voices into the upper register. Adult subjects sang significantly closer to the lower than upper end of their singing range. The implications of individuals' range, key selection, and unique pitch sequences in songs are important to group singing whether or not the group seeks to establish a common key.
Because singing a familiar song may be the most common communal singing situation in the lives of non-professional singers and students, we chose to examine the Happy Birthday song as a typical example of a globally popular song that has been used in previous research (e.g., Pfordresher et al., 2010) and that includes a complex melody requiring a wide vocal range. Populations all over the world use the same melody in different languages. The song has also been chosen by many other researchers. For example, "Happy Birthday" was used to examine interval matching among undergraduate non-music majors (Price, 2000). When people sing in groups for celebration, they may start in a key of their own choosing, and thus likely not in the same key, and Price reported a number of "modulating singers" who did not maintain the key center for the duration of the song. Conversely, adult singers who sing regularly in an elective choir were shown to do this with great accuracy (Nichols & Liu, 2022).
The purpose of this study was to explore the patterns of performance in one song, Happy Birthday, based on a finding that singers tend to compress intervals (per Pfordresher & Brown, 2007; Pfordresher et al., 2010), and the hypothesis that this would be true especially for the large ascending interval in the song. The purpose was not to provide estimates of singing proficiency in child or adult populations nor to explain why this criterion song may be difficult to perform. Our focus is exploring the patterns of performance of a familiar song grounded in previous research indicating that ascending intervals higher in the range were more difficult for children in a study of pitch interval and pattern performance (Wolf, 2005). We were guided by the following question: How does performance by individuals vary by interval type and position in the song? Thus, we aimed to explore how overall performance can be predicted by early performance in the melody and whether the compression of specific intervals affects subsequent performance. The research questions guiding this report of a re-analysis in an ongoing line of research were as follows: 1. How often is the large 14th interval sung out of tune? 2. How does the performance of the 14th interval influence subsequent performance? 3. What specific intervals can be used to predict performance?

Data source
We chose to evaluate a subset of previously reported data from two studies in which the song Happy Birthday was used (Greenspon et al., 2017; Pfordresher & Brown, 2007). We requested a combined dataset in 2018-2019, and the data we analyzed included linear interval deviation scores in cents (100 cents corresponding to one equal-tempered half step). Our data included only the cent deviation data for 37 participants; no raw audio was included. At the time of data collection, participants were asked to complete a warm-up phase including vocal "sweeps" and songs such as Happy Birthday, followed by vocal pitch matching and pitch discrimination tasks.

Participants
As reported in the previous literature, the participants were adult non-professional singers recruited from an undergraduate recruitment pool. Participants were recruited using campus flyers and were compensated with credit in an undergraduate psychology course or paid hourly for their time. Participants reportedly indicated normal hearing and no voice pathology. The dataset included singers of all genders, though the number of each was unknown to us (see Data Confirmation below).

Analysis
We examined each of the 25 sung pitches in the song from 37 individuals. We explored pitch performance in two forms. First, we explored pitch values anchored to each individual's first pitch, from which each of the subsequent 24 pitches was assigned a deviation score relative to its expected pitch. Additionally, we explored interval scores as cent deviations from the expected (target) interval in each case. No assumptions can be made for whether the present subset of data we received reflects the overall distribution of demographic values in the previous studies.
The dataset consisted of deviation scores in cents for notes of the song compared to the expected pitch in each instance and deviation scores in cents for intervals in the song; the song contains 24 intervals across 25 pitches. The deviation of the first pitch was conceptualized as a starting pitch receiving a deviation score of zero. The expected interval between the first and second note is a unison, so deviation scores were expected to be zero, but naturally they varied by small and large degrees in the sample. A score is provided in Figure 1 with a mapping of note and interval reference numbers. Two versions of the dataset, one for individual pitches and one for individual intervals, were used in the following analysis. Unsigned deviation scores were used for descriptives, and signed deviation scores were used to explore a predictive model for accurate singing. One participant had a missing value, and we substituted a mean value for this score.
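As an illustration of how interval deviation scores of this kind can be derived, the following minimal sketch converts a sequence of sung frequencies into signed cent deviations from the target intervals. It is not the scoring procedure of the original studies, which is not described here. The melody encoding (semitones above the starting note, following the commonly notated tune) and all function names are assumptions made for this example.

```python
import math

# Assumed encoding of the melody: semitones above the first note,
# 25 notes in four phrases (commonly notated version of the tune).
TARGET_SEMITONES = [0, 0, 2, 0, 5, 4,
                    0, 0, 2, 0, 7, 5,
                    0, 0, 12, 9, 5, 4, 2,
                    10, 10, 9, 5, 7, 5]

def cents_between(f1_hz, f2_hz):
    """Signed size of the interval from f1 to f2 in cents (1200 per octave)."""
    return 1200.0 * math.log2(f2_hz / f1_hz)

def target_intervals_cents(semitones=TARGET_SEMITONES):
    """The 24 target intervals in cents (100 cents per half step)."""
    return [100.0 * (b - a) for a, b in zip(semitones, semitones[1:])]

def interval_deviations(sung_hz, semitones=TARGET_SEMITONES):
    """Signed deviation (sung minus target) for each successive interval."""
    sung = [cents_between(a, b) for a, b in zip(sung_hz, sung_hz[1:])]
    return [s - t for s, t in zip(sung, target_intervals_cents(semitones))]

targets = target_intervals_cents()
print(len(targets), "intervals; interval 14 target =", targets[13], "cents")
```

Under this encoding, intervals 1, 7, 13, and 20 are unisons, interval 10 is an ascending fifth (700 cents), and interval 14 is the ascending octave (1200 cents), which agrees with the interval descriptions in the text. A negative deviation indicates a compressed ascending interval.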

Data confirmation
In a separate analysis, we explored anonymized audio files from a different subset of individuals we received (N = 110) but who were not included in the acoustic analysis. We used this analysis to confirm the primary dataset was valid for exploring the singing of adult participants who were generally able to complete the singing task. Using the raw singing files, we identified the specific frequency (Hz) of the starting pitch. As expected, starting pitches indicated ranges from all genders and potential voice classifications (treble vs. bass clef singers) who generally began low in the range (Moore, 1991) and did not end in the same key they began. A review of raw audio files suggested a possibility for less reliable data for certain pitches such as the initial pitch on the syllable "Hap-" due to short rhythmic duration from an observed tendency for singers to pause on the second consonant, "p," truncating the first note of each phrase: The tendency for singers not to phonate the full duration of the first note resulted in a short rhythmic value which recurred in other iterations of the same syllable later in the song. In general, the primary dataset of pitch deviation scores plus the audio files from a different subset of the data leads us to confirm this was a population of collegiate students both with and without previous musical experience as reported in the original studies.

How often is the large 14th interval sung out of tune?
The 14th interval represents the largest ascending interval in the Happy Birthday song. Our primary interest was in performance of this interval, which we theorized to be more difficult to perform accurately. Only one participant sang a larger interval than was intended; where a 1200 cent change was expected corresponding to an ascending octave interval, this participant sang an interval of 1288 cents. The remaining participants (n = 36) performed the interval by compressing it (see Figure 2). Using a threshold of 50 cents for in-tune singing, 34 participants (out of 37) sang out of tune on the 14th interval.
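The categorization described above can be sketched as follows. The function name is hypothetical, and treating a deviation exactly equal to the threshold as in tune is an assumption; the text does not state how boundary values were handled.

```python
def classify_interval(sung_cents, target_cents=1200.0, threshold=50.0):
    """Classify one sung interval against its target.

    Returns (signed deviation, direction, in_tune).
    Assumption: |deviation| <= threshold counts as in tune.
    """
    deviation = sung_cents - target_cents
    if deviation < 0:
        direction = "compressed"
    elif deviation > 0:
        direction = "expanded"
    else:
        direction = "exact"
    in_tune = abs(deviation) <= threshold
    return deviation, direction, in_tune

# The one participant who sang 1288 cents where 1200 cents were expected.
print(classify_interval(1288.0))  # (88.0, 'expanded', False)
```

The same function can be called with `threshold=25.0` to apply the stricter 25-cent criterion used in some previous research.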

How does the performance of the 14th interval influence performance on subsequent pitches?
The mean of unsigned deviations increased by 19.59 cents after the 14th interval, and the standard deviation increased by 54.20 cents. The median, however, decreased by 2.37 cents. Based on our preliminary analyses indicating reliability concerns in the scoring of the initial syllable in each phrase ("Hap-"), we removed intervals 1, 7, 13, and 20 and re-calculated the data. Using this revised procedure, the mean of unsigned deviations after the 14th interval increased more, by 23.49 cents, and the standard deviation increased by 58.41 cents (median increase of 1.44 cents). We chose a nonparametric bootstrap approach using median values. We resampled the data with replacement 5000 times, and each time took a difference in medians (the absolute value of imitation errors before the 14th interval minus the absolute value of imitation errors after the 14th interval). The median of these 5000 differences in medians was 0.69 cents. The 95% confidence interval for the difference between performance before and after the 14th interval ranged from -10.39 to 10.75 cents; because this interval includes zero, the difference in medians was not significant. We observed that the majority of the singers performed almost the same whether an interval was before or after the 14th interval, though after the 14th interval we found more extreme values.
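A bootstrap of this kind can be sketched as below. This is an illustration under stated assumptions, not a reproduction of the analysis: the text does not specify the resampling unit, so the sketch resamples participants (keeping each participant's before and after scores together), pools their unsigned deviations, and uses a simple percentile interval. The function name and the synthetic data are hypothetical.

```python
import random
import statistics

def bootstrap_median_difference(before, after, n_boot=5000, seed=0):
    """Percentile bootstrap for median(before) - median(after).

    before, after: one list of unsigned deviations (cents) per participant.
    Assumption: participants are the resampling unit.
    Returns (median of bootstrap differences, (lower, upper) 95% bounds).
    """
    rng = random.Random(seed)
    n = len(before)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        pooled_before = [x for i in idx for x in before[i]]
        pooled_after = [x for i in idx for x in after[i]]
        diffs.append(statistics.median(pooled_before)
                     - statistics.median(pooled_after))
    diffs.sort()
    lower = diffs[(n_boot * 25) // 1000]       # 2.5th percentile
    upper = diffs[(n_boot * 975) // 1000 - 1]  # 97.5th percentile
    return statistics.median(diffs), (lower, upper)

# Synthetic unsigned deviations for 37 hypothetical singers:
# 13 intervals before and 10 intervals after the 14th interval.
rng = random.Random(42)
before = [[abs(rng.gauss(0, 60)) for _ in range(13)] for _ in range(37)]
after = [[abs(rng.gauss(0, 60)) for _ in range(10)] for _ in range(37)]
median_diff, (ci_low, ci_high) = bootstrap_median_difference(
    before, after, n_boot=1000)
print(f"median difference = {median_diff:.2f}, "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

If the resulting interval contains zero, the difference in medians would not be considered significant at the .05 level, which is the reasoning applied to the reported interval above.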

What intervals can be used to predict performance?
To determine the predictive value of initial performance for performance on the ascending octave in the 14th interval, we explored a best-fit model from the initial 13 intervals using the signed data. We tested the assumptions of multiple regression analysis (Tabachnick & Fidell, 2007). The normal Q-Q plot of standardized residuals indicated a violation of the normality assumption, as participant 20 did not follow the normal line; also, the Shapiro test was significant (p < .001). After removing participant 20, we proceeded with a final model constituted by the 4th, 7th, 10th, and 13th intervals, which meets the k * 7 threshold for independent variables for our sample size. Both the normal Q-Q plot and the Shapiro test suggested errors were normally distributed. Variance Inflation Factors were greater than 0.05 and less than 10, indicating the absence of multicollinearity. The model of 4th, 7th, 10th, and 13th intervals explained 65% of the variance (adjusted) of 14th interval performance, F(4, 31) = 16.96, p < .001 (see Tables 1 and 2). With a one-cent increase in the 4th interval, performance on the 14th interval decreased by 1.00 cent. A one-cent increase in the 7th interval increased the 14th interval by 0.84 cents. A one-cent increase in the 10th interval resulted in a 1.65-cent increase in the 14th interval. A one-cent increase in the 13th interval resulted in a decrease of the 14th interval by 0.90 cents.
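The structure of such a model can be sketched with ordinary least squares in plain Python. Everything below is illustrative: the data are synthetic, the slopes reported above are reused only to generate that synthetic data, and the intercept and noise level are invented. The function names are hypothetical; in practice a statistics package would normally be used.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(X, y):
    """Return ([intercept, slopes...], R^2, adjusted R^2)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), k - 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, r2, adj_r2

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing predictor j on the others."""
    vifs = []
    for j in range(len(X[0])):
        target = [row[j] for row in X]
        others = [[v for i, v in enumerate(row) if i != j] for row in X]
        _, r2, _ = fit_ols(others, target)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic signed deviations (cents) on intervals 4, 7, 10, and 13
# for 36 hypothetical singers; intercept and noise are invented.
rng = random.Random(1)
X = [[rng.gauss(0, 40) for _ in range(4)] for _ in range(36)]
y = [-100.0 - 1.00 * r[0] + 0.84 * r[1] + 1.65 * r[2] - 0.90 * r[3]
     + rng.gauss(0, 50) for r in X]
beta, r2, adj_r2 = fit_ols(X, y)
print("coefficients:", [round(b, 2) for b in beta])
print("adjusted R^2:", round(adj_r2, 2))
print("VIFs:", [round(v, 2) for v in variance_inflation_factors(X)])
```

Each slope is read as the expected change, in cents, in the signed deviation on the 14th interval for a one-cent change in that predictor, holding the other three intervals constant.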

Discussion
Our finding is that certain early intervals in the criterion song can be used to predict overall performance. While the large ascending 14th interval did not significantly affect performance after that interval, we did find more extreme values after the 14th interval. The compression of larger intervals we expected was realized in our re-analysis (per Pfordresher & Brown, 2007; Pfordresher et al., 2010); however, every song contains unique features, and the present criterion song, with its large ascending octave interval, was used in a population of non-singers, which we would expect to differ from a population of habitual singers (e.g., Nichols & Liu, 2022). For this song and for this population, we suggest overall performance would be indicated differently in terms of singing accuracy: without the 14th interval, participants might have demonstrated less interval compression and the overall mean accuracy would be considerably higher (Price, 2000).
In light of a prediction model suggesting some degree of variance in the 14th interval performance being accounted for in initial intervals, we suggest that definitions of in-tune singing might not be suited for evaluating the overall accuracy of a singer or overall difficulty of song stimuli. Instead, performances may yield more accurate portions of singing and less accurate portions of singing based on the song stimuli. Furthermore, initial interval performance in a criterion song may predict accuracy overall, or accuracy on more difficult intervals, but initial performance at the beginning of a song may not be a good predictor for every singer or in every song.
The criterion song represented several factors contributing to high difficulty including large intervals (Wolf, 2005) and a wide range (Rutkowski, 1990). If individuals did not choose a starting pitch intentionally (Nichols & Liu, 2022), the result is non-ideal conditions for accurate singing, whether solo or in unison with other voices. The large ascending 14th interval was performed least accurately, with nearly all singers compressing the interval, possibly to avoid that specific interval or to modulate the key to create ease in singing the intervals which follow. Some singers may perform a diminished interval while others use the diminished interval as an opportunity to modulate the key. The effect of intentionality here is deserving of future research in profiles of individual singers. A review of the audio files suggested to us that the starting pitch of each phrase, "Hap-," was often truncated for these non-professional singers, which we considered might lead to less reliable scoring of those notes. However, a comparison of performance prior to the 14th interval to pitches subsequent to the 14th interval without these syllables (1st, 7th, 13th, and 20th notes) did not yield increased accuracy (lower deviations).

Implications for music education
Interestingly, performance on the unison intervals (repeated pitch) such as intervals 7 and 13 in our model reflects the finding that singers do not universally sing these unison intervals in tune. Some individuals sang the unison intervals sharp while others flattened the second pitch of the interval. The 10th interval, an ascending fifth, was moderately and positively correlated to performance on the 14th interval (.59). Possibly, ease in singing small ascending intervals can be transferred to facilitating larger ascending intervals. However, notions of accuracy may not be useful indicators of individuals' ability when song range or singers' key selection is a more salient factor in vocal performances. Singing accuracy is a construct based on task-dependent features: difficult song stimuli may indicate lower singing accuracy compared to songs containing easier intervals or requiring small ranges. Previous research has used 25 cent thresholds for indicating overall accuracy among participants (e.g., Nichols & Wang, 2016) or 50 cents (e.g., Greenspon et al., 2017). Adults in the present subset of data would generally not have met either of these criteria (i.e., been deemed accurate), though a definition of inaccurate singing may be poorly applied here for a song of high difficulty employed in the general population. The participants here were adult non-professionals asked to sing a familiar song a cappella and in a solo singing condition without instruction or encouragement from a teacher. Participants sang low in their voices, which reflects findings in previous studies for children and adults (Moore, 1991), and adults currently singing in a choir may be expected to indicate more varied range choices (Nichols & Liu, 2022).
Thus, choosing one's own key appropriate to the range of a song may be an important outcome of vocal instruction for children and adults, with the implication that the greater the difficulty of intervals or the greater the range, the more difficult it may be to perform accurately.

Conclusion
Singing alone and singing with others are different experiences. Children, for example, demonstrate different performance when singing alone or doubled by another voice (Nichols & Lorah, 2020), and also when doubled by another instrument such as the piano (Nichols, 2020). Furthermore, the criterion song reported here is unlike the Star Spangled Banner, for which it is common to hear solo models who are chosen for their star status and singing ability. We surmise that for Happy Birthday, a lack of accurate modeling in the classroom and, for adults, the tendency to avoid choosing a common starting pitch in communal singing may result in lower accuracy. The Happy Birthday song is a near-universal song of celebrations in many cultures. It is sung individually and in groups in a spontaneous way where a common starting key is generally not emphasized. Just one of many songs in any culture that might be sung communally, it represents a familiar song used ceremonially, if informally. While singers may not sing in the same key, they also may not perform the song accurately. This analysis of previously collected data indicated a range of vocal performance on criterion intervals representing at least one interval that appears difficult to execute in this population. The use of the criterion song, Happy Birthday, may be useful in future studies as a commonly known song but may yield challenges for participants and for researchers drawing conclusions on singing accuracy. Still, we suggest that performance on early intervals in complex songs can be a valid indicator of performance, particularly useful for longer melodic fragments or songs.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.