Validity of the Montreal Battery of Evaluation of Amusia: An Analysis Using Structural Equation Modeling

The Montreal Battery of Evaluation of Amusia (MBEA) is the gold standard for diagnosing amusia. We aimed to evaluate its factorial and convergent validity. Data were collected for the MBEA and a self-report Amusic Dysfunction Inventory on a non-random sample (n = 249), and the following Structural Equation Modeling (SEM) procedures were conducted: confirmatory factor analysis of the theoretical model; exploratory SEM for alternative non-restricted factor solutions; and structural models with each of these solutions as predictors of the inventory's items. The theoretical model did not show acceptable goodness of fit, and two- and three-factor non-restricted models were better-fitting solutions for the Scale, Contour, and Interval tests, and for the Meter and Memory tests, respectively, than the theoretical one-factor model. This may reflect distinct perceptual processes related to neurocognitive demand. The non-restricted models of Scale, Meter, and Memory proved to be acceptable predictors of self-reported capacity for melodic perception, vocal production, rhythmic coordination, and memory.

Assessment of amusia has been almost exclusively performed using the Montreal Battery of Evaluation of Amusia (MBEA) (Pfeifer & Hamann, 2015).
Though not explicitly stated as a formal measurement model in the original theoretical scheme (see Peretz et al., 2003), the MBEA evaluates cognitive processing of music through six first-order factors: scale, contour, and interval, as part of a melodic organization second-order factor; rhythm and meter, as part of a temporal organization second-order factor; and memory-recognition, which is conditioned by melodic and temporal organization because the task requires prior completion of the melodic and temporal tests.
The MBEA theoretical model assumes modularity of music processing at three levels: 1) with regard to other cognitive functions (Peretz & Coltheart, 2003); 2) between second-order factors (e.g., impairment of melodic organization could exist in the presence of non-impaired temporal organization [Hyde & Peretz, 2004]; impairment of memory-recognition could exist in the presence of non-impaired melodic or temporal organization [Peretz et al., 2003]); and 3) between first-order factors (e.g., impairment of interval processing could exist in the presence of non-impaired contour processing, but not vice versa; and impairment of rhythm could exist in the absence of impaired meter [Peretz et al., 2003]).
Research using the MBEA has risen significantly; because of this, its validity is being rigorously tested, as could be expected of such a relatively new and almost unique measure. Several observations can be, or have been, made in this regard. First, the supposed modularity of music processing contradicts the MBEA's composite score, ultimately proposed by Peretz et al. (2003) as a unitary measure (a third-order factor) of music perception and memory functioning. Second, the only study that has undertaken the task of analyzing the dimensionality of the MBEA's items reported failure to replicate the assumed unidimensionality of five of its tests (Nunes-Silva & Haase, 2012). Third, one study (Henry & McAuley, 2013) has highlighted the crucial involvement of decision-making processes in the simple behavior of listening and responding to the MBEA items; these processes are not considered in the theoretical model on which the battery is founded (justifiably, for reasons of parsimony), but could have a modifying role in the structure of the theory. Fourth, little is still known regarding the relation between subjects' performance on the MBEA and actual specific amusic behaviors (e.g., ability to distinguish melodies without lyrics, minimal capacity to sing in tune or to follow the basic rhythm of a song), with several studies approaching the phenomenon through the inclusion of non-representative samples of so-called "self-declared amusics" with no clear or measured diagnostic criteria other than the MBEA itself. Fifth, there is still some discrepancy concerning the use of the MBEA in terms of the number of subtests, items, scoring, cutoff scores, diagnostic accuracy, and consequent estimation of prevalence (Henry & McAuley, 2013; Peretz & Vuvan, 2017; Pfeifer & Hamann, 2015).
In our view, in order to reach a better understanding of amusia, and of the other neurocognitive features that might be addressed alongside it, a more precise comprehension of the MBEA is needed, especially as it has been taken for granted as the gold standard for proving the existence of the very phenomenon it intends to measure. None of the studies targeting this knowledge gap (Henry & McAuley, 2013; Nunes-Silva & Haase, 2012; Pfeifer & Hamann, 2015) undertakes the task of thoroughly assessing the validity of the theoretical model underlying the MBEA. We believe that this assessment may provide information for further refinement of the theory behind the MBEA, and could have practical implications concerning the pertinence of each of its tests and items, and the potential relation of specific music perceptual domains (e.g., perception of pitch and perception of rhythm) to other cognitive and neural processes (e.g., decision-making and frontal lobe function).
Structural equation modeling (SEM) provides a way to address this issue based on systematic fit assessment procedures and estimation of relationships between latent constructs corrected for measurement error (Asparouhov & Muthén, 2009; Morin, Arens, & Marsh, 2016). Factorial validity of the MBEA can be assessed using: a) Confirmatory Factor Analysis (CFA) to empirically test the original theoretical model, assuming that specific indicators are only related to specific factors (e.g., variances in items of the MBEA's Rhythm test are caused by one and only one pre-specified factor measuring rhythm perception); and b) Exploratory SEM (ESEM) to estimate non-restricted relationships between multiple measures in the absence of a valid pre-specified model (e.g., variances in items of the MBEA's Rhythm test may be caused by two or more interrelated factors measuring different features of rhythm perception). For evaluation of criterion validity, SEM can also be utilized to test the capacity of the best factorial models (restricted or non-restricted) to predict specific outcomes (e.g., a three-factor model obtained through ESEM may show better associations with self-reported rhythm difficulties than a one-factor model obtained through CFA).
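To make the CFA/ESEM distinction concrete: the measurement part of both model types implies the same covariance structure for the indicators, Sigma = Lambda Phi Lambda' + Theta; what differs is which entries of the loading matrix Lambda are fixed to zero (CFA) versus freely estimated (ESEM). The following plain-Python sketch (our own illustration with hypothetical loading values, not the authors' code or data) computes the model-implied covariance matrix for a toy two-factor model:

```python
# Illustrative sketch only: under CFA, cross-loadings in Lambda are fixed to
# zero; under ESEM, every loading is freely estimated. In both cases the
# model-implied covariance of the indicators is
#   Sigma = Lambda * Phi * Lambda' + Theta,
# where Phi is the factor covariance matrix and Theta the diagonal of
# residual (uniqueness) variances.

def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def implied_cov(Lambda, Phi, Theta):
    """Sigma = Lambda Phi Lambda' + Theta (Theta given as a diagonal list)."""
    sigma = matmul(matmul(Lambda, Phi), transpose(Lambda))
    for i, u in enumerate(Theta):
        sigma[i][i] += u
    return sigma

# CFA-style pattern for 4 indicators and 2 factors: items 1-2 load only on
# F1, items 3-4 only on F2 (the zeros are fixed, not estimated).
Lambda_cfa = [[0.8, 0.0],
              [0.7, 0.0],
              [0.0, 0.6],
              [0.0, 0.9]]

# ESEM-style pattern: same items, but cross-loadings are permitted.
Lambda_esem = [[0.8, 0.1],
               [0.7, -0.1],
               [0.2, 0.6],
               [0.1, 0.9]]

Phi = [[1.0, 0.3],
       [0.3, 1.0]]               # standardized factors, correlated .3
Theta = [0.36, 0.51, 0.64, 0.19]  # hypothetical uniquenesses

sigma = implied_cov(Lambda_cfa, Phi, Theta)
```

Fitting either model then amounts to choosing the free parameters so that the implied Sigma reproduces the observed item covariances as closely as possible; ESEM simply leaves more of Lambda free.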
The overall purpose of this study was to comprehensively assess the validity of the MBEA, pursuing three objectives: 1) to test the factorial structure of the MBEA's original theoretical model; 2) to explore an alternative factorial structure for the MBEA using ESEM; and 3) to evaluate the convergent validity of the MBEA in relation to specific indicators of self-perceived amusic dysfunction.

Participants
Data were collected in three Mexican cities (Mexico City, Veracruz, and Merida) during July 2014-April 2015, through convenience sampling. Inclusion criteria were: 1) age between 14 and 70 years old (the range was established considering previous studies with this minimum age [Nunes-Silva & Haase, 2012], and also aiming to reduce the effect of age-associated cognitive decline [Deary et al., 2009]); 2) literacy; and 3) oral informed consent after assuring the participant's complete understanding of confidentiality, research purposes, presence of low risks (possible feelings of lost time, boredom, fatigue, or minimal hearing discomfort), procedures for minimizing risks (comfortable volume level, and breaks between tests), and voluntary completion of activities. Exclusion criteria were checked through the participant's self-report of: history of neuropathology (one or more of the following formal diagnoses: traumatic brain injury, stroke, epilepsy, neuroinfection, dementia, schizophrenia, or other neurodegenerative disease); compromised hearing acuity (formal diagnosis of damage to the tympanic membrane or inner ear structures, or subjective complaint of disability); and formal music training (two or more years of study in a music school).

Measures
MBEA items consist of piano melodies that participants listen to through headphones. For the Scale, Contour, Interval, and Rhythm tests, participants are presented with 31 pairs of monophonic melodies and asked to judge whether the two melodies are the same or different. A catch trial is included randomly within each of these first four tests to assure that the participant is paying attention. For the Meter test, participants are presented with a single homophonic melody and asked to judge whether it is a march or a waltz. For the Memory test, a single monophonic melody is played and participants must judge whether or not they heard it during the previous tests. A 30-point score is computed for each of the tests, and an average score can be calculated from the six subtests to obtain a global measure of music cognition (Peretz et al., 2003). Though the diagnostic accuracy of this scoring procedure has been questioned (Henry & McAuley, 2011), we decided to retain it for the descriptive analysis because the theoretical factorial structure of the MBEA was originally developed on the evidence provided by this scoring system, under the practical assumption that responses to individual items could be added and averaged into broader indexes (e.g., a global score), and because this system is still widely used in research on music perception and amusia (for examples, see: Fujito et al., 2018; Peretz & Vuvan, 2017; Tang et al., 2018). For the purpose of the main analyses (CFA and ESEM), we used the participants' responses to individual items.
A 9-item self-report Amusic Dysfunction Inventory (ADI) was designed ex profeso for this study as a way to explore amusic complaints in four domains: melodic perception (item 1: "I can tell when an instrument is out of tune"; item 2: "I'm capable of clearly noticing an incorrect note in a familiar melody"); rhythmic coordination (item 3: "I dance fluently and to the rhythm of music"; item 4: "It is difficult for me to follow the rhythm of a song with my hands or my feet"); vocal production (item 5: "People say I sing out of tune"; item 6: "I sing out of tune"); and memory (item 7: "I have difficulty in recognizing the melody of a song when it has no lyrics"; item 8: "I have trouble remembering melodies that I have heard several times"; item 9: "I can only remember the lyrics of songs and I often forget the melodies") (all item statements are translated literally from Spanish). The inventory is based on the MBEA's theoretical dimensions of melodic/temporal perception and memory, and on complaints of daily-life impairments frequently reported by amusic individuals, or specific dysfunctions observed by researchers (Cuddy et al., 2005; Peretz et al., 2003). Participants were asked to rate the frequency (1 = "Never", 2 = "Rarely", 3 = "Frequently", 4 = "Always") of behaviors and situations that could be related to amusic dysfunction.
A demographic questionnaire included items about gender, age, years of education (completed levels), handedness, and a brief checklist of exclusion criteria.

Procedure
After oral informed consent and a brief checklist for exclusion criteria, the MBEA and the ADI were computer-administered. Participants were instructed to register their responses on spreadsheets, arranged as follows: two columns for responses to each of the MBEA tests ("Yes" or "No" for all the tests except Meter, for which the responses were "March" or "Waltz"), and four columns for the Likert-type responses of the ADI. All evaluations were administered in quiet settings, during one session, and carried out by psychologists trained in the procedures of the study.
All procedures were in accordance with the Declaration of Helsinki, and approved by the Research Committee of Anahuac University.

Statistical Analyses
Gender, age, years of education, and handedness, as well as sum scores of the MBEA and the distribution of responses on the ADI, were described as means (standard deviations) or frequencies (percentages) for continuous and categorical data, respectively, in order to characterize the sample of participants. Statistical differences and correlations with the MBEA global score were computed for each of the demographics, aiming to identify significant (p < .01) confounders. All missing values were reported for specific variables. This analytical procedure was performed with IBM SPSS Statistics Version 22.
To assess the factorial validity of the MBEA, the three one-factor models corresponding to the theoretical dimensions of the MBEA were individually assessed through CFA, including: all 90 individual items from the Scale, Contour, and Interval tests as indicators of melodic organization; all 60 individual items from Rhythm and Meter as indicators of temporal organization; and the 30 individual items of the Memory test as indicators of the memory-recognition dimension. A one-factor model using all pooled items was also run to assess the global dimension of the MBEA. For every model, previously identified confounders were controlled. To determine the goodness of fit of the models, the following indices and corresponding cutoff values were taken into account: p > .05 in the chi-square test (χ2); comparative fit index (CFI) > .95; Tucker-Lewis index (TLI) > .90; and root-mean-square error of approximation (RMSEA) < .05 (Hu & Bentler, 1999; Schreiber, Stage, King, Nora, & Barlow, 2015). ESEM was conducted to consecutively assess the goodness of fit of different non-restricted multifactor models of the melodic and temporal dimensions, as well as for individual MBEA tests (Scale-Memory), using Geomin rotation. To choose a specific model, the chi-square difference test with scaling correction was utilized, considering p < .01 to avoid type I error. Significant (p < .05) standardized factor loadings and between-factor covariances were identified for each of the retained Scale-Memory models. To evaluate the criterion validity of the MBEA, six independent structural models were estimated using these factorial solutions as predictors of the ADI's indicators, as prototypically depicted in Figure 1. Identified confounders were also controlled in both analytical procedures.
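The fit indices named above follow standard formulas (see Hu & Bentler, 1999); the sketch below (our own illustration, with hypothetical chi-square values, not this study's results) shows how each is computed from the model and baseline chi-square statistics, and how the cutoffs apply:

```python
import math

# Standard formulas for the fit indices used in this study. chi2/df refer
# to the fitted model, chi2_b/df_b to the baseline (independence) model,
# and n to the sample size. All example values below are hypothetical.

def rmsea(chi2, df, n):
    """Root-mean-square error of approximation; < .05 is acceptable."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

def cfi(chi2, df, chi2_b, df_b):
    """Comparative fit index; > .95 is acceptable."""
    return 1 - max(chi2 - df, 0) / max(chi2 - df, chi2_b - df_b, 0)

def tli(chi2, df, chi2_b, df_b):
    """Tucker-Lewis index; > .90 is acceptable."""
    return ((chi2_b / df_b) - (chi2 / df)) / ((chi2_b / df_b) - 1)

# Hypothetical model with n = 249 (this study's sample size).
chi2, df = 120.0, 100.0
chi2_b, df_b = 1100.0, 120.0
print(round(rmsea(chi2, df, 249), 3))        # against the < .05 cutoff
print(round(cfi(chi2, df, chi2_b, df_b), 3))  # against the > .95 cutoff
print(round(tli(chi2, df, chi2_b, df_b), 3))  # against the > .90 cutoff
```

For these hypothetical values all three indices clear their cutoffs, which is the joint pattern required before a model is retained.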
For all SEM analyses, we decided to use individual items as ordered categorical indicators, instead of using scores based on the aggregation of correct answers (the MBEA scoring system); this procedure allows us to obtain latent variables that represent non-observable traits underlying the responses to the test (Bollen & Lennox, 1991). The use of linear composites (sums of correct items) was discarded because coefficients estimated with this method may be upwardly or downwardly biased. All SEM procedures were conducted in Mplus Version 6.12 (Muthén & Muthén, 1998-2011) using the weighted least squares means- and variance-adjusted (WLSMV) estimator.

Demographic Characteristics, MBEA Scores, and Responses to ADI
Of the 261 participants who concluded the assessment, six individuals were excluded from analysis due to a posteriori self-report of neuropathology history (n = 5) or compromised hearing acuity (n = 1), and six were excluded due to several discrepancies between item numbers and corresponding responses on one or more MBEA tests. A total of 249 cases were included in the analyses. Demographic characteristics and MBEA scores are presented in Table 1; only age reached a p value < .01 in relation to the MBEA global score, and was thus considered a confounder in the main analyses.

Discussion
The main purpose of this study was to evaluate the validity of the MBEA by thoroughly analyzing the original and alternative factorial structures via systematic fit assessment procedures, identification of latent constructs, and estimation of the relationships between them and with self-reports of specific amusic-related impairments.

Factorial Validity
Overall, the findings strongly suggest that the theoretical factorial structure of the MBEA is not a well-fitting measurement model of music neurocognition. More specifically, neither the assumed second-order factor structure of the battery (melodic and temporal organization) nor the presumed unidimensionality of each of its tests withstood empirical testing. Instead, the results of the ESEM suggest discarding melodic organization and temporal organization as composite measures, and considering multifactoriality for each of the tests.
For Scale, Contour, Interval, and Rhythm (Scale-Rhythm), the respective two-factor solutions showed better goodness of fit when compared to other multifactorial alternatives. Observing the factor loading patterns of these four models, a clear tendency can be detected: F1 was mostly loaded by items containing altered stimuli as the second melody (scale-, contour-, interval-, and rhythm-violated conditions, according to Peretz et al. [2003]), and F2 mainly comprised non-violated conditions. We suggest labeling these factors assessment of difference (AoD) and assessment of sameness (AoS), respectively. Both factors might reflect distinct neurocognitive processes or degrees of ability needed to perceive, retain, and compare the two melodies of each trial, as well as different degrees of decision bias partially founded on the difficulty of processing each melody.
As already noted by Henry and McAuley (2013) using signal detection theory, this response process is the result of the additive contributions of the listener's sensitivity to correctly discriminate between same versus different melodies and his/her response bias, which in the case of the MBEA seems to originate from a systematic measurement error founded on the dichotomous response options of each of its tests. Thus, the patterns of response may be an effect of common-method bias attributable to the MBEA (and thus a serious limitation of the battery) and not the expression of a critical feature of music cognition itself. However, taking into account the nature of the ESEM models (latent variables are free of measurement error [Bollen, 1984]), our results suggest that the obtained models are independent of the response bias, and that the source of the same-versus-different responses lies in two independent variations within the process of perception-retention-comparison-decision, which determine the final response of the listener. Knowing which of these processes is more critical for the diagnosis of amusia falls beyond the scope of this study.
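The signal-detection decomposition invoked here can be made concrete with the standard equal-variance formulas: sensitivity d' separates the "different" and "same" trial distributions, while the criterion c captures response bias. The sketch below is our own illustration with a hypothetical listener, not data from this study:

```python
from statistics import NormalDist

# Standard equal-variance signal detection theory: on a same/different test,
# a "hit" is responding "different" on an altered trial, a "false alarm" is
# responding "different" on an identical pair. z is the inverse standard
# normal CDF. Example rates below are hypothetical.

z = NormalDist().inv_cdf

def d_prime(hit_rate, fa_rate):
    """Sensitivity: distance between the 'different' and 'same' distributions."""
    return z(hit_rate) - z(fa_rate)

def criterion(hit_rate, fa_rate):
    """Response bias: c > 0 indicates a bias toward responding 'same'."""
    return -(z(hit_rate) + z(fa_rate)) / 2

# Hypothetical listener: detects 85% of altered pairs, but also calls 30%
# of identical pairs "different".
print(round(d_prime(0.85, 0.30), 2))
print(round(criterion(0.85, 0.30), 2))
```

Two listeners with the same proportion correct can thus differ in d' and c, which is why raw MBEA accuracy conflates perceptual sensitivity with the decision bias discussed above.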
Cross-loadings between AoD and AoS further shed light on this process, as most of the covariances were negative, possibly meaning that in order to produce an accurate assessment of the difference or similarity between melodies, individuals must execute one process of perception-retention-comparison-decision while restricting the use of the other (e.g., involvement of AoS, in theory more suitable for less neurocognitively demanding trials, in responding to trials with violated conditions might produce a higher rate of inaccurate responses).
In the case of the Memory test, a two-factor solution, though it showed acceptable goodness of fit, was not statistically more informative than a three-factor model. Many of the items loaded distinctively on F1 and F2 in a pattern similar to that of the Scale-Rhythm tests, suggesting a similar distinction of assessment processes for discriminating between recently learned and newly perceived musical material, possibly founded on our proposed assessment processes related to neurocognitive demand. Nonetheless, this three-factor model displayed several heterogeneous cross-loadings, making it harder to parsimoniously support this interpretation. The dependence of the Memory test on perception of, and performance on, the previous tests might have influenced this pattern.

Convergent Validity
Exploring convergent validity, the value of these two-factor solutions is further strengthened by the fact that it was mostly the AoD factor of each Scale-Rhythm test that was moderately associated (β ranging from -.14 to -.42) with the ADI's indicators. Among these tests, Scale's AoD associated best with the expected outcomes related to vocal production and melodic perception, whereas Contour and Interval displayed more heterogeneous covariation of AoD and AoS with all of the outcome indicators, nonetheless weighted toward AoD. This may be explained by the fact that the Scale test seems to have better accuracy for the diagnosis of amusia (Peretz & Vuvan, 2017; Goulet, Moreau, Robitaille, & Peretz, 2012); for this reason, it has been used as a screening measure of amusia in some studies (McDonald & Stewart, 2008; Peretz et al., 2008).
Contrary to what might be expected, no association between the Rhythm test's factors and the ADI's indicators of rhythmic coordination was found, only low associations with some of the rest of the ADI. Furthermore, low associations between the ADI's memory items and Scale-Interval's AoD were found, but only one of the ADI's memory indicators (self-perceived capacity to remember song lyrics but not melodies) showed a meaningful covariation with F3 of the MBEA's Memory test. Interestingly, indicators of melodic perception, rhythmic coordination, and vocal production (singing out of tune) also displayed moderate covariations with Memory's F3.
This pattern of relationships may advance the hypothesis of AoD and AoS as two different assessment processes strongly related to task complexity, meaning that Scale's AoD and Memory's F3 items may be considerably more neurocognitively demanding, and thus could be closer to the complexity of real-life musical behaviors. Other studies have already noted the sensitivity and specificity of the Scale and Memory tests for detecting amusia (Peretz & Vuvan, 2017; Pfeifer & Hamann, 2015; Henry & McAuley, 2013), although one study has reported scale processing as a rather automated function of the brain (Brattico, Tervaniemi, Näätänen, & Peretz, 2006).

Special Considerations for the Meter Test
For this test, a three-factor solution showed better goodness of fit, though with several cross-loadings between two or even all of its factors, and no clear item profile with which to characterize latent constructs. We hypothesize that, as with the other MBEA tests, the distinction between these factors might rely on different degrees of neurocognitive demand, perhaps with regard to the interaction between the metric pulse and more detailed musical features of the pieces comprising the test, such as tempo, saliency of rhythmic chords, or complexity of note durations (e.g., a higher frequency of crotchets or quavers). The insufficient number of items, and thus the non-representativeness of musical features, however, prevented this analysis.
Concerning its convergent validity, Meter's F1 and F3 (both containing fewer items than F2) were significantly associated with the expected ADI indicators pertaining to rhythmic coordination; particularly, F1 displayed an inverse association with self-perceived difficulty in following rhythm with hands or feet and in dancing fluidly, whereas F2 showed a positive association with the latter indicator. Interestingly though, very similar patterns were also observed in relation to the ADI's indicators of perception of melody. These patterns may support our hypothesis (e.g., F1 reflects a different degree of neurocognitive demand than F2); however, lacking further data to support this assumption, it is beyond the scope of this work to speculate on the matter. Thoughtful questioning of the inclusion of the Meter test in the MBEA total scoring can be stated though, as has been somewhat evidenced by other results that signal its particular behavior in contrast with the rest of the MBEA tests (Henry & McAuley, 2013; Nunes-Silva & Haase, 2012; Paraskevopoulos, Tsapkini, & Peretz, 2010; Toledo-Fernández & Salvador-Cruz, 2015).

Limitations
First, the use of non-random sampling might limit the external validity of the results due to the lack of methodological control of common confounders in neuropsychological testing, such as age and years of education. This issue was addressed in the case of age via statistical control within the tested models. We avoided statistical control of education based on its very weak correlation with the MBEA's global score, and because the concept of amusia itself excludes it as a confounder (Ayotte et al., 2002).
Second, the sample size could also limit the reach of the findings because of the number of MBEA indicators included in the analyzed models. In particular, CFA models require a considerably large sample size to attain an admissible solution when complex models are assessed (e.g., models with higher-order factors such as the one proposed by Peretz et al. [2003]). The flexibility of ESEM models, however, lessens this requirement, given the lack of restrictions in the covariance matrices and the use of Geomin rotation (Asparouhov & Muthén, 2009).
Third, items of the ADI were not rigorously developed through a formal methodological process (e.g., item pooling, piloting, or review by judges) but rather designed ex profeso based only on previous questionnaires, reports of amusics' common complaints of daily-life impairments, and laboratory-observed dysfunctions (Cuddy et al., 2005; Ayotte et al., 2002). It is interesting to note that recent questionnaires with similar evaluation of melodic perception, vocal production, rhythmic ability, and memory were developed contemporaneously with our ADI (Müllensiefen, Gingras, Musil, & Stewart, 2014; Pfeifer & Hamann, 2015), and could be used for further validation. Lastly on this matter, the MBEA being the gold standard for the diagnosis of amusia, the criterion validity of the ADI could not be tested a priori either. We believe that, in view of the results and the lack of another plausible self-report measure of amusic dysfunction, this study stands as a first examination of the validity of the ADI's items.
Fourth, dependence of observations might have influenced the associations between the MBEA's and the ADI's outcomes, since participants may have judged their self-perceived musical capacities predisposed by their perceived self-efficacy on the MBEA tests. This assessment procedure, however, is not rare in studies using the MBEA (Cuddy et al., 2005; Pfeifer & Hamann, 2015). Future studies could employ random or alternate sequences of the assessment procedure, or use ecologically valid behavioral measures (e.g., singing, hand coordination, dancing), aiming to avoid this bias.
Lastly, and though not exactly a limitation of our study's design per se, it is important to highlight the possible influence of acculturation on the observed performance of our sample on the MBEA. Evidence for the effects of musical acculturation on the MBEA has been reported for the Greek population, since their musical system significantly differs from that of the Western world (Paraskevopoulos et al., 2010). Considering that most Mexican popular music is founded on the Western tonal system, that for centuries it has received influences from European and North American music, and that Mexicans are currently highly exposed to international music which is also based on this musical system, we believe that our findings could be transferable to other populations. Further cross-cultural studies are needed in order to test this assumption.

Conclusion
Our analyses showed that alternative well-fitting models unveil an even more complex phenomenon of music neurocognition as measured by neuropsychological testing, and reveal the insufficiency of the current, most used measurement model for diagnosing it. Further neuropsychological research on the MBEA should be conducted using latent variable modeling to evaluate these measurement models in accurately pre-diagnosed amusic individuals, as well as in different groups with plausibly related brain pathologies (e.g., temporal lobe epilepsy, Alzheimer's disease, Parkinson's disease, schizophrenia), while rigorously testing the diagnostic precision of the alternative factorial models, their assumed dissociation from other neurocognitive functions, and their ecological validity.

Figure 1
Figure 1 Prototypical representation of ESEM of the MBEA and association with ADI indicators. Abbreviations: MBEA = Montreal Battery of Evaluation of Amusia; F = Factor; ADI = Amusic Dysfunction Inventory. The left side of the figure represents prototypical non-restricted models with different numbers of factors. The right side of the figure represents path analysis, with ESEM factors as predictors of specific ADI indicators. Confounders are not depicted in the figure. Squares represent observed variables (indicators) and circles represent unobserved variables (factors). Full unidirectional arrows linked to indicators or factors represent item uniquenesses or factor disturbances. Full unidirectional arrows pointing to indicators represent measurement error. Bidirectional full arrows linking ovals represent factor covariances and correlations. Bidirectional dashed arrows connecting single ovals represent factor variances (Morin et al., 2016). Three-pointed vertical lines indicate ellipsis (e.g., 1, 2, 3…10).

Table 1
Demographic characteristics and MBEA scores. Note: differences and correlations are conducted with the MBEA global score as the dependent variable; b = 1 missing value; c = 2 missing values. Abbreviations: MBEA = Montreal Battery of Evaluation of Amusia.

Table 2
Goodness-of-fit and difference testing for the MBEA theoretical and alternative model
