Bayley III in Vietnamese children: lessons for cross-cultural comparisons

Background: There are limited psychometric reports of construct validity following adaptation of the Bayley Scales of Infant and Toddler Development 3rd edition (Bayley III). This paper aims to demonstrate a process of assessing reliability, validity, and gender equivalence of the adapted tool for Vietnamese children. Methods: We evaluated cognitive, fine motor, gross motor, expressive communication and receptive communication subtests of the adapted tool in 267 healthy urban Vietnamese children. Subsets of participants were used to evaluate inter-observer and test-retest reliability. Confirmatory factor analysis (CFA) was carried out to evaluate construct validity and measurement invariance between genders. Results: The adaptation demonstrated good inter-observer and test-retest reliability. CFA indicated that a construct representing a single underlying factor showed the best fit, although relationships between the observed scores and the latent traits underlying the scores varied between age groups. Within age groups, relationships between observed scores and these factors were not significantly influenced by gender. Conclusions: The Vietnamese Bayley III demonstrated good internal consistency and reliability. A latent structure with one general factor and additional residual correlations that change with age is supported by the theoretical understanding of child development. This is the first study to demonstrate gender invariance by age group. This adaptation is suitable for further research studies in urban Vietnamese children, but further work is needed to extend its applicability more broadly across Vietnam.


Introduction
Over the last 25 years, an unprecedented reduction in under-five mortality has been achieved under the Millennium Development Goals 1 . Following adoption of the Sustainable Development Goals in 2015 2 , there is now increased focus on children's early development. Valid and reliable child development assessment tools (CDATs) are vital to evaluate needs and assess outcomes of intervention programmes.
The Bayley Scales of Infant and Toddler Development, Third Edition (Bayley III) 3 , is widely used internationally to evaluate early child development. The tool assesses five domains: a) cognitive (91 items), b) language (receptive language, 49 items; expressive language, 48 items), c) motor (fine motor, 66 items; gross motor, 72 items), d) socio-emotional (35 items) plus e) adaptive behaviour (241 items). It was standardised on a cohort of 1700 US children, stratified by age, sex, parental education, race and geographic region (US norms) 3 . Raw scores are converted to scale scores and then to composite scores, which are used to determine the child's performance compared with these US norms. Worldwide these norms are commonly used as the reference population 4-6 .
The Bayley III was formulated on the principle that it measures underlying traits or latent factors. Confirmatory factor analysis (CFA) was used to demonstrate construct validity by evaluating relationships between test scores and different underlying traits/factors. The authors concluded that the test scores were best modelled by three underlying traits: motor, language and cognitive factors. This was evaluated on the total standardisation sample of 1700 children, split into 4 age groups of 300-600 children per group. The manual does not explain the rationale for selecting the age groups 7 .
When an assessment tool is adapted for use in another cultural context, it is important to demonstrate that the relationship of the observed scores to the underlying hypothesised traits (i.e. factors) is comparable to the original. This process of establishing construct equivalence of the adapted tool ensures validity of the test in the new setting. Van de Vijver describes increasing levels of equivalence 8 , culminating in full score equivalence, where the relationships between the test scores and their construct scales (i.e. the theoretical framework) have both the same measurement unit and the same origin 9, 10 . This is the only situation where scores have the same distribution in both groups or cultures compared, making it appropriate to use score means for analyses of variance and t-tests for group comparisons 11 . Measurement invariance analysis can be used to assess construct equivalence between and within populations through a step-wise increase in model constraints. The best fitting model of construct validity is first compared between groups at baseline (i.e. configural) level, and then specific parameters in the models are increasingly constrained to assess invariance at different levels: metric invariance; scalar invariance, which permits construct-level comparisons between groups 12 ; and finally strict invariance, although this is rarely achieved 13 . It is accepted that for cross-cultural comparisons scalar invariance is sufficient 12,14 .
By contrast, for within population comparisons (such as by gender) validity at the metric level is acceptable, implying that respondents from both groups understand the test and respond in similar fashion 8, 15 . Standardisation data is not publicly available, so evaluation is limited to determining whether the same original construct structure holds true in the new population.
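The increasing levels of invariance described above can be written out under a standard linear factor model for two groups. The notation is an assumption made here for illustration (x: observed subtest scores, τ: intercepts, Λ: loadings, ξ: latent factor, Θ: residual covariance matrix, g: group index); it is not taken from the Bayley III manual or from the analysis reported in this paper.

```latex
\begin{align}
x^{(g)} &= \tau^{(g)} + \Lambda^{(g)}\,\xi^{(g)} + \varepsilon^{(g)},
  \qquad \operatorname{Cov}\!\left(\varepsilon^{(g)}\right) = \Theta^{(g)} \\
\text{Configural:}\quad & \text{same pattern of free and fixed loadings in } \Lambda^{(1)} \text{ and } \Lambda^{(2)} \\
\text{Metric:}\quad & \Lambda^{(1)} = \Lambda^{(2)} \\
\text{Scalar:}\quad & \Lambda^{(1)} = \Lambda^{(2)},\ \tau^{(1)} = \tau^{(2)} \\
\text{Strict:}\quad & \Lambda^{(1)} = \Lambda^{(2)},\ \tau^{(1)} = \tau^{(2)},\ \Theta^{(1)} = \Theta^{(2)}
\end{align}
```

Each level keeps the constraints of the previous one and adds a further set, which is why the models are nested and can be compared step by step.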
For within culture or between population comparisons, differences in scores between groups may be biased by group membership rather than indicating a true difference between the groups. For example, there have been consistent gender differences in pre-term neonatal outcomes in studies which have used the Bayley scales in the US 16 and Sweden 17 . However, there is no data on whether the Bayley III is gender invariant, i.e. whether the scores and their relationship to the underlying constructs being assessed are the same, irrespective of gender, within the same population. Gender differences in behaviour between cultures are well described 14 , and an adapted tool may demonstrate different effects of gender on the theoretical constructs.
Establishing robust psychometric properties for an adapted CDAT is important to allow meaningful interpretation of the data collected using the tool. Here we describe in detail the processes we used to adapt the Bayley III for use in an urban setting in Vietnam, as well as our assessments of reliability, construct validity and gender equivalence of the adapted tool in a group of healthy Vietnamese children. This undertaking was part of a wider programme of work focussed on evaluating neurodevelopmental outcomes following severe hand foot and mouth disease in Vietnam.

Adaptation of the Bayley III tool
In summary, adaptation consisted of translation, evaluation of cultural modifications through the group's experience, back-translation, and implementation of the test in a group of volunteers (pilot testing), resulting in further modifications (see extended data 18 ).
After direct translation into Vietnamese by 3 psychologists, we adapted the cognitive, language and motor domains of the Bayley III in line with guidance from the International Test Commission 19 , and recommendations from publications on reducing cultural bias (Table 1) 20-23 . Six Vietnamese psychologists and one special needs teacher reviewed the direct translation for ambiguity, following which a post-doctoral language expert gave further advice 24 . An independent Vietnamese bioscientist then carried out a direct back-translation, and any discrepancies were reviewed and amended by another two independent bioscientists who had lived for 2 or more years with their children in the US. After pilot testing on 30 children a final version was created. Additionally, 18 children from the pilot testing had their assessments videoed. These 18 videos were used to assess inter-observer reliability.
Pearson Education Inc. granted the study team a licence for the translation and adaptation. The seven assessors underwent 6 months of training.

Study sites and selection of participants
The children in this study were enrolled as controls for a cohort study of enterovirus 71 infection and neurological development. The inclusion criteria were children aged less than 4 years from District 8, Ho Chi Minh City (HCMC). The exclusion criteria were a history of chronic severe illness (e.g. congenital cardiac disease, epilepsy), prematurity (born <37 weeks gestation), prior intensive care admission, or known developmental delay. Potential participants attending one of three specific kindergartens in District 8 were approached about the study. Participants were also recruited from a long-term birth cohort run as a collaboration between Oxford University Clinical Research Unit (OUCRU) and Hung Vuong Government Maternity Hospital 25 . Mothers of these infants were approached about the Bayley III evaluation when they attended routine study visits at 4, 9, 12 or 18 months after birth. Additional recruitment occurred at three government primary care clinics, administered by the Preventive Medicine Centre in District 8, on routine immunisation days. Recruitment from all sites occurred between September 2013 and 2014. Any child deemed to have a developmental problem was excluded from the study. Children were tested up to three times in 18 months, resulting in a total of 476 assessments. Evaluation of construct validity used the first assessment data (N=267).

Administration of the Bayley III
Following written informed consent, the parents/guardians of study participants were given an appointment for Bayley III testing, which was performed in a quiet private room either at the recruitment site or at OUCRU. Administration followed the original Bayley manual guidance. Cognitive, receptive and expressive language, fine and gross motor subtests were administered to each child on two occasions six months apart by the trained study staff. The child's age in months determined the start test item for each subtest. A standard case report form was used to record demographic and socio-economic data (see extended data 18 ).

Evaluation and statistical methodology
Reliability: Internal consistency was evaluated using Cronbach's alpha (acceptable values 0.7-0.9) 26 . All seven assessors independently scored the same 18 videos (pilot testing) for inter-observer reliability, which was assessed using the intra-class correlation (ICC). Test-retest reliability was assessed with Pearson's correlation in a convenience sample of study participants whose parents/guardians were willing to return within 2 weeks.
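As an illustration of the internal consistency measure named above, the following is a minimal sketch of Cronbach's alpha computed from a table of scores. It assumes one row per child and one column per item (or subtest); the function name and the example numbers are hypothetical and are not study data, and the paper does not state which software was used for this calculation.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per child and one column per item
    (or subtest); every row must have the same length.
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 children")
    # Sample variance of each item (column) across children.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of each child's total score.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scores for 3 children on 2 items (illustration only).
example_scores = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(example_scores)
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.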

Ethical considerations
Construct validity: CFA was used to determine the underlying structure using data from 267 children. Due to the limited sample size, we assessed three age groups: 0-12 months (N=86), 13-24 months (N=110), and 25-42 months (N=71). A single factor (general neurodevelopment) was specified in the CFA. If the model fit was not acceptable, modification indices were examined to identify areas of model misfit. In particular, these indices highlighted subtest scores that shared common variance beyond that explained by the underlying factor, suggesting correlated residuals that could better explain the observed pattern in the scores. Including these correlated residuals in the model would be expected to improve model fit 27 .
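To make the idea of a correlated residual concrete, the sketch below computes the covariance matrix implied by a one-factor model, with and without one freed residual covariance. It assumes the standard relation Sigma = lambda * lambda' * phi + Theta; the loadings, residual variances and subtest ordering are hypothetical values chosen for illustration, not estimates from this study.

```python
def implied_covariance(loadings, residual_vars, residual_covs=None, factor_var=1.0):
    """Model-implied covariance matrix for a one-factor model.

    Sigma[i][j] = loadings[i] * loadings[j] * factor_var + Theta[i][j],
    where Theta holds residual variances on the diagonal and any freed
    residual covariances, e.g. {(0, 1): 0.10}, off the diagonal.
    """
    k = len(loadings)
    sigma = [[loadings[i] * loadings[j] * factor_var for j in range(k)]
             for i in range(k)]
    for i in range(k):
        sigma[i][i] += residual_vars[i]
    for (i, j), cov in (residual_covs or {}).items():
        sigma[i][j] += cov
        sigma[j][i] += cov
    return sigma


# Hypothetical ordering: 0 = fine motor, 1 = gross motor, 2 = cognitive.
loadings = [0.8, 0.7, 0.6]
residual_vars = [0.36, 0.51, 0.64]

one_factor = implied_covariance(loadings, residual_vars)
with_fm_gm = implied_covariance(loadings, residual_vars, {(0, 1): 0.10})
```

In the second matrix only the fine motor and gross motor covariance changes, so a model with that residual freed can reproduce an observed covariance between the two subtests that is larger than the general factor alone would imply.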
Measurement invariance between genders: This was carried out using multiple group confirmatory factor analysis (MG-CFA) for each age group, using the pre-specified best model from CFA. MG-CFA progressively places constraints on the model; if the model continues to show adequate fit, measurement invariance at that level is demonstrated.
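The comparison between successive nested models is commonly made with a chi-square difference test. The sketch below is one way to do this with the Python standard library only; it assumes ordinary (unscaled) maximum likelihood chi-square values and integer degrees of freedom, and the function names and example numbers are hypothetical. With robust estimators a scaled difference test would be needed instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from a closed form, then apply the recurrence
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi2_difference_test(chi2_constrained, df_constrained, chi2_free, df_free):
    """Chi-square difference test between two nested models."""
    delta_chi2 = chi2_constrained - chi2_free
    delta_df = df_constrained - df_free
    if delta_df <= 0:
        raise ValueError("the constrained model must have more degrees of freedom")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit statistics (not study results): configural vs metric model.
delta_chi2, delta_df, p_value = chi2_difference_test(9.0, 12, 6.0, 8)
```

A non-significant p-value indicates that the added equality constraints have not significantly worsened fit, so invariance at that level is retained.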
Goodness of fit indices: The following measures of overall model fit were used, each with standard thresholds for goodness of fit: root mean square error of approximation (RMSEA; acceptable fit <0.08, good fit <0.05) 28 ; the comparative fit index (CFI; acceptable fit >0.90, good fit >0.95) 29 ; and the Tucker-Lewis index (TLI; acceptable fit >0.90, good fit >0.95) 29 . A significance level of 0.05 was used in all analyses. Akaike's information criterion (AIC) and Bayesian information criterion (BIC) were also used to compare models, with lower values preferred 32,34,35 .
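For readers unfamiliar with these indices, the sketch below shows the usual textbook formulas for RMSEA, CFI and TLI in terms of the model chi-square, the baseline (independence) model chi-square, and the sample size. It is an illustration under stated assumptions: some software uses N rather than N - 1 in the RMSEA denominator, the function names are hypothetical, and the example numbers are not study results. The RMSEA cut-offs in `rmsea_label` are those quoted in the paragraph above.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index (can fall outside the 0-1 range)."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


def rmsea_label(value):
    """Classify RMSEA using the thresholds quoted in the text."""
    if value < 0.05:
        return "good"
    if value < 0.08:
        return "acceptable"
    return "not acceptable"


# Hypothetical values: model chi2 = 10 on 5 df, baseline chi2 = 105 on 10 df, N = 101.
example_rmsea = rmsea(10.0, 5, 101)
example_cfi = cfi(10.0, 5, 105.0, 10)
example_tli = tli(10.0, 5, 105.0, 10)
```

Because RMSEA depends on the sample size while CFI and TLI depend on the baseline model, the three indices can disagree in small samples, which is one reason several are reported together.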

Characteristics of the study population
A total of 267 children aged 3-43 months were enrolled in the study between September 2013 and January 2014. Among this group, 191 children (72%) were recruited from the birth cohort, 54 (20%) from kindergartens, and 22 (8%) from the government primary care clinics.

Construct validity and gender measurement invariance
We present here the CFA results for a general factor and measurement invariance by age group (Table 4-Table 6).
Table legend (fit indices): Non-significant chi-square statistics at the p=0.05 level and RMSEA <0.05 indicate good fit. A confidence interval <0.08 derived from RMSEA was also taken as an indicator of good fit. CFI has acceptable fit at 0.9, and good fit at >0.957. TLI has good fit >0.9. A p-value of 0.05 was taken to be significant in all analyses. AIC: lower is better; it attempts to select models that are the most parsimonious/efficient representations of the observed data. BIC is similar to AIC but more conservative.
Group 1 (0-12 months): The construct structure with one general factor (Model 1) demonstrated unacceptable model fit, with RMSEA above 0.1 and significant chi square. The modification indices suggested residual correlation between gross and fine motor domains, indicating that the scores observed in gross and fine motor skills share additional variance that is not explained by the general factor. Once these residuals were allowed to be correlated in the model (Model 1 FM~~GM), the model fit improved significantly (non-significant chi square test, reduction in AIC and BIC) with RMSEA at 0.07 and almost perfect TLI and CFI.
Group 1 by gender: The male group for Model 1 showed acceptable model fit, while the female group fit optimally in the model with residual correlation of gross and fine motor skills. We carried out MG-CFA on the model with residual correlation of motor domains, and established that strict invariance was achieved, as the differences in chi square between nested models were not significant.
Group 2 (12-24 months): The model fit for the one-factor solution (Model 1) was not acceptable for Group 2, as the chi square test was significant, RMSEA was 0.23, and TLI was below 0.9. The modification indices suggested that residual correlation between the receptive (RC) and expressive (EC) communication domains would improve the model fit. Using Model 1-EC~~RC, the model fit was greatly improved (non-significant chi square test, reduction in AIC and BIC). Subsequently, we carried out MG-CFA using the revised model. The results showed that the fit of all models was very good, and the change in model fit between nested models was not significant, indicating that strict invariance was established.
Group 3 (24-43 months): The CFA result using Model 1 demonstrated acceptable model characteristics, with RMSEA lower than 0.1 and CFI and TLI both above 0.95. Therefore, we accepted this model and then performed MG-CFA. Strict invariance was established for Group 3, as the change in model fit between nested models was consistently non-significant.
MG-CFA on the three groups used the models derived from the CFA analysis. Strict gender invariance was achieved for all groups, with correlated residuals constrained in Groups 1 and 2.
In Group 1 and Group 3, no significant difference in latent means could be observed between the two genders. In Group 2, there was a marginally significant (p=0.05) difference in the latent means between genders, suggesting that girls performed better than boys on this tool of general neurodevelopment at the age of 12-24 months.
Legend for Table 6: X2 diff - chi-square difference between models; Df diff - change in degrees of freedom between models. Between nested models, a P value >.01 (non-significant) indicates that introducing the additional constraints has not significantly worsened model fit, i.e. the increase in χ2 value is not significant. ΔCFI - if <0.01 there is not a significant change in model fit between nested models.
Configural Invariance: baseline model to which we can compare more restrictive models; same common factors across groups.
Metric Invariance: common factors have the same meaning across groups.
Scalar Invariance: group differences in observed means will be directly related to group differences in factor means.
Strict Invariance: group differences in observed means and variances will equal corresponding group differences in factor means and variances.

Discussion
Viet Nam is the fourteenth most populous country in the world 38 . Achievements on the Millennium Development Goal targets put the country in a good position to tackle the 2030 Sustainable Development Goals 39 , for which a reliable and valid CDAT is required to assess needs and track progress.
This study demonstrates that our adaptation of the Bayley III for use in an urban Vietnamese population has good reliability, and also meets the criteria for strict gender invariance within each age group. However, the structure of the adapted tool differs slightly from the original US version. In our adaptation we identified three different models for the three age groups we evaluated. The changes we made are consistent with early development theories, which suggest that initial skills attained in the first year after birth are primarily motor, followed by language development increasing from the second year of life onwards 40 43 , changing from a two-factor structure in early childhood to a three-factor structure among the teenagers in a cohort of 688 children aged 6-15 years.
The original Bayley study used data from a stratified sample of 1700 children from across the US. In contrast, this study focused on a smaller sample of Vietnamese children from an urban district in HCMC, and the socioeconomic details for the participants were significantly different from publicly available census data for the general Vietnamese population. Although this is a clear limitation of the study, the work represents the first attempt to develop a locally relevant adapted tool for Vietnam and to formally evaluate the psychometric properties of the adapted tool. The research paves the way for further work, potentially expanding data gathering to include rural populations and extending the tool's applicability more broadly across Vietnam. For the present, this adaptation has clinical utility, is suitable for use in research studies involving urban Vietnamese children, and should prove to be a valuable instrument for evaluating early child development in this population.
• "What is already known on this topic"
- There is limited published literature on the process of validating Bayley III adaptations.
- Establishing robust psychometric properties for an adapted child development assessment tool is important to allow meaningful interpretation of data collected using the tool.
- Reported differences in scores between genders on Bayley III may be due to the test having different developmental meaning between genders.
• "What this study adds"
- This study outlines a method of assessing reliability and construct validity of an adapted test.
- The construct structure of the Vietnamese Bayley III varied by age in keeping with expected child development.
- The adaptation was not biased by gender and is suitable for use in future studies in urban Vietnamese populations.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.