Motivation towards mathematics from 1980 to 2015: Exploring the feasibility of trend scaling

The Trends in International Mathematics and Science Study (TIMSS) has been assessing students' attitudes every four years since 1995. The trend scaling of these constructs started in 2011, fueling interest in exploring how different education systems perform regarding affective outcomes of education. This study explored the feasibility of establishing long-term motivational scales extended with the Second International Mathematics Study administered between 1976 and 1982. We investigated whether cross-cultural comparability holds and how different methodological approaches influence the long-term scaling of motivation towards mathematics. We used grade eight data from five educational systems that have participated at every time point up to 2015. We followed three alternative approaches: item response theory, confirmatory factor analysis, and a market-basket approach. Our results show that the three methods provide similar trends at the country level and high correlations at the student level. We discuss methodological implications in the context of international large-scale assessments.

The Trends in International Mathematics and Science Study (TIMSS) includes affective constructs, such as students' attitudes towards mathematics, measured with student background questionnaires. Trend scaling for affective constructs was only introduced recently, in TIMSS 2011 (Martin et al., 2016). It is important to explore the possibilities of extending these trend scales because country-level longitudinal data facilitate powerful analytical approaches to address causal research questions.
In the present study, we focused on the feasibility of extending the TIMSS trend scales of students' motivation towards mathematics. Following the model proposed by Eccles and Wigfield (2002), we distinguished motivation by its source. When individuals engage in an activity for instrumental reasons, i.e., to receive a reward, they are extrinsically motivated. In contrast, when individuals engage because they enjoy the activity itself, they are intrinsically motivated. We investigated the trend component of these two scales (i.e., intrinsic and extrinsic motivation) via confirmatory factor analysis (CFA) and item response theory (IRT) scaling methods, as well as a market-basket approach, while we studied relevant characteristics related to measurement bias and longitudinal linking.

Measurement bias and equivalence
Research on student outcomes across countries needs to consider cultural differences and the possibility of measurement bias. This seems obvious regarding the cross-cultural measurement of affective constructs; nevertheless, researchers face numerous statistical challenges. We employed the methodological framework proposed by van de Vijver (2018) to describe the types of bias and equivalence in the context of cross-cultural assessments measuring affective constructs. He identified three types of bias based on their sources: construct, method, and item bias. The presence of construct bias indicates that the construct measured is not identical across cultures. Method bias refers to confounding factors that originate in the sampling, the structural characteristics of the instrument, or the administration. Finally, an item is biased when it has a different psychological meaning across cultures.
In this framework, van de Vijver (2018) defined the measurement equivalence of scales by the level of comparability, distinguishing three types: construct, measurement unit, and full score equivalence (van de Vijver & Leung, 1997; van de Vijver, 2015, 2018). Construct equivalence is fulfilled when the same theoretical construct is measured in each group, i.e., configural invariance holds. Measurement unit equivalence corresponds to metric invariance, i.e., the scales have the same measurement unit but different scale origins. Finally, full score equivalence means the same as scalar invariance, i.e., the scales have the same measurement unit and origin.
Measurement equivalence, or invariance, can be tested in the structural equation modeling (SEM) framework with a measurement model applying a CFA approach, as the psychometric equivalence of a construct across groups, or in the IRT framework, as the lack of differential item functioning (DIF). Putnick and Bornstein (2016), in their extensive review, highlighted that the SEM framework using CFA is more commonly used than IRT. Numerous researchers (e.g., D'Urso et al., 2020; Kim & Yoon, 2011; Meade & Lautenschlager, 2004) have compared the two approaches and provided recommendations for measurement invariance testing in different assessment contexts. However, most of these studies were based on simulated data.
A recent report of an OECD conference on the cross-cultural comparability of questionnaire measures in large-scale assessments (van de Vijver et al., 2018) provided a broad and up-to-date discussion of techniques to investigate measurement invariance. Concluding the conference, Avvisati et al. (2018) pointed out that several participants observed how the distinction between the CFA and IRT worlds is largely artificial and that, despite the most rigorous application of preventive measures, the assumption of full comparability of measurement instruments in ILSAs cannot be upheld (see also Davidov et al., 2014). The report indicated a consensus among participants that any procedure to address the possible violation of full measurement invariance needs to consider the non-comparability of scales as a possibility. This possibility imposes great challenges and potential limitations on longitudinal linking.

Scaling affective items in international large-scale assessments
The TIMSS context questionnaire scales for trend measurement were constructed with IRT scaling using the Rasch partial credit model (PCM; Martin et al., 2016; Masters, 1982; Yin & Fishbein, 2020). To evaluate the context questionnaire scales, Cronbach's alpha coefficient, measuring internal consistency, was computed for each scale in every educational system, and a principal component analysis of the scale items was conducted. Measurement invariance across countries was not evaluated; instead, the scaling was done with a single-group design.
Questionnaire data surveying latent constructs may also be scaled within the SEM framework, with a CFA measurement model. An example of an ILSA employing CFA for scaling affective items is the Teaching and Learning International Survey (TALIS), administered by the Organisation for Economic Co-operation and Development (OECD) since 2008. The comparability across participating educational systems in all three cycles of TALIS was evaluated by measurement invariance testing with multiple-group confirmatory factor analysis (MGCFA; Organisation for Economic Co-operation & Development, 2019).

Longitudinal linking
The process of adjusting, via statistical methods, two tests with differences in content or difficulty is known as linking. There is extensive research on linking cognitive outcomes in ILSAs over time, with various linking approaches. Linking can be achieved using IRT linking methods (see e.g., Afrassa, 2005; Johansson & Strietholt, 2019; Majoros et al., 2021; Strietholt & Rosén, 2016), which require, among other preconditions, a set of common items across tests. There have also been several attempts to link test scores from different regional, national, or international assessments, assuming similar target populations and representative samples over a long period. These linking studies rely on IRT within the assessments and classical test theory across them because of the limited number of overlapping items (see e.g., Altinok et al., 2018; Chmielewski, 2019; Hanushek & Wößmann, 2012).

The current linking practice in TIMSS for affective scales
Certain context questionnaire scales, constructed with IRT scaling, that maintained many of the same items across TIMSS 2011, TIMSS 2015, and TIMSS 2019 (see Martin et al., 2016; Yin & Fishbein, 2020) were linked through a two-step transformation process applying the mean/sigma method. The first transformation placed the TIMSS 2019 logit scale scores on the TIMSS 2015 logit metric by applying the procedure described by Marco (1977), and referred to by Kolen and Brennan (2014) as the mean/sigma method, to the two sets of common item parameters. These sets were estimated by the separate calibration of the TIMSS 2019 data and the TIMSS 2015 data. The mean and standard deviation of the estimates of the threshold parameters (Masters, 1982), i.e., the difference between the item location and item step parameters, were used for all common items and all categories in each calibration. The second step was to transform the TIMSS 2015 Rasch logit scores onto the TIMSS scale reporting metric (mean: 10, standard deviation: 2). To assess the accuracy of the linking, item parameter estimates for the common items were compared across the two cycles by examining the differences between the TIMSS 2019 item parameter estimates, after being transformed to the TIMSS 2015 logit metric, and the TIMSS 2015 item parameter estimates on the 2015 logit scale. This linking procedure assumed full measurement invariance across countries at each time point.
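As an illustration of the mean/sigma step, the following sketch computes the transformation constants from two sets of common-item threshold estimates and applies them to logit scores; the threshold and score values are hypothetical, not TIMSS estimates.

```python
# A minimal sketch of the mean/sigma linking method (Marco, 1977; Kolen & Brennan, 2014).
# All numeric values below are hypothetical, not TIMSS parameter estimates.

def mean_sigma(thresholds_base, thresholds_new):
    """Return slope A and intercept B placing the new metric on the base metric."""
    mean = lambda xs: sum(xs) / len(xs)
    sd = lambda xs: (sum((x - mean(xs)) ** 2 for x in xs) / len(xs)) ** 0.5
    A = sd(thresholds_base) / sd(thresholds_new)
    B = mean(thresholds_base) - A * mean(thresholds_new)
    return A, B

# Common-item threshold parameters from two separate calibrations (hypothetical)
b_2015 = [-1.2, -0.3, 0.4, 1.1]
b_2019 = [-1.0, -0.1, 0.6, 1.3]

A, B = mean_sigma(b_2015, b_2019)
# Place hypothetical 2019 logit scores on the 2015 logit metric
theta_2019_on_2015 = [A * t + B for t in (-0.5, 0.0, 0.8)]
```

A further linear transformation of the same form would then carry the linked logit scores onto the reporting metric (mean 10, standard deviation 2).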

The present study
In terms of method bias, we built on previous research (Majoros et al., 2021; Majoros et al., 2020) evaluating the comparability across the respective assessments, i.e., SIMS and every cycle of TIMSS between 1995 and 2015. To determine their overall similarity, the inferences, populations, measurement characteristics (i.e., method bias), and constructs (i.e., construct bias) were explored based on the scheme proposed by Kolen and Brennan (2014). A sufficient degree of overall similarity was found; therefore, we assumed that method bias did not severely impact the trend scales in the present study.
To evaluate construct bias across countries, we followed the guidelines proposed by Svetina et al. (2020) to test measurement invariance across countries at each time point. They focused on selected solutions by Wu and Estabrook (2016) in terms of model identification and invariance testing. Thus, after establishing configural invariance, threshold invariance was tested first, followed by invariance testing for factor loadings. This approach differs from the current practice of measurement invariance testing, where a baseline model is established first and increasing parameter restrictions are subsequently imposed.
Item bias over time was evaluated focusing on the anchor items between time points. We used Angoff's delta plot method (Angoff & Ford, 1973) to investigate item parameter drift between time points. The delta plot is a score-based method that compares the proportions of correct (or endorsed) responses in the reference group and the focal group. Items are flagged as biased when they change relative to the set of all items in the test. Magis and Facon (2014) argued that the main benefit of using relative methods is that the identification of problematic items relies on the particular items themselves. Moreover, we had a small number of anchor items, and our major interest was in their overall trend.
We then performed the linking and scaling of the data with three methods applying different sets of assumptions. Since we attempted to link non-identical sets of items measuring the same constructs over time, only subsets of items bridged the assessments. Therefore, we made use of latent variable modeling in the IRT and SEM frameworks. In addition, we proposed a third alternative based on manifest probabilities and plausible scores: the market-basket approach.
First, the IRT linking was achieved by concurrent calibration (Wingersky & Lord, 1984) of all items in all studies; thus, the parameters estimated for each test were automatically put on the same scale. We chose the concurrent procedure because this method provides smaller standard errors and involves fewer assumptions than other IRT procedures, and good linking may be achieved with as few as five common items or less (Wingersky & Lord, 1984). Item parameters were estimated simultaneously while the parameters of the anchor items were assumed identical across all time points and educational systems. We compared the PCM applied in TIMSS with the generalized partial credit model (GPCM; Muraki, 1992). Second, in the SEM approach, we fit a single-group CFA model for each motivation scale and estimated factor scores for scaling the data. We assumed strong invariance across countries and over time.
Third, instead of reporting estimates on latent variable scales, we used the market-basket approach proposed by Zwitser et al. (2017). The main idea is that the construct is defined by a large set of items, data are collected with subsets of items, and results are reported in terms of summary statistics. To deal with the incomplete data, we fit a measurement model (e.g., an IRT model) to generate plausible responses. When applying this approach, we assumed that the anchor items' parameter estimates are invariant over time within countries. To account for cross-cultural DIF, we fit a separate model per country. Another assumption was that, for each time point, the market basket of items in the survey represented the construct. Finally, we reported the results based on summary statistics, in this case, the expected sum scores over the completed set of responses, i.e., plausible scores.

Data
The present study focused on grade eight (or equivalent) student questionnaire data from seven ILSAs on mathematics administered by the International Association for the Evaluation of Educational Achievement (IEA). Hence, we pooled the data of SIMS, administered in 1980, and all six cycles of TIMSS, administered every four years from 1995 to 2015. The data of SIMS were gathered from the Center for Comparative Analyses of Educational Achievement website (COMPEAT). Data and documentation of the TIMSS studies were downloaded from the IEA Study Data Repository. We selected the six educational systems that have participated at all time points: England, Hong Kong, Hungary, Israel, Japan, and the United States. The sample sizes are presented in Table 1. We can observe that in 1995, two adjacent grades were sampled in each country except for Israel. The sample size differences were taken into account with the use of senate weights. More details are provided in the analytical steps section.

Items
The items included in the present study correspond to intrinsic and extrinsic motivation towards mathematics as included in the student questionnaire of each assessment (Appendices A and B). The overlapping items, along with their variable names, are presented for each scale in Tables 2 and 3. We can observe that there are identical and similar items across assessments. Nevertheless, item wordings have changed over time in some cases. In a few instances, this also meant shifting from positively worded statements to negatively worded items. These changes might influence comparability (for the effects of item wording changes, see e.g., Dedrick et al., 2007; Schuman & Presser, 1996).
The number of overlapping items is summarized in Table 4. Overall, the pooled extrinsic motivation scale consisted of 15 items, while the intrinsic motivation scale comprised an item pool of 19 questions.
The students had four response options to choose from for all items in all TIMSS cycles: strongly agree, agree, disagree, and strongly disagree (the wording refers to 1995). However, in SIMS, they had a middle option: undecided. The proportions of undecided responses in the analyzed countries are shown in Tables 5 and 6. It is interesting to observe how these proportions vary between countries. In most cases, the Japanese students used the middle option considerably more frequently than students in other countries. Interestingly, the only exception was the item "I would like to work at a job that lets me use mathematics": for this item, the proportion of undecided responses among Japanese students was only about half of that observed for the other countries or for the other items within the intrinsic motivation scale. This could be a preliminary indication of measurement non-invariance among the educational systems. Due to the considerably large proportion of middle responses, we decided not to treat these responses as missing values. Instead, we recoded them to random answers between the options agree and disagree. There were some cases where a student selected the middle option for all items. We excluded these cases: 0.95% of the sample for the extrinsic and 0.71% for the intrinsic motivation scale.
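A minimal sketch of this recoding step, assuming hypothetical response codes (1 = strongly agree, 2 = agree, 3 = undecided, 4 = disagree, 5 = strongly disagree; the actual SIMS coding may differ):

```python
import random

# Sketch of the middle-option handling: undecided responses (3) are replaced
# with a random draw from {agree, disagree}; respondents who chose the middle
# option for every item are flagged for exclusion. Codes are hypothetical.

def recode_undecided(responses, rng=random.Random(1980)):
    """Replace each undecided (3) with a random choice between agree (2) and disagree (4)."""
    return [rng.choice([2, 4]) if r == 3 else r for r in responses]

def all_undecided(responses):
    """Flag respondents who selected the middle option for all items (excluded cases)."""
    return all(r == 3 for r in responses)
```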

Missing data
The proportion of missing responses ranged from 0.05% to 2.11% in the extrinsic and from 0.14% to 1.76% in the intrinsic motivation scales. One item had only missing values for the Japanese sample in 1995.

Internal consistency of the scales
The internal consistency of the motivation scales varied across educational systems and over time. Appendix C shows the Cronbach's alpha coefficients, which range between 0.47 and 0.94. The Japanese data from SIMS displayed unacceptable coefficients for both scales. Apart from these values, in most instances, the reliability was acceptable (>0.70; Cortina, 1993) and in all cases above 0.61.
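For reference, Cronbach's alpha can be computed from the item variances and the total-score variance; a minimal sketch (complete cases only):

```python
# Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
def cronbach_alpha(item_scores):
    """item_scores: list of respondents, each a list of item scores (no missing values)."""
    k = len(item_scores[0])   # number of items
    def var(xs):              # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in item_scores]) for i in range(k)]
    total_var = var([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Perfectly consistent items yield an alpha of 1; weakly related items pull the coefficient down towards (and possibly below) zero.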

Comparability
Cross-cultural comparability. To test measurement invariance across countries, we performed an MGCFA for each time point, using Mplus 8. Students were grouped by country, and the first step was to identify the baseline model and test for configural invariance among countries. After establishing configural invariance, threshold invariance was tested, followed by invariance testing for factor loadings. The questionnaire items were treated as categorical variables, and we followed the procedure outlined by Svetina et al. (2020). The WLSMV estimator was used to estimate the factor models. This method produces weighted least squares parameter estimates using a diagonal weight matrix, robust standard errors, and a mean- and variance-adjusted χ2 test statistic (Brown, 2015).
Longitudinal comparability. We used Angoff's delta plot method (Angoff & Ford, 1973) for the detection of item parameter drift between time points, using the deltaPlotR package (Magis & Facon, 2014) for the statistical environment R (R Core Team). Under this method, the proportions of responses indicating positive endorsement are compared between the two groups. If there is no item parameter drift, these proportions should be located on a diagonal line. Items that are separated from that diagonal are flagged as biased items. For this step, we recoded the answers strongly agree and agree to 1 and strongly disagree and disagree to 0. Following the suggestion of Magis and Facon (2014), the threshold was derived by using a normality assumption on the delta points. Each item j has a pair of delta scores (Δj0, Δj1), referred to as the delta point. These delta points can be displayed in a scatter plot, called the diagonal plot, with the delta scores of the reference group on the X-axis and those of the focal group on the Y-axis. The plot usually takes the form of an elliptical cloud of delta points. The items that substantially depart from the major axis of this ellipsoid can be flagged as DIF.
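A minimal sketch of the delta-point computation, the major axis of the point cloud, and the perpendicular distances used for flagging, assuming hypothetical endorsement proportions (the analyses themselves were run with deltaPlotR):

```python
from statistics import NormalDist, mean, variance

def delta_scores(proportions):
    """Angoff delta transform: Delta = 13 - 4 * z(p), p = proportion endorsing."""
    z = NormalDist().inv_cdf
    return [13.0 - 4.0 * z(p) for p in proportions]

def major_axis(d0, d1):
    """Intercept a and slope b of the major axis of the delta-point cloud."""
    s0, s1 = variance(d0), variance(d1)
    m0, m1 = mean(d0), mean(d1)
    s01 = sum((x - m0) * (y - m1) for x, y in zip(d0, d1)) / (len(d0) - 1)
    b = (s1 - s0 + ((s1 - s0) ** 2 + 4 * s01 ** 2) ** 0.5) / (2 * s01)
    a = m1 - b * m0
    return a, b

def perpendicular_distances(d0, d1, a, b):
    """Distance of each delta point from the major axis; large |D| suggests drift."""
    return [(b * x + a - y) / (b ** 2 + 1) ** 0.5 for x, y in zip(d0, d1)]
```

If the two groups behave identically, the major axis is the identity line and all distances are zero.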
The major axis is computed with the following equation:

Δj1 = b Δj0 + a, (1)

with slope b = [s1² − s0² + √((s1² − s0²)² + 4 s01²)] / (2 s01) and intercept a = x̄1 − b x̄0, in which
• x̄0 and x̄1 are the sample means of the delta scores,
• s0² and s1² are the sample variances, and
• s01 is the sample covariance of the delta scores.
The perpendicular distance Dj between the major axis given in equation (1) and the delta point (Δj0, Δj1) is computed as follows:

Dj = (b Δj0 + a − Δj1) / √(b² + 1). (2)

Items whose distance exceeds the detection threshold are flagged as drifting.

CFA scaling. We fit a single-group CFA model for each motivation scale to the pooled sample composed of data from all countries and cycles. We assumed strong invariance of the anchor items across countries and over time. We used the estimated factor scores applying maximum likelihood estimation with robust standard errors (MLR) and the full information maximum likelihood (FIML) estimation in Mplus as a means of handling the missing data, while the items were treated as categorical variables.
For the responses missing by design, we applied the pattern function in Mplus. This does not work together with the WLSMV estimation, but for items with more than three response options, the superiority of WLSMV over maximum likelihood estimation is less clear (Beauducel & Herzberg, 2006). We then transformed the factor scores onto a scale with a mean of 5 and a standard deviation of 1. According to the SEM framework, an item vector y is predicted from the latent factor η as shown in the following equation:

y = τ + Λη + ε, (3)

in which
• τ denotes the vector of item intercepts,
• Λ is the vector of factor loadings, and
• ε is the vector of residuals.
To estimate factor models from ordinal items, the MLR estimation procedure for continuous latent constructs was used because it is robust to non-normality. Mplus uses the maximum of the posterior distribution of the factor, which is known as the maximum a posteriori method (Muthén & Muthén, 1998-2017). The factor score estimate η̂i for individual i is based on a regression method with correlated factors, where the factor score is computed as follows:

η̂i = C(vi − μ), (4)

in which
• μ is the mean vector of the y items,
• C is the factor score coefficient matrix, and
• vi is the vector of observations for individual i.
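A minimal numerical sketch of this regression-method factor score for a one-factor model; the coefficient vector, observations, and item means below are hypothetical, not estimates from the study:

```python
# Regression-method factor score for one factor: eta_i = C (v_i - mu).
# All numeric values are hypothetical illustrations.

def factor_score(C, v_i, mu):
    """C: factor score coefficient (row) vector; v_i: observed item scores;
    mu: item means. Returns the scalar factor score estimate."""
    return sum(c * (v - m) for c, v, m in zip(C, v_i, mu))
```

For example, with coefficients (0.3, 0.5), observations (3, 4), and means (2.5, 3.5), the estimate is 0.3·0.5 + 0.5·0.5 = 0.4 on the latent metric, before any rescaling to the reporting scale.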
IRT scaling. First, we compared two models using the R package mirt (Chalmers, 2012), employing an expectation-maximization algorithm to achieve marginal maximum likelihood estimates of the item parameters and person scores as outlined by Bock and Aitkin (1981). In the first model, item parameters were estimated using the PCM, following the scaling procedure in TIMSS. The PCM gives the probability that a student with proficiency θs will have, for item i, a response xis that is scored in the l-th of mi ordered score categories as:

P(xis = l | θs) = exp(Σv=0..l (θs − bi + di,v)) / Σg=0..mi−1 exp(Σv=0..g (θs − bi + di,v)), with di,0 = 0, (5)

in which
• xis is the response of student s to item i,
• θs is the ability of student s,
• bi is the location/difficulty parameter of item i,
• mi is the number of response categories for item i, and
• di,l is the category l threshold parameter of item i.
In the second model, item parameters were estimated using the GPCM (Muraki, 1992). The fundamental equation of this model gives the probability that a student with proficiency θs will have, for item i, a response xis that is scored in the l-th of mi ordered score categories as:

P(xis = l | θs) = exp(Σv=0..l ai(θs − bi + di,v)) / Σg=0..mi−1 exp(Σv=0..g ai(θs − bi + di,v)), (6)

in which ai is the slope/discrimination parameter of item i.
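The two category-probability functions can be sketched in one routine, since the PCM is the GPCM with the slope fixed at 1; the parameter values used in the example are hypothetical:

```python
import math

def gpcm_probs(theta, a, b, d):
    """Category probabilities under the GPCM; a = 1 recovers the PCM used in TIMSS.
    theta: person parameter; a: slope; b: item location;
    d: list of step parameters d_{i,1..m-1} (category 0 contributes 0)."""
    steps = [0.0] + [a * (theta - b + dv) for dv in d]
    # Cumulative sums form the numerators' exponents for categories 0..m-1
    cum, total = [], 0.0
    for s in steps:
        total += s
        cum.append(total)
    expn = [math.exp(c) for c in cum]
    denom = sum(expn)
    return [e / denom for e in expn]
```

At theta equal to the item location with zero step parameters, all categories are equally likely; raising theta shifts mass towards the higher categories.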
The model comparison showed that the GPCM fit the data better for both scales. The Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) were calculated. Both information criteria indicated the better fit of the GPCM, which allows items to vary in discrimination, in contrast with the PCM employed for the TIMSS contextual trend scales.
Second, the item parameter estimation was conducted by concurrent calibration of all items in all studies; thus, the parameters for all tests were automatically put onto the same scale. Item parameters were estimated simultaneously while the parameters of the anchor items were assumed identical in each sample. Third, we estimated the person scores and transformed them onto a scale with a mean of 5 and a standard deviation of 1.
Market-basket approach. The market-basket approach assumes that the items included in the assessment or survey define the construct. In this case, the assumption was that all items from across the time points related to intrinsic and extrinsic motivation towards mathematics define each construct and can be considered a market basket of representative items. Here we did not have an incomplete design, but missing responses occurred as a consequence of changes in the questionnaires across cycles. We followed the procedure described by Zwitser et al. (2017), using a measurement model per country as a tool to generate plausible responses and fill in the missing responses for items that were not included in a given cycle. Using the item parameters estimated by fitting the measurement models, we imputed missing responses five times per respondent and calculated sum scores, thereby estimating five plausible scores.
It is worth pointing out three aspects of this method. Firstly, an IRT model is not required; any kind of measurement model can be used to generate plausible responses. We used a GPCM for consistency with the TIMSS procedure. Secondly, this GPCM was fit for each country separately, so differences among countries (i.e., DIF) did not influence the generation of plausible responses. Thirdly, the results and comparisons were based on a sum score over the market basket of representative items and not on estimated latent variables. In this way, comparability across countries was not threatened by differences between them. We transformed the plausible scores onto a scale with a mean of 5 and a standard deviation of 1.
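A minimal sketch of the imputation step, with the per-country measurement model replaced by a stand-in table of category probabilities (in practice these would come from the fitted GPCM); item names and probabilities are hypothetical:

```python
import random

# Market-basket sketch: missing responses (None = item not administered that cycle)
# are imputed from model-implied category probabilities, and the sum score over
# the completed basket is taken as a plausible score. Values are hypothetical.

def impute_plausible_scores(responses, category_probs, n_plausible=5, seed=2017):
    """responses: dict item -> observed score or None.
    category_probs: dict item -> list of category probabilities for missing items."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_plausible):
        total = 0
        for item, r in responses.items():
            if r is None:  # draw a plausible response from the model
                probs = category_probs[item]
                r = rng.choices(range(len(probs)), weights=probs)[0]
            total += r
        scores.append(total)
    return scores
```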
Observed scores. We used the sum of the observed scores per person at each time point and divided it by the number of answered items. A higher score indicates a more positive attitude. We then standardized these scores to a mean of 5 and a standard deviation of 1 for the whole sample. We used this scale for presenting the results of the scaling.
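A minimal sketch of this observed-score computation and rescaling, where None marks a missing response (unweighted, for illustration only):

```python
# Mean of answered items per person, then rescaled to mean 5, sd 1 over the sample.
def observed_scale_scores(response_matrix):
    """response_matrix: list of respondents, each a list of item scores or None."""
    raw = []
    for row in response_matrix:
        answered = [r for r in row if r is not None]
        raw.append(sum(answered) / len(answered))
    m = sum(raw) / len(raw)
    sd = (sum((x - m) ** 2 for x in raw) / len(raw)) ** 0.5
    return [5.0 + (x - m) / sd for x in raw]
```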
Weights. Senate weights, which sum to 500 for each country's data, were applied in the scaling procedures (stratum weights in SIMS were rescaled to senate weights); thus, the sample size differences between countries were taken into account, and each country contributed equally to the estimation of the scales. In TIMSS 1995, two grades were sampled in each country except for Israel; thus, we rescaled the senate weights so that each grade was weighted equally within a country.
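A minimal sketch of the senate-weight rescaling (the input sampling weights are hypothetical):

```python
# Senate weights: rescale a country's sampling weights so they sum to 500,
# making every country contribute equally to pooled estimation.
def senate_weights(sampling_weights):
    total = sum(sampling_weights)
    return [w * 500.0 / total for w in sampling_weights]
```

The rescaling preserves the relative weights within a country while equalizing each country's total contribution.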

Cross-cultural comparability
We tested measurement equivalence across countries at each time point. The MGCFA invariance testing of SIMS revealed that four items out of eight in the extrinsic motivation scale had negative factor loadings in the case of Japan in the baseline model. The same pattern emerged when fitting the baseline model for the intrinsic motivation scale, where five items out of 11 showed negative factor loadings. We concluded that measurement invariance did not hold for Japan and excluded this country from all further analyses.
The threshold and loading equality constraints yielded acceptable model fit at most time points for the five-country multiple-group model, as presented in Appendix D. However, the root mean square error of approximation (RMSEA; Steiger & Lind, 1984, as cited in Brown, 2015) values were in many cases too high, while the absolute and comparative fit indices were mostly acceptable (except for the sample size-sensitive χ2 values). A possible explanation for the poor RMSEA values is that this absolute fit measure is considered a parsimony correction index, and it indicates poor fit because we had relatively high numbers of freely estimated parameters in these models. In addition, the poor relative fit indices (CFI, TLI) in the early assessments could be attributed to the mixed-worded scales. The presence of negatively worded items potentially causes one-dimensional CFA models to show poor fit (e.g., Marsh, 1996; Steinmann et al., 2021; Woods, 2006; Zhang et al., 2016). Finally, as Shi and Maydeu-Olivares (2020) argue, model fit values are influenced by many factors, such as the estimation method or the categorical/continuous specification, and they suggest relying only on the SRMR because it is more consistent across these factors. We concluded that the measurement of the constructs included in the present study was invariant across countries at each time point. However, this does not imply full invariance across time points.

Longitudinal comparability
We evaluated the assumption of the invariance of the anchor items across time by employing the delta plot method for each bridge, i.e., each pair of consecutive time points. The tests were conducted for each country separately as well as for the pooled data, and none of these tests flagged items for bias. For simplicity, only the plots of the pooled data are shown in Appendices E and F.

Trend scaling
We treated the countries as a single group in the CFA and IRT scaling procedures and as separate groups in the market-basket scaling. The three methods yielded similar results at the individual as well as the country level. The correlations between individual scores were high across methods for both motivation constructs, ranging between 0.96 and 1. Figs. 1 and 2 show the country-level means for extrinsic and intrinsic motivation by scaling method. It is striking that both models (CFA and IRT), with their assumptions of full cultural and longitudinal invariance, produced country-level trends very similar to the observed scores. The market-basket approach was employed to account for differences across countries, but its trend results did not show large deviations either.
To explore a recently proposed method that does not rely on latent variable modeling and measurement invariance across countries, we used the market-basket approach. The three methods produced similar results, which, in the case of the CFA and IRT frameworks, is not surprising. As we mentioned previously, the market-basket approach can be combined with any of these measurement models to obtain plausible responses. We showed that the correlations between individual scores were high across scaling approaches. Because of the stratified multistage sampling design used in TIMSS, the simple random sampling assumed in the procedure for calculating standard errors of estimates does not apply (Rutkowski et al., 2010). Therefore, a limitation of the trend scales is that we have underestimated the standard errors of the means.

Discussion and conclusions
We investigated the trend component of two affective scales (i.e., extrinsic and intrinsic motivation) by employing three different scaling methods, while exploring relevant characteristics related to measurement bias and longitudinal linking. We applied two widely used latent variable modeling approaches to real data drawn from mathematics ILSAs spanning over 35 years. We tackled issues of the cross-cultural measurement of affective constructs, addressing method, construct, and item bias. We showed how assumptions regarding measurement invariance affected the analytical process when we excluded Japan from the latent variable analyses. The popularity of the middle option in Japan in 1980 was consistently and considerably high, indicating a possible cultural difference compared to the other countries.
We performed the analyses for five countries that participated in the seven cycles of SIMS and TIMSS. It is worth mentioning that the analysis can be extended to the other cycles and participating countries. One additional aspect to consider in such an extension is how each method handles newly included data. For instance, the market-basket approach can incorporate more data into the longitudinal scales without recalibrating the original scales or employing equating methods.
We cannot ignore the fact that the affective scales analyzed here had, until recently, not been designed for trend measurement. Hence, the maxim introduced by Beaton and Zwick (1990), "When measuring change, do not change the measure" (p. 10), did not entirely apply to the data in this exploratory analysis. The scales varied in length, and the number of anchor items between them was smaller in the early years. Furthermore, the number of response options changed from 1995. The handling of the middle option in SIMS posed a limitation on the study. Another limitation, from the perspective of cross-cultural measurement over time, is the possibility of changes in the translation and cultural adaptation procedures. The psychometric validation of the attempted scaling could be extended by inspecting the instruments in their original language. Finally, one of the most challenging remaining questions is whether changes in wording affect the internal relationships among items (e.g., the factor structure). Since we explored non-identical sets of items over time, the number of items varies at almost each time point, which makes the investigation of the effects of changes in item wording challenging.
We believe that despite these challenges, the old international mathematics studies, SIMS and even the First International Mathematics Study (FIMS) administered in 1964, provide rich data for secondary analyses. It is important to evaluate the possibilities of linking these studies to the recent ones because the potential country-level longitudinal analyses that can stem from such trend scales might serve as powerful approaches to investigate causal research questions. Future research taking a closer look at the changes over time could reveal mechanisms in the relationship between motivation and achievement across countries. For instance, the relative proportion of females choosing a mathematical track in upper secondary and higher education or STEM-related professions is still unreasonably low and unrelated to mathematics achievement in many countries. It is potentially interesting to explore the relationship between the (decreasing) trends of the gender gap in mathematics achievement (Mullis, Martin, & Loveless, 2016) and the long-term trends in motivation towards mathematics.

Appendix E. Delta plots of the extrinsic motivation scale bridges

E. Majoros et al.

Table 2
Common Items of the Extrinsic Motivation Scales in SIMS and TIMSS.

Table 3
Common Items of the Intrinsic Motivation Scales in SIMS and TIMSS.

Table 4
Number of Selected Items in the Affective Scales of the Respective Studies.

Table 5
Percentage of Middle Responses in SIMS, Extrinsic Motivation Scale.

Table 6
Percentage of Middle Responses in SIMS, Intrinsic Motivation Scale.