Measurement Invariance of the Prosocial Behavior Scale in Three Hispanic Countries (Argentina, Spain, and Peru)

In a growing context of multiculturalism, prosocial behavior is important to build effective social exchange and service orientation among university students. The present study investigates prosocial behavior from a psychometric approach, to obtain evidence of the internal structure of the prosocial behavior scale (PS), in 737 young people enrolled at universities in Argentina (207), Spain (310), and Peru (220). First, the clarity of the items was explored in the three countries; second, possible irrelevant patterns of response, such as the careless and extreme responses, were evaluated; third, the non-parametric Mokken methodology was applied to identify the basic properties of the scale score; fourth, the structural equation modeling (SEM) methodology was used to identify the properties of the internal structure (dimensionality, tau-equivalence) of the latent construct; fifth, the measurement invariance according to sex (intra-equivalence) and country (inter-equivalence) was examined with the SEM methodology and other complementary strategies. Finally, reliability and internal consistency were evaluated both at score level and at item level. Implications for use of the PS instrument are discussed.


INTRODUCTION
Prosocial behavior includes those actions tending to help or benefit other people, irrespective of the intention to be pursued with this help. Such behavior is the result of multiple individual and situational factors including parental variables and empathic traits (Eisenberg and Fabes, 1998). It is understood as a tendency to give rise to actions, belonging to the sphere of habits, practices and social interactions, that are characterized by the beneficent effects they produce on another person (Caprara, 2005). Moreover, Roche (2010) argued that truly prosocial behavior consists of help given to other people or groups in the absence of extrinsic or material reward. There are several different types of actions that make up prosocial behavior, such as physical and verbal help, material giving, verbal comfort, confirmation and positive appreciation of the other, deep listening, empathy, and solidarity, as well as the expression of unity with others (Roche, 1999).
Research on prosociality in diverse cultures has increased over the last few decades (Murakami et al., 2016;Luengo et al., 2017;Rodriguez et al., 2017;Gerbino et al., 2018). This has allowed researchers to carry out several meta-analysis studies on prosociality (Malti and Krettenauer, 2013;Shariff et al., 2016;Mesurado et al., 2019b), that show the value of clinical and educational interventions in encouraging prosocial behavior. For example, based on their own meta-analysis, Mesurado et al. (2019b) concluded that intervention programs aimed at promoting prosocial behaviors showed moderate effectiveness, while intervention programs focused on the prevention of aggressive were highly effective.
Since the construct of prosociality implicates a wide range of different behaviors, its measurement distinguishes between indicators of global prosocial behavior and prosocial behavior expressed in specific situations (Carlo and Randall, 2002). Measures of global prosocial behavior are defined as measures that evaluate personal tendencies to exhibit a series of prosocial behaviors across diverse social contexts and for different motives. An example of this type of global measure is the Prosociality Scale of Caprara et al. (2005). These global measures tend to characterize certain people as prosocial, distinguishing them from others who are not. However, global measures have limited application in research, since they do not investigate possible moderators such as in-group and out-group effects on tendencies to help, among other contextual factors. In contrast, measures of prosocial behavior in specific situations can provide information about more tightly delimited conceptualizations of prosociality, as well as supporting the elaboration and intercorrelation of different types of prosocial behavior. One example of this point is research that distinguishes between different recipients of aid, in terms of measuring the prosociality directed toward relatives, friends and strangers in adolescent populations (Padilla-Walker and Christensen, 2011;Padilla-Walker et al., 2015;Mesurado et al., 2019a). Such specific measures see prosociality as a multidimensional construct, which can be a very beneficial approach when studying interactions between prosociality and other variables (Carlo and Randall, 2002). However, the usefulness of a global or specific approach to measuring prosociality is not intrinsic to the measure itself, but is conditioned by the purpose of its use in basic or applied research, or in professional practice.
Another example of global prosociality measures is the Prosociality Scale (PS; Caprara et al., 2005), which describes the individual variability of prosocial behavior as a stable attribute, and is designed for young adults. It consists of 16 items to answer on an ordinal scale of 5 options ranging from "never/almost never" to "always/almost always." Based on the original study by the instrument's authors , we can distinguish psychometrically the items that provide high information (items 3, 5, 7, 8, 10, 12, and 13), moderate information (items 4, 6, and 9) and low information (items 1, 2, 11, 14, 15, and 16). The PS has had some international diffusion, with studies in various countries. For example, investigations have been conducted with Colombian adolescents using the reduced version of the scale (Luengo et al., 2017). Studies have also been carried out in Japan (Murakami et al., 2016), and in a sample of Argentinian adolescents (Rodriguez et al., 2017). In the latter study a confirmatory factor analysis arrived at a scale of two dimensions (prosocial behavior and empathy and emotional support) while reducing the number of items to 10, and achieving an internal consistency of α = 0.78. Cross-cultural work has also been carried out on samples of children from Colombia, Italy, Jordan, Kenya, the Philippines, Sweden, Thailand, and the United States (Pastorelli et al., 2016), although data on the reliability and validity of the Prosociality Scale instrument were not presented in that study. It is worth mentioning that the aforementioned studies were carried out on children and adolescents, an age range for which the scale of Caprara et al. (2005) was not specifically designed. Their results should therefore be interpreted with caution, and should not automatically be generalized to adult populations.
Since the Prosociality Scale is recognized internationally, it is of great scientific and practical interest to evaluate its psychometric characteristics and variance across diverse populations. Additionally, studies that use a version of the scale in Spanish are particularly valuable since they are moderately scarce compared to studies that a use a version in English. Indeed, a recent systematic review of measures of prosocial behavior (Martí-Vilar et al., 2019) reported that PS is among the measures with few validation studies carried out adults, but with excellent internal consistency. The relationship between the importance of the construct and the its measurement in adults does not seem to be isomorphic, since there are few validation studies of internal structure and correlation studies with other relevant constructs: except for a study by Rodriguez et al. (2017), this information is practically absent in the Ibero-American population. These authors performed a confirmatory factor analysis on a population of Argentinian adolescents. In their study, a 10-item model with two dimensions was obtained, namely prosocial behavior on the one hand and empathy and emotional support on the other. In turn, they analyzed the convergent validity of the instrument, obtaining significant correlations with some dimensions of the scale of prosocial tendencies produced by Carlo and Randall (2002).
Investigations that have used the Prosociality Scale have rarely addressed certain aspects that could help to understand its psychometric functioning. For example, the functioning of the items within a tau-equivalent model has not been analyzed; this property is a condition for the use of the reliability coefficient type (Graham, 2006;Trizano-Hermosilla and Alvarado, 2016), as well as for identifying the homogeneity of the representation of the content and interpretation of the score. In this sense, because the factor loads signify the strength with which the items are connected to (represent) the latent construct (Trizano-Hermosilla and Alvarado, 2016), the similarity or dissimilarity of factor loads can influence interpretation of the score. Therefore, different factor load patterns (e.g., item 1: 0.80, item 2: 0.50, item 3: 30, item 4: 0.30; compared with item 1: 30, item 2: 30, item 3: 50, item 4:0.80), may not lead to the same interpretation of the construct.
On the other hand, all studies that have used the Prosociality Scale (except Caprara et al., 2005) have applied linear models that included latent variables (in other words, structural equation modeling, or SEM); however, a deeper analysis of the instrument requires considering that the interpretation rests on the score observed, and therefore a non-parametric methodology that uses the observed score as the main reference for the adjustment of the items may be necessary, and a prerequisite for the application of parametric models such as linear SEM modeling (Dima, 2018). The sequential or joint application of several procedures to identify the psychometric properties of a measure can be better understood within a framework of sensitivity analysis, in which the results of various methods or modifications of the data are contrasted, in order to evaluate the eventual convergence. This has been especially applied in the investigation of equivalence of measures (Hambleton, 2006;Teresi et al., 2009) and adaptation of evidence (Dima, 2018). Finally, due to the different informative strength of each PS item (as found by Caprara et al., 2005), it is plausible that each item is differently sensitive to factors such as sex; in this sense, the differences between groups in the means can mask fine differences at the item level. More precisely, descriptive analysis at the item level is relevant because each unit represents an elementary behavior of the intended construct, and its statistical behavior can help to better understand this, and precede the use of advanced analyses (Dima, 2018). Additionally, due to the apparent tendency to use single-item scales in selfreport and epidemiological investigations, information at the item level can contribute to more informed choices in such uses.
The aim of the present study was to evaluate the psychometric functioning of the Prosocial Conduct Questionnaire in a context of intercultural use, focused on university participants from three Spanish-speaking countries: Argentina, Spain, and Peru. Specifically, the central objective was to obtain evidence of the validity of the internal structure of the Prosocial Behavior Questionnaire in three Hispanic countries, through the exploration of scalability, dimensionality, invariance of measurement and reliability of internal consistency. The aspects evaluated in this study may be specific to their use in these countries, and are linked to the evidence on the internal structure of the scale, which is a key component for other sources of evidence of validity (Lewis, 2017). Dimensionality, invariance and reliability can be considered fundamental contributors to the valid interpretation of a score, and together define an instrument's internal structure (Rios and Wells, 2014); that is, the theoretically coherent relationship between the components of a measure that serve as a basis for the interpretation of the score (American Educational Research Association [AERA] et al., 2014). Accordingly, evidence of validity based on the internal structure is critical in conditioning other evidence of validity (Ziegler and Hagemann, 2015). In the present study, scalability was also evaluated as a property of the score for establishing ordinal differences between subjects based on their observed scores (Mokken, 1971;van Schuur, 2003;Smits et al., 2012). This aspect is not necessarily equal to the dimensionality of an instrument, and therefore must be evaluated in a complementary way (Smits et al., 2012), usually with the non-parametric approach of Mokken (1971). The equivalence or invariance of measurement, as well as the similarity of internal consistency, and the sex differences in the level of total score and individual item, were also considered. Apparently, this is the first study that tests the dimensionality and invariance of the Prosociality Scale in several Ibero-American countries, and thus represents an advance toward the global use of the instrument.
The total age in the sample was: M = 21.42, SD = 4.11, Min = 16, Max = 53); between the samples (Argentina: M = 20.67, SD = 2.88; Spain: M = 21.66, SD = 4.35; Peru: M = 21.79, SD = 4.66), the differences were statistically significant (F[2,733] = 4.926, p < 0.01) but the effect size (ω 2 = 0.01) was very small (Field, 2013). The differences between distribution of semesters in Peru and Spain (Kolmogorov-Smirnov D = 0.386, p < 0.01), and Peru and Argentina (Kolmogorov-Smirnov D = 0.433, p < 0.01) were statistically significant, while those for Spain and Argentina were not (Kolmogorov-Smirnov: D = 0.084, p > 0.10). But the practical significance of these differences, in terms of similarity of frequencies (overlap, PSR, Rom and Hwang, 1996) tended to be high: PSR Peru-Spain = 80.7%; PSR Peru-Argentina = 78.4%; PSR Spain Argentina = 95.8%. According to previous studies of the validation and substantive use of the instrument in the adult population (Murakami et al., 2016;Pastorelli et al., 2016;Luengo et al., 2017;Rodriguez et al., 2017), the various sub-samples of our participants were not differentiated from one another in relation to sampling (non-probabilistic), coverage (young adults), or main activity (university studies), and therefore they can be thought of as generally aligned.

Demographic Sheet
A questionnaire was compiled to gather sociodemographic information, namely country, city, age, sex, level of studies, and academic semester.
Prosociality Scale  This is a self-report measure that quantifies prosociality as a stable attribute in the adult population. It consists of 16 ordinally scaled items each with five response options. The response instructions posit a generic and timeless context of prosocial behaviors. In relation to the internal consistency of the instrument, the original authors reported unidimensionality, a wide range of psychometric precision, internal validity of the items, and internal consistency of α = 0.91 . The Spanish version used here come from Rodriguez et al. (2017) for the Argentinian population.

Data Collection
The study was authorized by the Ethics Committee of the Universitat de València. Participants were contacted at universities in Argentina, Peru, and Spain. If they wished to participate in the research, they were sent a link to an electronic form, where they had to complete a process of informed consent to answer the questionnaires. The entire sample was collected online.

Analysis
The analysis was divided into analysis of irrelevant answers, descriptive analysis of item responses, content validity testing on the clarity of the items, scalability of the score and the items, dimensionality of the score, internal consistency of the reliability estimates, and invariance and measurement equivalence.

Inattentive/irrelevant responses to content
For the present study, inattentive and irrelevant responses were explored, because answering questionnaires through a web platform has generally been associated with this type of irrelevant response pattern (Johnson, 2005). To identify this problem, the distance D 2 (Mahalanobis, 1936) was used to identify subjects who behaved as multivariate outliers; and to confirm this identification, the variability of intra-individual response was examined (IRV; Dunn et al., 2018). Both are effective techniques for this type of problem (Meade and Craig, 2012) and were implemented using the careless program (Yentes and Wilhelm, 2018).

Descriptive information
Tests of normality related to symmetry (D'Agostino, 1970) and kurtosis (Bonett and Seier, 2002) were used, as well as descriptive statistics to identify the floor and ceiling of each item.

Content validity
This part of the analysis highlighted the clarity of the content. The version of questionnaire used as a baseline of content was validated by Rodriguez et al. (2017). An independent evaluation of the content carried out by the authors indicated that it was phrased without apparent local expressions, and seemed generalizable across the participating groups. However, as (a) Spanish speech is generally characterized by local variations in the use of some words, and (b) there may be discrepancies in assessing clarity between expert judges and the participants themselves (Merino-Soto, 2016), we first corroborated whether the phrasing of the items was clear to the participants. For this purpose, they were given a score clarification form for the items. Each participant read the instructions first, and then scored each item using an ordinal scale of five points, from Not clear (1) to Completely clear (5). The ratings were analyzed using the V coefficient (Aiken, 1980), and their asymmetric confidence interval was computed using the ICAiken program (Merino-Soto and Livia, 2009). This coefficient is often used in content validity studies, to quantify the convergence of qualifying judges between values of 0 (absence of consensus) to 1 (complete consensus). To compare the perceived clarity between the three groups (Argentina, Spain, and Peru), a confidence interval of the difference between the V coefficients was applied (Merino-Soto, 2018). Acceptable clarity was established when the score estimates and the lower limit of the interval were above or equal to 0.60 (Merino-Soto and Livia, 2009).

Non-parametric analysis of scalability
To evaluate the fundamental properties of the instrument scores (Brodin, 2014), regardless of the strong presumptions of the latent variable models, a non-parametric approach (Mokken, 1971) was used to analyze the ordinal items of the Prosociality Scale (Molenaar and Sijtsma, 1988). This approach examines the ability of a score to differentiate the ordinal rank of the subjects or items of a measure. Its results are a prerequisite for more demanding parametric approaches (Brodin, 2014;Dima, 2018). There are several useful guides for conducting the analysis with the Mokken approach (e.g., Stochl et al., 2012;Watson et al., 2012;Sijtsma and van der Ark, 2017;Palmgren et al., 2018), but all converge on examining three basic properties for the completion of the monotonic homogeneity model (MHM; Sijtsma and van der Ark, 2017): (a) scalability of the items, using the H coefficient (Loevinger, 1948); (b) local independence, in which the responses to the items are not mutually influenced, examined by three conditional association indices, W (1) , W (2) and W (3) (Straat et al., 2016); and (c) monoticity, that is, the function of incremental relation between the item and the latent attribute, evaluated by comparing the current and expected number of violations of the monotonic model (Mokken, 1971). The adjustment to this model generally uses the CRIT statistic, a diagnostic of the quality of the scale constructed using the weighted sum of several evaluative indicators. The result is a count of violations of the model, which through either a lax (CRIT > 80; van Schuur, 2003) or demanding criterion (CRIT > 40; Molenaar and Sijtsma, 2000), allows the identification of an excess of violations of the model, which would suggest removing the item.
For the selection of items, the following criteria were applied: (1) the point estimate of the coefficient H should be at least equal to or greater than 0.40 in the total sample; (2) the point coefficient H should be in at least two countries, equal to or greater than 0.40; (3) the lower limit of the IC in 90%, should be greater than 0.35; (4) no coefficient, in its point estimate or its lower limit, should be less than 0.30. This analytical procedure was performed using the mokken program (van der Ark, 2012; R Core Team, 2018).

Dimensionality and equivalence/invariance
To strengthen the assessment of dimensionality, the structured equation modeling (SEM) methodology was applied to identify the final characteristics of dimensionality and measurement invariance. To examine the dimensionality, we used a robust estimator for categorical variables (Muthén, 1984), which adjusts the first and second moments of the χ 2 statistic (mean-andvariance-adjusted unweighted least squares, or WLSMV; Muthén et al., 1997). This method uses a probit link to define the functional relationship between the items and the construct, as well as polychoric correlations between the items and the thresholds estimation to derive more precise parameters (e.g., factor loading) when the distributional asymmetry is strong (Sass et al., 2014;Li, 2016a,b). Potential changes in the re-specification of the measurement and invariance model were detected by (a) the modification index, at the nominal level 0.05 (WLSMVχ 2 > 3.840), and (b) in statistical power (Saris et al., 2009). IM is also a means of assessing local independence within SEM modeling (Douglas et al., 1998).
The sensitivity of each item with respect to its relation with the construct was estimated by means of a measure equivalent to the signal-to-noise ratio (SNR), which is generally an informative measure of the quality of the item, based on two information components: item discrimination and "noise" (residual variance not relevant to the construct; Ferrando, 2012a,b;Ferrando and Lorenzo-Seva, 2013). The SNR was obtained by squared factor loading (λ 2 ) on 1-λ 2 . This relationship is usually binding with the IRT model (Cheng et al., 2012;Ferrando and Lorenzo-Seva, 2013), and is generally part of the reliability estimation for identifying the maximum variability linked to the construct (Bacon et al., 1995;Hancock and Mueller, 2001).
The heterogeneity of factor loads was tested by adjusting to the tau-equivalent model, implemented with a robust procedure (Yuan and Zhang, 2012) in the coefficientalpha program (Zhang and Yuan, 2015). The adjustment of the SEM model was evaluated with several practical indexes and conventional cut points: ≥0.95 for CFI and TLI; ≤0.08 for SRMR (Ullman, 2001). Although RMSEA can be recommended in modeling with categorical variables (Hutchinson and Olmos, 1998), it was not used to decide the adjustment due to its poor performance in models with small degrees of freedom (Kenny et al., 2015;Taasoobshirazi and Wang, 2016).

Invariance/measurement equivalence
This procedure was carried out in two phases, which looked at intra-country and inter-country equivalence. The intra-country equivalence was investigated in relation to participant sex, controlling the variability of the attribute effect (measured by the total score); to reduce the effect of cells with a small number of subjects (due to the distribution), the observed conditioning score (total score) was segmented into quintiles. The analysis used was the non-parametric differential item functioning (DIF), implemented with contingency tables for ordinal variables. The partial gamma coefficient was used (γ p ; Schnohr et al., 2008), with effect levels defined as weak (>0.15), moderate (0.16-0.30), and strong (>0.31). For the purposes of this study, general interpretation suggestions were used for γ p (e.g., >0.60 = strong, >0.30 = moderate, and ≤0.30 = weak; Healey, 2012). This DIF procedure was required to address the small sample size of the compared groups (Lai et al., 2005;Güller and Penfield, 2009).
After verifying the intra-country equivalence, we continued by analyzing the equivalence between countries, through a sequence of steps appropriate for categorical variables (Wu and Estabrook, 2016), starting with a successive implementation of restrictions on the parameters of the items. The configurational invariance was analyzed first, followed by the cumulative restriction of equal thresholds, then the factorial loads, and finally the residuals. The SEM analyses were carried out with the lavaan (Rosseel, 2012) and semtools programs (Jorgensen et al., 2018). Since there are still no clear options of fit criteria for index of modification in the comparison of three groups, a liberal criterion was used to reduce the probability of Type I error. In this sense, Rutkowski and Svetina (2013) proposed less restrictive criteria in the comparison of more than two groups (but specifically, ≥ 10): CFI , TLI and RMSEA , changes less than 0.02; these criteria are similar to those conducted in large-scale studies and comparing more than two groups (OECD, 2014). For comparison purposes, criteria applied to IM were also used between two groups (Chen, 2007): CFI ≤ 0.10 and TLI ≤ 0.10. The convergence of the adjustment indices suggested the decision of indices of modification (IM), but since CFI is optimal in the comparison of nested models (Cheung and Rensvold, 2002) and reduces the Type I error (Elosua, 2011), some doubt can be resolved by the observation of CFI.

Reliability
Reliability was estimated at the item level and the score of each subscale. Regarding the items, the attenuated corrected coefficient (Wanous and Reichers, 1996) was used, given its lower bias and computational ease (Zijlmans et al., 2018); the minimum acceptable value is around 0.30 (Zijlmans et al., 2017). At the level of score, coefficients congruent with the non-parametric model were used (MS coefficient; Molenaar and Sijtsma, 1988), along with linear SEM modeling with the coefficient ω (Green and Yang, 2009) and bootstrap confidence intervals (500 replications) through the coefficientalpha program (Zhang and Yuan, 2015). For comparison purposes, the coefficient α was also calculated.

Inattentive/Irrelevant Responses to Content
Applying the Mahalanobis distance measure (D 2 Median = 13.914, min = 1.469, Q3 = 19.898), one participant (Peruvian) was detected with the maximum distance (D 2 = 138.72), and was 1.92 greater than the subject with the shortest distance (D 2 = 72.09). Although the χ 2 value was lower than the critical value (gl = 16, Bonferroni-α = 0.05, n = 46.03), the individual variability (IRV coefficient) for this participant corresponded with the maximum value of individual deviation (IRV = 1887), and it was also consistent in the identification of D 2 . To reduce the probability that the identified participant was a "positive" or "negative" influential case in the adjustment due to its magnitude compared with the rest of the participants (Pek and MacCallum, 2011), this participant was removed, leading to a total sample of 736 for the following analyses. Table 1 shows the results of the evaluation of item clarity, as part of the content validity analysis. The point estimate of the coefficients was universally over 0.70, and their asymmetric confidence intervals were predominantly over 0.60; this is a minimally acceptable level (Merino-Soto and Livia, 2009). The average clarity in each group showed similarity between Argentinian and Spanish students (about 0.82), while it was comparatively low in Peruvian students (below 0.80), but nonetheless still at a satisfactory level of perceived clarity. For some items, the lower limit of the IC was below 0.60 (item 5 in Spain, item 11 in Peru, and item 8 in the three groups). These items were reviewed by the authors, especially item 8, where the psychometric behavior was observed in order to determine the effect of this relatively low perceived clarity. In the comparison between groups (through confidence intervals of the difference, in agreement with Merino-Soto, 2018), the most frequent discrepancies occurred among Peruvian students (perceived lower clarity) compared to Spanish and Argentinians, but the point estimates and their intervals in Peruvians tended to be acceptable. The lower limit of the interval for several items was around 0.05, indicating that in the population the difference detected might be small. At this stage, it was concluded that the clarity of the instrument was essentially satisfactory in the three groups. Table 2 shows the items were distributed asymmetrically, with the highest density in the high response options; in the total sample, the asymmetry coefficients ( √ b 1 ) varied between −0.210 (item 11) and −1.065 (item 10). The kurtosis (b 2 − 3) showed more variability, with positive and negative values, and between −0.496 (item 11) and 1.041 (item 2). Overall, the items showed moderate or strong departures from normality (D'Agostino-Pearson K 2 between 15.3 and 112.5, p < 0.01).

Descriptive Statistics of the Items
In relation to some demographic variables (sex and age), in the total sample the Spearman correlation between the items and age was around zero (between −0.06 and 0.064), and predominantly without statistical significance. In each group, this trend was similar (Argentina: median = 0.042; Peru: median = −0.039; Spain: median = −0.024). Regarding sex, Spearman correlations varied between 0.030 (item 9) and 0.189 (item 4, female > male), and in each country it was also predominantly close to zero in Peru (median = 0.032), but around 0.10 in Argentina (median = 0.118) and Spain (median = 0.159). Finally, due to the tendency of responses toward high scores, several items in each country showed a ceiling effect, such that the minimum response was frequently option 2 or 3, especially in Spain and Peru. To align the analysis of latent variables with the methodology for categorical variables, options 1 and 2 were therefore integrated on these items, leaving the rest unmodified.

Non-parametric Analysis
Scalability Regarding scalability (Table 3), in the first iteration of the analysis several items showed H scores below 0.40 in the three countries, as well as low levels of scalability in their confidence intervals (items 2, 9, 11, 12, and 16); other items showed comparatively weak H in at least two countries (items 1, 4, and 14). These items thematically corresponded to behaviors of  sharing personal resources (2, 9, 11, and 14), taking another's perspective in situations of discomfort (i.e., empathy; 12 and 16), and comfort and willingness to give help to others (1 and 4). The items that were satisfactorily maintained according to the initial criteria were items 3, 5, 6, 7, 8, 13, and 15, whose contents were distributed over helping behaviors (3, 6, and 7), empathy (5 and 8), and giving supportive company to others (13 and 15). Although item 10 (interpreted as providing help through emotional comfort) partially met the initial criteria, it was not included in the resulting version so as not to overemphasize the "helping" component in the instrument score. In the left section of Table 3, the results of the final iteration are shown. The scalability coefficient for the scale was 0.50 in the countries, and around 0.50 for each item (except item 15 that tended to be a little lower, though still close to 0.50). All were statistically significant with an alpha of 0.05 (for the items, z between 28.88 and 33.83; for the total score, z = 59.83).

Monotony
Finally, no violation of monotony was detected in the version obtained from seven items (see left side of Table 3). Based on the results of the non-parametric analysis as a whole, the obtained version had the following characteristics: the scalability of the score in the total sample and in each country was greater than 0.50, and its population variability was greater than 0.48, while each item showed a moderately similar magnitude of scalability, but generally greater than 0.50.

Analysis of Dimensionality (SEM)
Because the Prosociality Scale was apparently designed as a congeneric one-dimensional measure (without restriction of statistical equality between its items), the evaluation of the adjustment started with this model. The adjustment of the congeneric model with the 16 complete items was satisfactory according to the practical indices measure (see Table 4, results of the full version). The analysis of the modification indices indicated that potential mis-specifications were inconsistent according to the criteria of statistical power and practical significance (Saris et al., 2009). Given the strength of the adjustment and some trivial mis-specifications, this model was initially retained without add re-specifications. Although all the factorial loadings were statistically significant (z > 10.0), they varied from 0.500 to 0.811, which related to a large amount of variance in the construct (between 0.250 and 0.658, respectively). This suggested a wide range of variability (levels of 0.40, 0.50, 0.60, and 0.80; Beauducel and Wittmann, 2005). The SNR for each item emphasized the difference between the factorial loads, varying from 0.333 to 1.992, suggesting that the information relevant to the represented construct could range between very weak and very strong.
According to the results of the non-parametric analysis, a second iteration of the confirmatory factor analysis (CFA) was conducted, and the results of the model adjustment are shown in Table 4 (results of the reduced version). These indicate a satisfactory adjustment, which was practically similar in the  specific indices compared with the full version ( CFI = 0.005, TLI = 0.001, SRMR = 0.004). The adjustment without the recategorized items was also satisfactory, WLSMV-χ 2 = 155.3 (gl = 14, p < 0.01; CFI = 0.985, TLI = 0.978, SRMR = 0.061). These results were superior to the adjustment criteria chosen. All factorial loads were greater than 0.60, varying between 0.675 and 0.822; the change of the loads compared with the loads of the full version varied between | 0.1%| and | 7.9%|, while the factor loading of items 5, 6, 7, and 8 showed a small increase (between 1.4 and 5.7%). The adjustment with the reclassified items was indistinguishable from the results obtained before recategorization of the items (see Table 4).

Equivalence and Measurement Invariance
The intra-country analysis (see left part of Table 4) found that, once we controlled the performance on the observed score for the number of statistical tests (Bonferroni adjustment, p = 0.007), the tendency of the partial gamma coefficients (γ p ) was essentially concentrated on the weak level (≤0.30). The items detected by possible uniform DIF (3 in Peru, and 5 in Spain) were examined in their content, and it was established that there was no reason to recognize any potential sources of DIF; therefore at this stage they were dismissed. On the other hand, although there were variations in the magnitude of the γ coefficient (not shown here) across quintiles, the homogeneity of the coefficients in the quintiles was confirmed (H-χ 2 < 15.0, Bonferroni adjusted p = 0.007), suggesting absence of non-uniform DIF.
Regarding the invariance/equivalence between countries, the baseline (configurational) model, along with the remaining models that included cumulative constraints, showed that the compared parameters (factorial loads, thresholds and residuals) changed only trivially ( Table 5). Considering the chosen criteria (Chen, 2007;Rutkowski and Svetina, 2013;OECD, 2014), the equality constraints for each level of invariance produced results that suggested no invariance, and therefore it was concluded that there was compliance with the invariance across the three levels evaluated.

DISCUSSION
The present study applied psychometric methodology and rational-theoretical evaluations to refine the Prosociality Scale constructed by Caprara et al. (2005) for adult populations. Given the cross-cultural context of this study, it was particularly challenging to show the invariance of the scale's psychometric properties, and to date this is the only attempt at a cross-cultural psychometric exploration of the scale across several Spanishspeaking countries. When the items were examined, they were characterized as not being distributed normally, characteristically with negative asymmetry. Also, the answers were oriented toward high response options. This trend was similar among the three countries examined. Associations with age were predominantly distributed around zero, both in the total sample and within individual countries. In contrast, relationships with sex were predominantly small in Spain (women > men), between trivial and small in Argentina (women > men), and completely trivial (around zero) in Peru. Considering that the differences in functioning of the items were trivial with respect to the sex of the participants, this finding for some individual items could lead to future explorations of differences at the level of the total score, but due to the strong asymmetry in the sex distribution in our samples, it would be best to avoid overinterpreting these results.
The fundamental psychometric criteria of our study were first based on a non-parametric method, created to evaluate the properties of measures that serve for ordering people based on their observed scores. Interestingly, the results of the application of the SEM and Mokken methodologies showed two things: first, they tended to show convergence in the items with lower scalability and covariation with the construct, as identified in the study by Caprara et al. (2005); and second, items with comparatively poorer properties were more clearly identified by the non-parametric method (Mokken). Specifically, with the SEM method the items in general showed factor loads that are usually acceptable in the literature (>0.30 or >0.40), while these same levels applied to the H coefficient suggested a low scalability, and therefore lessened the discriminative ability of the observed score.
The content of the resulting scale was distributed over behaviors subsumed along one dimension, partially converging with the logic of another prosociality instrument created in one of the participating countries (Argentina), which is also applicable to university students (Auné et al., 2014). In that study, the instrument was multidimensional, with correlations between weak and moderate in the heterogeneous item-construct relationship (factorial loads). The two dimensions identified were interpreted as representing empathic behavior on the one hand, and initiative to help people on the other. In its analytical exploration, the former eigenvalue was very large in relation to the remaining values, and could suggest the exploration of a general latent variable, or that items with common variance load strongly toward a latent general factor. However, the difference between the one-dimensional model proposed here, and the multidimensional model proposed in the study of Auné et al. (2014) is influenced by the design of the theoretical constructions, and a combination of post hoc conceptual and empirical criteria to refine each instrument. Nevertheless, in our opinion the higher-order construct is prosocial behavior, supported by strongly intercorrelated specific content items. Thus, in the present study, conceptual decisions balanced purely empirical and mathematical decisions.
One of the evaluated characteristics was the adjustment to a tau-equivalent model (constraint of equality of factorial loads) compared with a congeneric model (in which factor loads were free to vary), which allowed us to identify the similarity in the construct representation of items and the appropriate reliability models. As in other Latin American studies (e.g., Auné et al., 2014Auné et al., , 2016, the heterogeneity of factor loads led to doubt about the appropriateness of internal consistency estimates such as the alpha coefficient, which assume the tau-equivalent model among the items. In the present study, although the statistical test of the difference between the congeneric and tau-equivalent models was statistically significant, the practical discrepancies between the two did not seem to be moderate or strong, but rather trivial. This leads to the conclusion that the items essentially showed similarity in their representativeness of the construct, and similar sensitivity to differentiate individual variability in the measured attributes. An additional advantage of adjusting the scale to a tau-equivalent model is that it helped to recover weak factorial models (Ximénez, 2006(Ximénez, , 2016, and to avoid the rejection of models with salient factorial loading of 0.50 or less (Beauducel and Wittmann, 2005). Therefore, it is possible that the structure of the present version of the instrument can be replicated in future studies.
There are discrepancies in results regarding differences in prosociality according to participant sex (Martí-Vilar and Lorente, 2010). Some authors have argued that women show higher levels of prosociality, differences that are more marked in adult life (e.g., Eisenberg and Fabes, 1998). Other authors have noted that these sex differences depend on the motivation or type of prosocial behavior (Carlo et al., 2003;Auné et al., 2017). A plausible hypothesis that could explain this inconsistency is that certain instrument items but not others are psychometrically invariant. However, this was not verified in previous studies.
Although this study was carried out on a Spanish-speaking population, there are many differences between the societies of Spain, Argentina and Peru. Carballeira et al. (2014) showed that Latin American societies are more influenced by a collectivist culture, while Spanish society is more influenced by individualism. Such differences allow us to see the importance of this study since it involved testing the Prosociality Scale in countries with diverse cultural characteristics.
Due to the inconsistency of findings on the effect of sex on the variability of self-reported prosocial behavior, the investigation of equivalence was a preliminary, sine qua non, stage for the new version of the instrument. We found that, once the effect of the total score (measured as such) was controlled (using the DIF analysis approach), the differences in the answers were not outside the level of sampling error, and were generally trivial in magnitude. In the Peruvian and Spanish participants, two items worked differentially when the effect of the total score was controlled. Although the statistical detection of DIF does not directly indicate the absence of real bias (Lai et al., 2005), this is an avenue for further investigation. A qualitative analysis was beyond the objectives of this study, and thus the sources of this differential functioning were not qualitatively explored, so the conclusion of equivalence between men and women within each country is something to be tested by subsequent studies. Although this conclusion should be interpreted in the context of the limitations of the study (sample size and asymmetric proportion of men and women in each country), our results with the new reduced version can also be considered internally valid due to the strength of the unidimensional measurement model. As previously found, the unidimensionality of the new version is characterized by items with strong factorial loads, high signal-tonoise ratio, and an interdependent content relating to different observed behaviors.
Regarding the limitations of the study, one of these is the sample size. This can be considered large (>500) in terms of the total group size (Finch and French, 2008;Ximénez, 2016;Finch et al., 2018), but for the intra-country analysis it can be considered small (Ximénez, 2016). This could explain certain idiosyncratic variations between countries found in this sample. The intra-country sample sizes of our study, however, are typical of the common situation of small (or moderate) samples in social science research, and particularly in psychology (Beauducel and Wittmann, 2005). As more generally in psychology, the sample size of this study, in the total sample and in each subgroup, may generate suboptimal conditions for estimating psychometric parameters and their potential replicability. Although this problem is shared with many studies in the social sciences in general, and in psychology in particular (Beauducel and Wittmann, 2005), other aspects should also be considered to evaluate the potential replicability of our results: for example, the high magnitude of the factorial loading, as well as the convergence between the methodologies that were applied, and between the levels of statistical significance and practical significance that were found. Indeed, the application of several methods to identify dimensionality (within a framework of sensitivity analysis) can lead to more confidence in the results obtained, given the convergence observed.
A second limitation of the study is the asymmetric proportionality between men and women. However, the distribution of men and women in the sample may reflect current sex distributions among undergraduate students of psychology in Argentina, Spain, and Peru (and indeed other countries). Anecdotal evidence from the authors regarding said distribution supports this idea. A third limitation was the criterion used to decide on measurement invariance, since although the results of the adjustment met conventional criteria (≥0.90 or 0.95, Hu and Bentler, 1999) and other revised criteria (≥0.96; Yu, 2002), such criteria continue to be the subject of debate and further methodological research. This seems to be more prominent when comparing more than two groups (but less than ten), and in the context of asymmetric distribution of participants and moderately small sample size. The criteria applied in the present study (Chen, 2007;Rutkowski and Svetina, 2013;OECD, 2014) might produce Type I or II errors, and our criterion was essentially liberal. As the present study is one of the first of its kind, this decision should be re-evaluated for future studies. However, one aspect that balances this problem was that the approach of evaluating invariance/equivalence (applied to categorical variables) tends to yield robust and sensitive performance (Kim and Yoon, 2011;Sass et al., 2014). Another limitation is that the possible effect of the social desirability of the responses was not verified; this problem may have been reduced by the anonymity of data collection, or it may show correlations between moderate or weak (Rodrigues et al., 2017), and the reader is suggested to assess our results in the context of this limitation. Finally, other evidences of validity are required to corroborate the theoretical representation of this modified version of the instrument. Future studies should focus on the limitations of the study to advance the replicability of the results, as well as to obtain other evidence of validity required to open the way to substantive research with the instrument re-constructed here. This would contribute to our knowledge of prosociality measures, which are still an emerging area of investigation in measurement issues (Martí-Vilar et al., 2019).

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of the United Nations Educational, Scientific and Cultural Organization (UNESCO), Declaration of Helsinki, and indicators of the Ethics Committee of the Universitat de València, No. H14820253925 (February 2, 2017). The studies involving human participants were reviewed and approved by the Ethic Committee of Universitat de Valéncia. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
MM-V, CM-S, and LR designed the research and collected the data. CM-S analyzed the data. CM-S and LR interpreted the data. MM-V, CM-S, and LR drafted the manuscript. All authors critically revised the manuscript and gave their approval to the final version to be published.