“Descriptive analytics: its power to test the applicability of cross-national scales in exploratory studies”

Conventional methodology for validating measures in consumer research relies on structural equation modeling. However, this procedure requires a fairly large sample size and a clear conceptualization of the relationship between individual items and various scale dimensions. Neither of these requirements may be met in exploratory cross-national studies. Hence, this paper addresses scale validation issues in exploratory cross-national research, where sample size is a major concern. Specifically, it uses cross-national data on the vanity measure as an exemplar and a battery of descriptive analytics to show how to assess scaling assumptions, reliability, and dimensionality of consumer behavior measures. The scale validation procedure the authors describe in this paper has implications for researchers who use multi-item rating scales as measures of consumer behavior constructs.


Introduction 
International business studies typically involve the application of consumer behavior measures to investigate cross-national differences (Halkias, Davvetas and Diamantopoulos, 2016; Durvasula and Lysonski, 2015). While some of those studies may have used established measures (e.g., CETSCALE, SERVQUAL), others may be exploratory in nature. In either case, researchers use multi-item rating scales based on the Likert scale format to measure the underlying construct. Consumer responses are then summed (or averaged) to form an overall score. The validity of measurement scales, however, is based on certain assumptions (e.g., internal consistency, external consistency, and unidimensionality). When those scaling assumptions are not met, we have no way of knowing whether observed cross-national consumer differences on the summed scores are due to translation problems, to country-specific differences in the definition of the consumer behavior construct, or to true consumer differences on the underlying construct. Hence, it is crucial to design valid cross-national measures that satisfy underlying scaling assumptions (Steenkamp and Baumgartner, 1998).
Conventional methodology for assessing cross-national scale applicability calls for an application of confirmatory factor analysis and the use of structural equation modeling (SEM) (Steenkamp and Baumgartner, 1998). However, it has certain limitations, the most notable being that SEM is sample-intensive. For cross-national researchers, such limitations are likely to arise because scale analysis commonly takes place first in pilot or exploratory studies, where sample sizes can be rather small (Netemeyer, Durvasula and Lichtenstein, 1991). Some of the well-documented limitations of SEM are non-convergence and improper solutions (Anderson and Gerbing, 1985; Boomsma, 1985), as well as bias in the estimated factor loadings and standard errors (Anderson and Gerbing, 1985). Evidence suggests that this bias in parameter estimates exists in small samples, irrespective of the estimation procedure (e.g., maximum likelihood) used (Benson and Fleishman, 1994; Dolan, 1994). Furthermore, in exploratory cross-national studies, researchers may be performing scale analyses without well-defined expectations about the structure of the items and their relationships to various measures. A key objective of those studies may be to determine whether a set of items forms a unidimensional scale consistently across various countries. As such, there is a need for a method that is easier to use than SEM in exploratory research or in studies that are based on small samples.
Researchers who work with small samples, but who are unfamiliar with SEM, often rely on reliability analysis and exploratory factor analysis. Yet, reliability analysis is restricted to assessing internal consistency, because only items from a single scale are considered. Internal consistency addresses only one of the measurement issues. The other major issue is dimensionality, which is not addressed by reliability analysis. In contrast, exploratory factor analysis is designed to assess items from multiple scales. However, as the technique is not rooted in a measurement model such as the classical test model (Saris and Hartman, 1990), it only provides an indirect test of unidimensionality. Moreover, as exploratory factor analysis does not permit researchers to constrain items to load on specific factors, results based on exploratory factor analysis can even be misleading (Steenbergen, 2000).
In sum, existing approaches for assessing scaling assumptions in exploratory cross-national research studies are inadequate, particularly when small sample sizes are a concern. The goal of this paper is to develop a tool kit of descriptive analytics that helps to assess scaling assumptions and the cross-national applicability of measures used in consumer research. We believe that our paper adds to the research stream in cross-national studies that focuses on measurement issues. In the remainder of the paper, we explain the importance of dimensionality in cross-national research, outline the descriptive analytics based approach for assessing cross-national scale applicability, present the results based on an analysis of four-country data on vanity, and conclude with a discussion of the proposed approach.

1. The importance of measure dimensionality in cross-national research
Dimensionality of consumer behavior measures is dependent on the behavior of individual scale items. Scale items are uni-dimensional if they satisfy two conditions: internal consistency and external consistency. For a scale to be internally consistent, scale items should be associated with each other. Further, the correlations among scale items must be attributable entirely to their association with a common underlying construct or dimension (Gerbing and Anderson, 1988). External consistency, on the other hand, implies that no scale item should tap more than one construct. Any correlation between items from different scales, then, can be attributed entirely to the correlation between the underlying constructs (Gerbing and Anderson, 1988). To establish scale dimensionality, it is, therefore, imperative to examine both internal consistency and external consistency of scale items.
Establishing dimensionality of consumer behavior measures, in turn, is of paramount importance in cross-national research (Clark and Watson, 1995; Netemeyer, Bearden and Sharma, 2003). In the process of operationalizing latent constructs, researchers often use composite scores by summing or averaging across items designed to measure the construct of interest. The application of such scores is only meaningful if the items are uni-dimensional. When multidimensional scales are treated as uni-dimensional (i.e., summed or averaged item composites), they could result in interpretational ambiguities. In other words, if a construct were multidimensional, but all item scores were summed or averaged across dimensions into a single composite score and correlated with a criterion variable, such a correlation would be at best ambiguous and at worst misleading (Durvasula et al., 2006).
Neuberg, West, Thompson and Judice (1997) presented a persuasive case as to why multidimensional scales should not be treated as if they were uni-dimensional. Of critical importance to cross-national research, if the dimensionality of a scale varies from one country to another, any mean comparisons based on composite scores would produce worthless results. As such, establishing dimensionality of measures is a necessary condition for internal consistency, construct validity, and model testing. When measures exhibit validity in various countries, they become cross-nationally applicable.
Hence, applicability presupposes validity. The preferred method for establishing cross-national applicability of measures, and the only one that most scholars in consumer research are acquainted with, is SEM. The following section provides an alternative method for establishing scale validity, one that is more appropriate if researchers are confronted with small samples, especially in exploratory studies.

An alternative approach for establishing scale applicability
While SEM offers a strong test for measure applicability, its application in small-sample cross-national research studies may be limited because of non-convergence of parameter estimates, empirical under-identification of factor models, and improper solutions. In contrast, exploratory factor analysis may seem like a more appropriate method that is tailored to small-sample studies, but this method does not permit researchers to specify or constrain which items should be associated with which scale dimension. Therefore, the factor in a factor analysis represents a statistical construct, but it cannot be thought of as a psychological construct. For assessing scale dimensionality, however, the ability to specify which items relate to which constructs is important. What, then, is a feasible alternative? The answer can be found in the works of Likert. Likert (1932) listed key assumptions of summated rating scales, ones that must be met before a scale can be applied to examine group differences. Using a battery of descriptive analytics, Ware and Gandek (1998) showed how to test those Likert scale assumptions for measures used in life sciences. We adapt that method and propose it as an alternative approach to SEM for establishing cross-national applicability of consumer behavior measures, especially where sample sizes are relatively small. The procedure involves examining a measure at the item level, as well as at the scale level. If the results are supportive, and if they are consistent across countries, then the underlying measure will have cross-national applicability. The outline of this alternative approach to SEM is shown in Fig. 1 in the Appendix.

Missing and out-of-range values.
If a large amount of data is missing, then it is impossible to measure the underlying concept with confidence. Instead, it indicates the possibility that the respondents did not understand how to respond to the scale items, or the likelihood that they had difficulty with the wording of the scale items.
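
As a rough illustration of this first check, the following Python sketch computes the percentage of missing or out-of-range responses for each item. The data layout (one list per respondent, with None marking a missing answer), the function name, and the 1 to 7 valid range are assumptions made for illustration only; they are not part of the procedure described by Ware and Gandek (1998).

```python
def pct_missing_or_invalid(rows, n_items, lo=1, hi=7):
    """Percent of responses per item that are missing (None) or outside lo..hi."""
    n = len(rows)
    result = []
    for j in range(n_items):
        bad = sum(1 for r in rows if r[j] is None or not (lo <= r[j] <= hi))
        result.append(100.0 * bad / n)
    return result


# Hypothetical responses from four respondents to three 7-point items
rows = [
    [7, 5, None],
    [6, 9, 4],       # 9 is out of range for a 7-point item
    [5, 4, None],
    [None, 6, 2],
]
pct = pct_missing_or_invalid(rows, n_items=3)
for j, p in enumerate(pct):
    print(f"item {j}: {p:.1f}% missing or out of range")
```

Running the same function separately on each country's sample would show whether problematic items are concentrated in particular countries.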
2.1.2 Floor and ceiling effects.
These effects indicate whether all of the response choices of a scale are used, or whether consumers adopted particular response styles, such as consistently checking only the scale end points. When consumers exhibit different response styles in different countries, it is impossible to determine whether or not they actually differ on the underlying concept, making any further cross-national comparisons futile. The floor effect indicates the dis-acquiescence response style, that is, the tendency of respondents to strongly disagree with a statement (i.e., choose a response of 1 or 2 on a 7-point scale item). The ceiling effect indicates the acquiescence response style, that is, it shows what percent of respondents strongly agreed with a statement (i.e., selected the options of 6 or 7 on a 7-point scale). If extreme response styles are more prevalent in some countries than in others, then the researcher is confronted with response style bias in those countries. It then becomes impossible to determine real differences across countries on the underlying construct.

Item means.
According to Ware and Gandek (1998), under traditional Likert scaling criteria, means of individual scale items should be fairly equal. This will usually be the case if all scale items are tapping the same part of the construct domain. However, if the researcher uses different items to tap different aspects of the concept domain, then it is possible and acceptable to have nonequivalent item means for some of the scale items. More importantly, while the average score of individual scale items is expected to vary with the level of the underlying construct (e.g., high, medium, or low level of vanity) and with the populations sampled, the placement order of item means and the approximate differences between them should not vary across countries. If the placement order of item means of a scale were to vary significantly across countries, it would call into question the cross-national applicability of the corresponding measure.
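
One possible way to quantify whether the placement order of item means is similar in two countries is a Spearman rank correlation between the two sets of item means. This is our own illustrative choice rather than a step prescribed by the procedure above, and the item means below are hypothetical. The sketch uses only the Python standard library.

```python
from statistics import mean


def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical item means for five items of one scale in two countries
means_a = [5.8, 4.9, 3.1, 4.2, 4.5]
means_b = [6.1, 5.3, 3.9, 4.4, 5.0]
rho = spearman(means_a, means_b)
print(f"rank-order agreement of item means: {rho:.2f}")
```

A value near 1 would suggest that items are ordered similarly in both samples even when the overall level of the means differs; with only a handful of items per scale, the coefficient should be read descriptively.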

Item variance.
The best items of a measure are those that exhibit significant variability, as they play a useful role in detecting differences across consumer groups in terms of the underlying concept. Also, variances (or standard deviations) of individual items should be roughly equal. However, it is not always possible to obtain high item variances. In such cases, it may still be appropriate to include an item as part of a measurement scale, if the reason for its inclusion is to tap an important part of the construct domain that is not being measured by other scale items.

Multi-trait/multi-item correlation matrix.
The multi-trait item-scale correlation matrix provides information to test a number of scale assumptions that affect cross-national applicability of a measure: internal consistency, equality of item-scale correlations, and external consistency.

Item internal consistency.
Item internal consistency is the correlation of each scale item with the composite or sum score of all the remaining items in the scale. It can be deemed satisfactory if an item correlates 0.40 or more with its hypothesized scale. For items whose correlation with the sum score is below 0.40, whether or not to delete them from the scale depends on how critical they are to capturing the true domain of the underlying concept.
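
The item internal consistency check can be sketched as follows: each item is correlated with the sum of the remaining items of its scale, and items falling below 0.40 are listed for closer inspection. The response data and function names are hypothetical and serve only to illustrate the computation.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows):
    """Correlation of each item with the sum of the remaining items in the scale."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]   # composite without item j
        result.append(pearson(item, rest))
    return result


# Hypothetical responses (rows = respondents, columns = items of one scale)
rows = [
    [7, 6, 7, 2],
    [5, 5, 6, 6],
    [3, 4, 3, 3],
    [2, 2, 1, 5],
    [6, 7, 6, 4],
]
r_it = corrected_item_total(rows)
weak = [j for j, r in enumerate(r_it) if r < 0.40]
print("item-rest correlations:", [round(r, 2) for r in r_it])
print("items below 0.40:", weak)
```

In this made-up example only the last item falls below the 0.40 criterion; whether such an item is dropped remains a substantive judgment, as noted above.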

Equality of item-scale correlations.
Items in a scale should have fairly similar correlations with the composite or sum score of the scale. Otherwise, a low correlation implies that the corresponding scale item does not contribute an equal proportion of information to the associated measure. Items that do not contribute enough information should, then, be excluded. The remaining items should all be given the same weight when computing the composite or sum score. There is strong empirical support for this practice so long as items have approximately equal correlations with their target scales (Armor, 1974; Ware and Gandek, 1998). However, when all items contribute significantly to the total score, this standard of equality of item-scale correlations can be considered satisfied, even if item-scale correlations vary (from 0.40 to 0.70 or more).

Item external consistency (i.e., item discriminant validity).
It is not enough to show that an item measures the concept it is supposed to measure. For discriminant validity, it is also important to show that the item does not correlate highly with measures of other concepts. The multi-item multi-trait correlation matrix can be used to compare the correlation of an item with its hypothesized scale to the correlation of the same item with all the other scales. For external consistency or discriminant validity, the correlation of a scale item with composite scores of other scales should be smaller than the correlation of the scale item with other items of the same concept.
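
A minimal sketch of this comparison is given below. For each item it computes the correlation with the composite of its own scale (leaving the item itself out of the composite) and with the composites of the other scales, and then checks that the own-scale correlation is the largest. The two-scale layout, the scale labels, and the data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_scale_matrix(rows, scales):
    """Return {item_index: {scale_name: r}} for every item and every scale.

    `scales` maps a scale name to the column indices of its items. When an
    item is correlated with its own scale, the item is left out of the
    composite so that it is not correlated with itself.
    """
    out = {}
    for cols in scales.values():
        for j in cols:
            item = [r[j] for r in rows]
            out[j] = {}
            for name, cols2 in scales.items():
                comp = [sum(r[c] for c in cols2 if c != j) for r in rows]
                out[j][name] = pearson(item, comp)
    return out


# Hypothetical data: columns 0-1 belong to scale "PC", columns 2-3 to "AC"
rows = [
    [7, 6, 2, 3],
    [6, 7, 5, 5],
    [2, 3, 6, 7],
    [3, 2, 3, 2],
    [5, 5, 7, 6],
    [1, 2, 4, 4],
]
scales = {"PC": [0, 1], "AC": [2, 3]}
matrix = item_scale_matrix(rows, scales)
own_scale = {j: name for name, cols in scales.items() for j in cols}
discriminates = {
    j: all(matrix[j][own_scale[j]] > r
           for name, r in matrix[j].items() if name != own_scale[j])
    for j in matrix
}
print(discriminates)
```

Items for which the flag is False would be candidates for closer inspection, since they relate at least as strongly to another scale as to their own.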

Item similarities.
An even more systematic way of assessing scale dimensionality is by examining item similarities. Similarity coefficients are derived from the statistical consequences of uni-dimensionality, so they provide a better way of evaluating scale dimensionality, as compared to exploratory factor analysis (Steenbergen, 2000). The greater the similarity between two items as measured by the similarity coefficients, the better those items fit the classical test model, and the more valid the conclusion is that they form a uni-dimensional scale.

2.3. Assessing overall scale quality based on scale-level analytics.

Scale-level descriptive analytics.
After item-level analytics in each country, the next step is to apply scale-level analytics. This involves comparing scale means, standard deviations, and floor and ceiling values. In the event significant differences are found in mean scale scores across cross-nationally comparable samples, one should perform further analysis to determine if the differences are due to translation problems or to country-specific differences on the underlying construct. Similar to the expectation that individual items should have high variance, scale scores should also have high variability. As explained by Ware and Gandek (1998), this requirement is even more crucial for scale validity.

Scale internal consistency.
The average of all inter-item correlations within a scale points to the internal consistency of a measure. A minimum reliability level of 0.70 has been suggested (Nunnally and Bernstein, 1994) for acceptable scale reliability.
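
The text does not name a specific reliability coefficient. Assuming that an internal consistency estimate such as Cronbach's alpha is intended, a minimal sketch of its computation from raw item responses is shown below. The data are hypothetical, and the function name is ours.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of sum score)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses (rows = respondents, columns = items of one scale)
rows = [
    [7, 6, 7],
    [5, 5, 6],
    [3, 4, 3],
    [2, 2, 1],
    [6, 7, 6],
]
alpha = cronbach_alpha(rows)
print(f"alpha = {alpha:.2f}, meets 0.70 benchmark: {alpha >= 0.70}")
```

Computing the coefficient separately for each scale in each country allows the 0.70 benchmark to be checked sample by sample.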

Correlations between scales.
To evaluate how distinct each scale is from other scales, correlations among all scales are computed and compared with reliability estimates (Campbell and Fiske, 1959), where the reliability coefficient can be viewed as a correlation between a scale and itself. To the extent that the correlation between two different scales is less than their respective reliability coefficients, there is evidence that each of the two scales possesses discriminant validity. In contrast, when the correlation between two scales is close to their respective scale reliabilities, those scales lack discriminant validity. Instead, they can only be viewed as alternate form measures of the same concept. The scale reliabilities and inter-scale correlation, thus, help us determine whether or not the two scales have discriminant validity.
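
The comparison just described can be expressed as a simple rule. In the sketch below, a pair of scales is flagged when their correlation comes within a chosen margin of the lower of the two reliability coefficients. The margin of 0.10 and the numerical values are hypothetical; the text only states that the correlation should not be "close to" the reliabilities, so the threshold is a judgment call.

```python
def lacks_discriminant_validity(r_between, rel_a, rel_b, margin=0.10):
    """Flag two scales whose correlation approaches their reliabilities.

    `margin` is an illustrative assumption, not a published cut-off.
    """
    return abs(r_between) >= min(rel_a, rel_b) - margin


# Hypothetical reliabilities and inter-scale correlations
print(lacks_discriminant_validity(0.35, rel_a=0.85, rel_b=0.82))  # distinct scales
print(lacks_discriminant_validity(0.80, rel_a=0.85, rel_b=0.82))  # possibly the same concept
```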
In sum, if the battery of descriptive analytics provides consistent results cross-nationally as described above, then there is support for cross-national applicability of the underlying measure. Individual items of the scale can then be summed and the composite score used to examine cross-national mean differences. In the next section, we describe the vanity measure and our cross-national data set to illustrate how to assess cross-national validity of a measure.

Method
The vanity measure is a scale consisting of 21 scale items. They collectively assess four distinct yet related concepts of vanity (Netemeyer, Burton and Lichtenstein, 1995). The various scale items of this measure were obtained using the Likert scale. Figure 2 shows a description of the vanity scale. This scale was applied in four countries: the United States, New Zealand, China, and India. While the U.S. and New Zealand are developed countries representing Western cultures, China and India are developing countries that represent Eastern cultures. An average of 100 respondents per country completed the vanity scale. While the survey in China was administered in Chinese, the English version was administered in the other three countries. An appropriate translation procedure was employed to convert the original English version of the survey into Chinese. Across all four countries, young adults with similar educational backgrounds completed the survey. As opposed to random samples, comparable samples such as young adult samples are necessary to facilitate cross-national comparisons (Appendix, Fig. 2).

Empirical illustration
For illustration purposes, we followed the descriptive analytics procedure as described in Figure 1 to assess cross-national applicability of the vanity scale. Following is a summary of the results. As shown in Table 1, across the four samples, the percentage of missing values is generally very small for all vanity scale items. Only for 3 of the 21 items (AC1, AC2, and AV4), and only in the Indian sample, did the missing value percentage exceed 5%.

Item variance.
In general, items exhibited greater variability in India and China as compared to the U.S. and New Zealand. For an item measured on a 5-point rating scale, it is desirable to have a standard deviation of about 1. So, for a 7-point rating scale, this value should be above 1. Compared to this recommended value, the standard deviation is somewhat low for 3 (out of 21) items in the U.S. sample. However, for 13 of the items in the U.S. sample, the standard deviation is well in excess of 1.

Mean values.
When making cross-national comparisons, we should look for similarity of item means and whether those values are ordered in a roughly similar fashion across the samples. An inspection of item means shows that, for the physical view dimension, PV1 has the highest mean and PV3 has a fairly low mean value across the four samples. As for the physical concern and achievement concern dimensions, all item means are above 4. The only exception is the mean score for PC3 in the U.S. sample. Another similarity across the samples is that among items representing achievement view, AV4 and AV5 have the smallest means. In sum, there appears to be a semblance of order among item means across the four samples.

Item floor and ceiling values.
The term "floor" represents selection of the lowest response category (e.g., strongly disagree), whereas the term "ceiling" represents selection of the highest response category (e.g., strongly agree). High ceiling values suggest the possibility of acquiescence response bias. A high ceiling value coupled with a high floor value suggests the possibility of extreme response style bias. For 7-point rating scales, when responses are distributed evenly across the categories, we would expect about 14% (one in seven) of respondents to select each response category. So, floor or ceiling values far in excess of 14%, and a combined floor and ceiling value above 28%, raise concern about significant response patterns. Further, sizeable differences in those values across samples imply that scale responses are affected by response style biases.
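
These benchmarks can be applied mechanically, as the following sketch illustrates. It computes floor and ceiling percentages for one item and compares them with the roughly 14% expected per category on a 7-point scale. Because "far in excess" is not quantified above, the multiplier `factor=2.0` is purely a hypothetical choice, as are the function names and the response data.

```python
def floor_ceiling_pct(responses, lo=1, hi=7):
    """Percent of valid responses in the lowest and in the highest category."""
    valid = [x for x in responses if x is not None]
    n = len(valid)
    floor = 100.0 * sum(1 for x in valid if x == lo) / n
    ceiling = 100.0 * sum(1 for x in valid if x == hi) / n
    return floor, ceiling


def response_style_flags(floor, ceiling, n_points=7, factor=2.0):
    """Compare floor/ceiling values with the share expected under even use of categories.

    `factor` operationalizes "far in excess" and is an illustrative assumption.
    """
    expected = 100.0 / n_points            # about 14% on a 7-point scale
    return {
        "high_floor": floor > factor * expected,
        "high_ceiling": ceiling > factor * expected,
        "high_combined": floor + ceiling > 2 * expected,   # about 28%
    }


# Hypothetical responses of ten respondents to one 7-point item
responses = [7, 7, 7, 6, 5, 7, 4, 7, 1, 7]
floor, ceiling = floor_ceiling_pct(responses)
flags = response_style_flags(floor, ceiling)
print(f"floor = {floor:.0f}%, ceiling = {ceiling:.0f}%", flags)
```

Comparing these percentages item by item across country samples would make sizeable differences in response style visible.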
Results of floor and ceiling percentages are provided in Table 1 in the Appendix. It is evident that in India and China, there is a greater likelihood of strongly agreeing with physical concern and achievement concern items. That is why the ceiling percentages are fairly high. However, it can be argued that globalization and the impact of global media have significantly raised concern for physical appearance in India and China. The intense job competition in these two countries, among other factors, is likely to have raised concern for professional achievements. Given such a possibility, high ceiling percentages for physical concern and achievement concern items are perhaps not unusual in India and China. Moreover, the mean responses to items representing physical concern and achievement concern are not significantly higher in these two countries as compared to the U.S. and New Zealand, where extreme responses are not as common. It is for the same reason that the low floor and ceiling percentages in the U.S. and New Zealand, which otherwise would have suggested middle response bias, also present no major measurement issues.

Internal consistency.
As shown in Table 2, across the four samples, the correlations of individual items with their target scales are in excess of .4. The diagonal of the multi-trait multi-method matrix provides this information. While it is not shown in Table 2 for the sake of brevity, the only exceptions are items PC1 and PC4 representing physical concern in China and AC5 representing achievement concern in India, where item correlations with their target scales were below .4.

Equality of item-scale correlations.
Even though correlations of items with their target scales are not equal as per Table 2, for the most part, items in each sample have contributed significantly to their respective scales (i.e., item-scale correlations for target scales are > .4). Hence, as described in section 3.2.2, it can be concluded that all items representing their respective target scales contribute equally to item-scale correlations.

Item external consistency.
Item-scale correlations for non-target scales (i.e., off-diagonal correlations in Table 2) are consistently lower as compared to item-scale correlations for target scales (i.e., correlations that appear on the diagonal). For example, in a general sense, items representing physical concern have higher correlations with the physical concern dimension than with any other dimension. While not presented in Table 2, even for PC1 and PC4 in China, their correlations with the physical concern dimension are higher than their correlations with the other three vanity dimensions. Likewise, AC5 has a higher correlation with its target dimension, achievement concern, as compared to its correlation with the other three vanity dimensions.

Item similarity coefficients.
As shown in Table 3, for all items, the item similarity coefficients, computed using Steenbergen (2014), are higher for target scale dimensions than for non-target scale dimensions. Also, those similarity coefficients are above the recommended benchmark of 0.8 for target scales (cf. Anderson and Gerbing, 1982), providing support for uni-dimensionality for each of the vanity scale dimensions. In sum, the results of various exploratory tests suggest that for all four dimensions of vanity, the various scaling assumptions have been reasonably met across the four samples. Therefore, each of the four vanity dimensions is uni-dimensional. Individual scale items of each vanity dimension can now be summed (or averaged) to form scale composites for cross-national comparison purposes.
The vanity scale has demonstrated cross-national applicability.

Conclusion
Cross-national studies that use Likert scales to measure consumer behavior constructs must first demonstrate that the underlying measures satisfy various assumptions, at the item level and at the scale level, before they can be deemed cross-nationally applicable. Tests for assessing scale applicability are typically carried out using SEM. While SEM is preferred for large samples, and also when prior knowledge is available on scale dimensionality of measures, it is not useful when working with small samples. Yet small samples are often unavoidable in exploratory cross-national research studies. When faced with small samples, researchers can still evaluate scaling assumptions by using a battery of descriptive analytics. In this paper, we have documented key scaling assumptions and how they can be tested with descriptive analytics. We have then demonstrated how to test the various assumptions by using cross-national data on the 4-dimensional vanity measure. Across the four countries, most of our results are supportive of the vanity measure's validity. Applying our suggested analytics procedure is a prerequisite for summing the individual item scores to form scale composites. In the end, if the various scaling assumptions are met in individual countries, then the corresponding measure can be applied cross-nationally to examine consumer mean differences.

Table 1. Item level descriptive statistics
[Table body not recovered. Surviving row label, under "Physical-Concern Items": The way I look is extremely important to me (PC1)]

Table 2. Multi-trait multi-method correlation matrix
Note: Table shows the range of correlations of individual scale items with the scale composite scores of the 4 vanity dimensions: Physical Concern (PC), Physical View (PV), Achievement Concern (AC), and Achievement View (AV). For example, the correlations of the 5 PC scale items with the PC scale composite range from .64 to .72 in New Zealand. The correlations of PC scale items with the PV scale in New Zealand range from .27 to .31.

Table 3. Item similarity coefficients
Note: Table shows the range of similarity coefficients of individual scale items with the 4 vanity dimensions: Physical Concern (PC), Physical View (PV), Achievement Concern (AC), and Achievement View (AV). For example, the similarity coefficients of the 5 PC items with the PC scale composite range from .95 to .96, and with the PV scale those same items' similarity coefficients vary from .68 to .81.

Table 4. Scale level descriptive statistics

Table 5. Reliability coefficients and inter-scale correlations
Note: Internal consistency estimates of reliability are presented on the diagonal and inter-scale correlations (e.g., correlation of physical concern with physical view) are presented off the diagonal.