Consumer susceptibility to front-of-package (FOP) food labeling: Scale development and validation



Introduction
Although retailers change their assortments regularly, the average grocery store carries more than 30,000 items, making it difficult for consumers to make informed decisions about which products to buy (Baum, 2019). Policymakers, food manufacturers, and other entities in the food industry have introduced front-of-package (FOP) labels to help shoppers navigate this dense jungle of product alternatives and simplify their decision-making process (Kühne, Reijnen, Granja, & Hansen, 2022). For example, food manufacturers sometimes use traffic light labels to inform consumers about the nutritional value of their products (i.e., green indicates high nutritional value and red indicates low nutritional value; Golan, Kuchler, Mitchell, Greene, & Jessup, 2001).
Despite efforts to promote certain product choices, FOP food labeling systems do not always work as intended. For instance, some studies demonstrate that introducing traffic light labeling on seafood neither increases sales of green-labeled products nor decreases sales of red-labeled alternatives (Hallstein & Villas-Boas, 2013). Similarly, adding FOP labels indicating healthier choices leads to a mere 1.5-2% increase in the frequency of purchasing these products compared to their less healthy alternatives (Temple, 2020). On the other hand, research shows that nutritional FOP warnings are effective in helping consumers perceive products with unfavorable nutritional profiles as unhealthy (Cabrera et al., 2017). Numerous studies support the effectiveness of nutritional FOP warnings in assisting consumers in making healthier food choices (Ares et al., 2018, 2021). Interestingly, fictitious seafood sustainability labels sometimes increase consumers' willingness to pay for labeled products more than genuine third-party certificates (Sigurdsson et al., 2022). In light of these findings, contradictory at times, what determines whether consumers' preferences and decisions sway in response to the introduction of FOP labeling systems?
Previous research has suggested that certain consumers may be more likely than others to rely on FOP labeling systems when purchasing food (Sigurdsson et al., 2020). Fewer than 10% of consumers say they always or often buy seafood with sustainability labels, and 75% say they never buy such products (Clonan, Holdsworth, Swift, Leibovici, & Wilson, 2012), supporting the notion that there are different segments of consumers in terms of their propensity to use FOP labeling systems when shopping. Yet, it remains unknown why some consumers actively adjust their food preferences and decisions when confronted with FOP labeling, while others appear unaffected by such cues. In other words, the existing literature largely neglects the role of individual differences in consumer

FOP labels and consumers' decision-making
The literature to date has identified numerous external factors associated with consumer preferences and decisions, such as food healthiness, convenience, and price (Prescott, Young, O'Neill, Yau, & Stevens, 2002). Yet internal factors (e.g., food neophobia and disgust), which constitute the focus of the current research, also play an influential role in shaping consumer preferences (Siegrist, Hartmann, & Keller, 2013). In food-related contexts, the literature differentiates between animal reminder disgust and core disgust (Hartmann & Siegrist, 2018). Animal reminder disgust refers to reactions to stimuli like animal corpses or bodily damage, exemplified by statements such as, "It would bother me to be in a science class and see a human hand preserved in a jar." Core disgust, by contrast, pertains to responses to the oral incorporation of offensive items such as monkey meat (Olatunji, Haidt, McKay, & David, 2008). For example, consumers who score higher (vs. lower) on animal reminder disgust and core disgust are less (vs. more) inclined to enter a lottery to win a ticket to an insectarium with a buffet of insect dishes (Hamerman, 2016).
Although the literature has examined the influence of FOP labeling systems on consumer outcomes such as purchase intentions or perceptions of food healthiness (Ikonen, Sotgiu, Aydinli, & Verlegh, 2020), little scholarly interest has been directed toward individual-level predictors of the propensity to use FOP labeling in food-related decision-making contexts. Notwithstanding this general lack of research, the sparse literature on this topic has established, for example, the moderating role of consumers' dominant information processing system on the use of FOP labels in food-related judgments and decision-making, consistent with the dual process theory of mental processing that distinguishes between fast and automatic versus slow and deliberate information processing (Sanjari, Jahn, & Boztug, 2017). However, this account alone cannot explain individual differences in the extent to which consumers rely on FOP labeling systems in their food-related decision-making, because consumers who face difficulties in processing information tend to switch from automatic to deliberate processing (Alter, Oppenheimer, Epley, & Eyre, 2007). This, in turn, implies that fast versus slow information processing depends on situational factors rather than on predispositions alone. Thus, the dual process model of mental operations does not explain why some consumers rely more on FOP labels for their food choices than others, as recent studies suggest (cf. Sigurdsson et al., 2020).

Consumer susceptibility to FOP labeling
Building on previous research indicating that consumers vary in their susceptibility to external factors, such as the opinions of others, when making decisions (Bearden, Netemeyer, & Teel, 1989), we propose that shoppers also vary in their susceptibility to FOP labeling systems. We formally define consumer susceptibility to FOP labeling as "the extent to which consumers rely on FOP labeling in their food choices and information search activities to meet their existing preference patterns." Because susceptibility to FOP labeling is a trait variable, its values should be relatively stable over time, with only minor shifts due to external factors such as the complexity of FOP labeling (cf. Alter et al., 2007). We expect that susceptibility to FOP labeling systems is related to other trait variables that moderate reliance on heuristics rather than deliberate thinking. Hence, individual differences in thinking styles (analytic vs. holistic thinking; Nisbett, Peng, Choi, & Norenzayan, 2001) may be an example of variables related to our focal construct.
In what follows, we develop a scale to capture our trait-level variable of consumer susceptibility to FOP labeling and examine the psychometric properties of this newly developed instrument. In Study 1a, we develop an initial pool of items, which is then tested for face validity in Study 1b, leading to the refinement and reformulation of certain items. In Study 2, we truncate the scale and evaluate its psychometric properties. In Studies 3a-b, we test the convergent and divergent validity of the scale and conduct a test-retest reliability analysis. Finally, in Study 4, we examine the criterion validity of the consumer susceptibility to FOP labeling (CSFL) scale to ensure that it predicts willingness to buy products as a function of FOP labeling systems used when presenting food products.

General method
In analyzing the psychometric properties of the scale, we chose to use nonparametric item response theory (NIRT) instead of the traditional factor analysis approach, followed by the specification of Cronbach's alpha. There were several reasons for this analytic decision. Item response theory (IRT), either parametric or nonparametric, offers distinct advantages over factor analysis. Unlike factor analysis, which assumes linear relationships and focuses on correlations between items, IRT models the performance of individual items based on their underlying latent properties (Embretson & Reise, 2013; Sijtsma & Molenaar, 2002). As such, IRT provides more detailed item-level diagnostics, allowing for greater understanding and refinement of individual items, which is not possible with factor analysis (Bortolotti, Tezza, de Andrade, Bornia, & de Sousa Júnior, 2013; Dima, 2018). In addition, IRT is more robust when handling missing data (Hambleton & Swaminathan, 2013).
NIRT also offers several advantages over parametric item response theory (PIRT). PIRT models assume that the latent trait is normally distributed and that item response functions (IRFs) follow a particular parametric form, whereas NIRT models make fewer assumptions about the distribution of the latent trait and the shape of the IRFs and are therefore more robust to violations of parametric assumptions (Dima, 2018; Embretson & Reise, 2013; Sijtsma & Molenaar, 2002). This flexibility makes NIRT well suited to the complexity of real-world data and a preferred choice for datasets that cannot meet strict assumptions; it also allows the relationship between the latent trait and the item response to be represented more freely (Gibbons & Chakraborti, 2014). Further, NIRT can be used with a broad range of item types, including polytomous items (items with more than two response options), graded response items (items with ordered response options), and rating scale items (De Ayala, 2013). Nonparametric models also require smaller sample sizes than their parametric equivalents, making the process of scale development more efficient (Chen et al., 2014).
We began assessing the psychometric properties of the scale by applying a Mokken scale analysis (MSA) to examine its unidimensionality, monotonicity, local independence, and invariant item ordering (Van der Ark, 2007, 2012). Scale unidimensionality means that a set of items can be located on a continuum of a latent variable. The monotonicity criterion means that the probability of affirming a particular response does not decrease as the values of the latent variable increase. Local independence conveys the notion that the items are related only by the latent variable they are intended to measure. Finally, invariant item ordering is characterized by the items or responses being arranged similarly for different levels of the latent variable (Dima, 2018; Sijtsma & Hemker, 1998; Sijtsma & Junker, 1996). Whereas a set of items can be considered a scale if the first three criteria are met, which are essential for a monotone homogeneity model, an invariant item ordering is necessary to meet the criteria for the double monotonicity model (Mokken, 1971; Van der Ark, 2007, 2012).
Unidimensionality was formally assessed by examining the homogeneity coefficients H, ranging from 0 to 1, with 0 indicating no association between items and 1 indicating a perfect association between them, i.e., a strongly unidimensional scale (Van der Ark, 2007). To further investigate whether the scale was unidimensional, we applied the automated item selection procedure (AISP), which positions items in scales with increasing degrees of homogeneity (Hemker, Sijtsma, & Molenaar, 1995). The remaining scale properties, i.e., monotonicity, local independence, and invariant item ordering, were evaluated using tools from the mokken package for R (Van der Ark, 2007).
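As an illustration only, the scale-level homogeneity coefficient can be sketched as the ratio of the summed observed inter-item covariances to the summed maximum covariances attainable given each item's marginal distribution. The analyses reported here were run with the mokken package in R; the Python below is not that code. The function names (covariance, max_covariance, scale_h) and the toy data are hypothetical, and the sketch omits standard errors and the AISP.

```python
from itertools import combinations


def covariance(x, y):
    """Population covariance of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def max_covariance(x, y):
    """Largest covariance attainable with the observed marginals.

    Pairing the sorted scores of both items gives the comonotone
    arrangement, which maximizes the covariance for fixed marginals.
    """
    return covariance(sorted(x), sorted(y))


def scale_h(items):
    """Scale-level H: summed observed over summed maximum covariances.

    `items` is a list of items, each a list of scores for the same
    respondents in the same order. Every item needs some variance,
    otherwise the denominator is zero.
    """
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(items, 2):
        observed += covariance(x, y)
        maximum += max_covariance(x, y)
    return observed / maximum


# Hypothetical toy data: three items answered by five respondents.
toy_items = [[1, 2, 3, 4, 5], [1, 1, 2, 3, 5], [2, 2, 3, 4, 4]]
print(round(scale_h(toy_items), 2))
```

In this toy example every respondent who scores higher on one item also scores at least as high on the others, so the observed covariances equal their maxima; real questionnaire data would typically yield lower values.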
All analyses were performed in R version 4.2.2 (R Core Team, 2022). Studies were coded in the free-to-use and publicly available PsyToolkit (Stoet, 2010, 2017). This project has been peer-reviewed and deemed low risk (Massey University Human Ethics Notification number: 4000023879). All participants accepted informed consent forms before participating in the studies described below. We used the same exclusion criteria across Studies 2-4.

Item generation (Study 1a)
The goal of Study 1a was to develop an initial pool of items for further evaluation. Consistent with recommendations (e.g., Boateng, Neilands, Frongillo, Melgar-Quiñonez, & Young, 2018), we developed several times more items than the number of items in the desired final scale and asked laypeople from the target population to suggest additional items.

Participants and procedure
Participants in Study 1a were the paper's four authors, who have extensive experience in research on FOP labeling systems. We additionally recruited 10 US participants from Prolific Academic who declared no dietary restrictions (Mean age = 29.5, SD = 6.9 years, four females, five males, one participant did not identify with either of the two genders, mean annual income = 44,700 USD), with this sample combination (14 participants in total) following best practices in scale development research, considering that it usually results in broadening the coverage of a target construct (Boateng et al., 2018). In estimating the desired sample size, we aimed to double the number of items developed by the researchers. Therefore, 8-10 additional participants were deemed suitable for this task, as this number exceeds common sample size conventions at the initial stage of the scale development process in food-related contexts (e.g., Hemmer, Hitchcock, Lim, Kovacic, & Lee, 2021).
The researchers initially generated 25 items. After becoming familiar with the definition of our focal construct, as described in the introduction, the 10 additional Prolific participants were then asked to "write three short and positively worded sentences below that measure the importance of food labeling/certification to consumers when purchasing food," resulting in an additional 30 items. Subsequently, this combined pool of 55 items was independently evaluated by four raters (i.e., the researchers involved in this project) on a four-point scale for relevance to the construct of interest (1 = Not at all relevant; 4 = Very relevant).

Results and discussion
To assess the degree of agreement between raters, we calculated an intraclass correlation coefficient (ICC; Shrout & Fleiss, 1979). Results from a mean-rating, consistency, two-way mixed-effects model indicated good reliability of ratings between the four raters, ICC = 0.86, 95% CI [0.78, 0.91] (Koo & Li, 2016). The raters estimated the mean relevance of these 55 items to the focal construct as slightly below the scale midpoint (M = 2.42). At this stage, items whose relevance to the focal construct was rated at or below the median were removed from the item pool, leaving 21 items for further evaluation. Subsequently, we combined five items that referred to specific product categories (e.g., "When buying groceries, I seek labeled/certified seafood") into one generic item that referred to food more broadly, resulting in 17 items for the follow-up evaluation in Study 1b.
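For readers unfamiliar with this ICC form, the mean-rating, consistency, two-way mixed-effects coefficient (often written ICC(3,k) in the Shrout and Fleiss notation) can be sketched from a two-way ANOVA decomposition as (MS_rows - MS_error) / MS_rows. The Python below is a minimal sketch under that assumption, not the code used in this study; the function name icc_3k and the example ratings are hypothetical, and confidence intervals are omitted.

```python
def icc_3k(ratings):
    """Consistency ICC for the mean of k raters (two-way mixed model).

    `ratings` is a list of rows, one per rated item, each holding the
    ratings given by the same k raters in the same order.
    """
    n = len(ratings)          # number of rated items (rows)
    k = len(ratings[0])       # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Hypothetical ratings: four items rated by two raters on a 1-4 scale.
example = [[1, 2], [2, 2], [3, 4], [4, 3]]
print(round(icc_3k(example), 2))
```

Because this is the consistency form, a rater who is systematically stricter or more lenient than the others does not lower the coefficient; only disagreement about the relative ordering of items does.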

Scale construction and face validity (Study 1b)
Study 1b was conducted to further evaluate the 17 items developed in Study 1a and select the most relevant items to the construct of interest. Participants from the scale's target population also rated the comprehensibility and grammatical correctness of the items, providing input for possible item reformulation and refinement.

Participants and procedure
In Study 1b, we recruited 36 US participants from Prolific Academic who declared no dietary restrictions (Mean age = 38.2, SD = 14.4, 47.2% females, 52.8% males, mean annual income = 46,600 USD). Similar sample sizes have been used in other face validity studies recruiting non-experts online (e.g., Umanath & Coane, 2020).
Participants were first informed of the purpose of the study and presented with an example of an FOP food label along with our formal definition of consumer susceptibility to FOP labeling. They then rated how relevant the 17 items were to the construct of interest on a four-point scale ranging from 0 (Not at all relevant) to 3 (Very relevant).
Next, they indicated how easy these items were to understand on a five-point scale ranging from 0 (Very Difficult) to 4 (Very Easy).Finally, they indicated on a binary scale whether the items were grammatically correct (0 = No; 1 = Yes) and provided demographic information.

Results and discussion
We calculated the level of agreement between raters as in Study 1a and found good reliability of ratings in terms of relevance, ICC = 0.88, 95% CI [0.79, 0.95] (Koo & Li, 2016). Raters estimated the mean relevance of the 17 items to the construct of interest as just above the scale midpoint (M = 1.53). Again, we removed the items whose relevance to the focal construct was below the median. In this way, nine items remained for further testing, which were rated as relatively easy to understand (M = 3.33) and grammatically correct (M = 0.90). We reworded one item from "I believe that companies should always strive to offer labeled food" to "I believe that companies should always offer labeled food." Study 1b thus yielded a final pool of nine items for formal psychometric assessment in Study 2.
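The median-based retention rule can be sketched as follows. This is an illustrative Python sketch rather than the procedure's actual code; the function name retain_items, the keep_at_median flag, and the example data are hypothetical. With keep_at_median=True only items below the median are dropped (the rule described for Study 1b), whereas keep_at_median=False also drops items at the median (the rule described for Study 1a).

```python
from statistics import mean, median


def retain_items(ratings_by_item, keep_at_median=True):
    """Return the items whose mean relevance passes the median cutoff.

    `ratings_by_item` maps an item label to the list of relevance
    ratings that item received.
    """
    item_means = {item: mean(r) for item, r in ratings_by_item.items()}
    cutoff = median(item_means.values())
    if keep_at_median:
        return [item for item, m in item_means.items() if m >= cutoff]
    return [item for item, m in item_means.items() if m > cutoff]


# Hypothetical data: 17 items with distinct mean relevance ratings.
example = {f"item_{i}": [i, i + 1, i + 2] for i in range(17)}
print(len(retain_items(example)))
```

With 17 items and no ties, dropping only the items below the median leaves nine items, which matches the count reported above.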

Scale refinement (Study 2)
Study 2 was the first formal psychometric evaluation of the selected nine items. Here we performed a Mokken scale analysis (Mokken, 1971; Van der Ark, 2007, 2012) and reported conventional internal consistency reliability indices.

Participants and procedure
Although there is no consensus in the literature on an appropriate sample size for scale development studies, it is generally recommended to recruit at least 300 participants at the scale refinement stage, with more participants reducing measurement error, and longer scales generally requiring more participants than shorter ones (Boateng et al., 2018). We therefore recruited 558 US participants from Qualtrics who declared no dietary restrictions. Given recent concerns about data quality on some crowdsourcing platforms (Eyal, David, Andrew, Zak, & Ekaterina, 2021), we used the recommended multistep approach to filter responses (Craft, Tegge, Freitas-Lemos, Tomlinson, & Bickel, 2022). We found that 118 participants failed either the "captcha" question or an attention check question, described in more detail below. Further manual review of the data revealed that 15 participants gave nonsensical answers when asked to estimate the population of their country, such as "more" or "na." These 133 participants, or 23.8% of the original sample, were excluded from analysis to ensure high data quality. Thus, the final sample consisted of 425 participants (Mean age = 47.1, SD = 16.2, 51.8% females, 48.0% males, 0.2% did not identify with either of the two genders, mean annual income = 48,200 USD).
Participants first learned about FOP labeling and were shown an example of such a label used in the food industry. They then indicated their agreement with the nine items selected in Study 1b on a nine-point scale ranging from 1 (Disagree strongly) to 9 (Agree strongly). Within this pool of items, which were presented in a random order, we embedded an attention check question ("After reading this question, choose the number seven using the slider below"). After completing this questionnaire, participants provided their demographic data and free-text responses to another attention check question that asked them about the approximate population size of their country. Finally, they completed a third and final attention check question asking them to indicate the number of bicycles they saw in a "captcha" image.
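The three screening steps can be sketched as a simple filter. This is only a sketch: the field names (captcha_correct, slider_value, population_text) are hypothetical, and the keyword heuristic merely stands in for the manual review of free-text answers described above. The last lines recompute the exclusion figures reported for this study from the stated counts.

```python
import re


def population_answer_plausible(text):
    """Crude stand-in for manual review of the free-text answer.

    Assumption: a plausible answer contains a digit or a magnitude word.
    """
    cleaned = text.strip().lower()
    has_digit = re.search(r"\d", cleaned) is not None
    return has_digit or "million" in cleaned or "billion" in cleaned


def passes_screening(response):
    """Apply the captcha, slider, and free-text checks to one response."""
    return (
        response["captcha_correct"]
        and response["slider_value"] == 7
        and population_answer_plausible(response["population_text"])
    )


# Counts taken from the text: recruited, failed checks, nonsensical answers.
recruited, failed_checks, nonsensical = 558, 118, 15
excluded = failed_checks + nonsensical
share_excluded = round(100 * excluded / recruited, 1)
final_sample = recruited - excluded
print(excluded, share_excluded, final_sample)
```

The recomputed values agree with the reported 133 exclusions, 23.8% of the original sample, and a final sample of 425.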

Results and discussion
We began the psychometric evaluation of the nine items with a Mokken scale analysis, which revealed that these items formed a homogeneous and strong scale, H = 0.54, SE = 0.03 (Mokken, 1971; Van der Ark, 2007). We then used the automated item selection procedure algorithm to test whether these items were scalable with increasing homogeneity thresholds. The recommended minimum homogeneity threshold for items included in a scale is 0.30 (Hemker et al., 1995). However, we used a more conservative approach and increased our lower-bound homogeneity threshold to 0.50 (for a similar approach, see, e.g., Folwarczny, Li, et al., 2021). Two items did not meet this lower-bound homogeneity threshold and were removed from subsequent analysis.
We recalculated the homogeneity coefficient for the revised set of seven items and found that these items formed a homogeneous and strong scale, H = 0.60, SE = 0.03 (Mokken, 1971; Van der Ark, 2007). None of the items violated the criteria of monotonicity, local independence, or invariant item ordering; hence, all criteria for a double monotonicity model were met (Mokken, 1971). The internal consistency reliability of this scale was excellent, α = 0.90, ω = 0.91, 95% CI [0.89, 0.93], mean inter-item correlation = 0.57. Item means ranged from 6.60 to 7.71, and item standard deviations ranged from 1.32 to 2.01. Participants' mean scale scores ranged from 1.57 to 9.00, indicating that scores spanned a broad spectrum of possible responses.
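To make the reliability indices concrete, Cronbach's alpha can be sketched from item and total-score variances, and its standardized form from the number of items and the mean inter-item correlation. The Python below is a sketch, not the analysis code (the analyses were run in R); the function names and toy data are hypothetical. The standardized formula equals raw alpha only when item variances are equal, so it serves here as a rough plausibility check.

```python
from statistics import variance


def cronbach_alpha(items):
    """Raw Cronbach's alpha.

    `items` is a list of items, each a list of scores for the same
    respondents in the same order.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    summed_item_variance = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - summed_item_variance / variance(totals))


def standardized_alpha(k, mean_r):
    """Standardized alpha from k items and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Plausibility check with the reported values: seven items, mean r = 0.57.
print(round(standardized_alpha(7, 0.57), 2))

# Hypothetical toy data: three items answered by four respondents.
toy_items = [[1, 2, 3, 4], [2, 2, 3, 5], [1, 3, 3, 4]]
print(round(cronbach_alpha(toy_items), 2))
```

With seven items and a mean inter-item correlation of 0.57, the standardized formula gives a value that rounds to 0.90, in line with the alpha reported above.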
Study 2 showed that seven items met the criteria for a double monotonicity model and formed a strong scale (Mokken, 1971). This scale was further evaluated in Study 3a to ensure that the seven items measure the construct of interest (see Table 1 for the final set of items included in the CSFL scale).

Convergent and discriminant validity (Study 3a)
Study 2 resulted in the development of a seven-item scale aimed at capturing consumers' susceptibility to FOP labeling systems (i.e., the CSFL scale). Although the previous study confirmed good psychometric properties of the scale, it remains unknown whether this newly developed instrument actually measures the construct of interest rather than other related constructs. Therefore, in Studies 3a-b, we compared the results of our scale against those obtained from several other instruments to test the convergent and discriminant validity of the CSFL scale.
Given that the construct to be measured by the CSFL scale should be related to thinking styles, pro-environmental consumption values, and impression management tactics, we used the "GREEN" scale (Haws, Winterich, & Naylor, 2014), the Holistic Cognition Scale (Lux et al., 2021), and both the Social Desirability (Stöber, 2001) and Impression Management (Bolino & Turnley, 1999) Scales to test the convergent validity of our instrument. To test the discriminant validity of the scale, we used the Anticipated Food Scarcity Scale, an instrument also developed for food-related contexts but used to measure an unrelated construct (Folwarczny, Li, et al., 2021). In addition, we included a brief measure of the Big Five Personality Traits (Gosling, Rentfrow, & Swann, 2003) as an exploratory measure. Study 3a was the first of two waves of data collection used to examine the test-retest reliability of this instrument. To further test the generalizability of our results, we used a different crowdsourcing platform to collect data in Studies 3a-b than in Study 2 (Prolific Academic rather than Qualtrics).

Note to Table 1. Participants responded on a nine-point scale ranging from 1 (Disagree strongly) to 9 (Agree strongly).

Participants and procedure
Consistent with the considerations from Study 2 for estimating sample size, we attempted to recruit at least 300 participants for the two-wave study. However, because of the expected attrition rate of 15-20% based on our previous longitudinal studies with participants recruited through Prolific Academic, we aimed to recruit at least 15% more participants for Study 3a, such that the final sample in the second wave of data collection (Study 3b) would still be around 300 participants. We recruited 351 US participants from Prolific Academic who declared no dietary restrictions. We used a similar approach to filter responses as previously. Here, 12 participants failed either the "captcha" question or an attention check question. A manual review of the free-text responses resulted in the removal of one additional response, resulting in excluding 13 responses, or 3.7% of the original sample. The final sample consisted of 339 participants (Mean age = 42.3, SD = 14.0, 48.2% females, 50.3% males, 1.5% did not identify with either of the two genders, mean annual income = 45,300 USD).
Participants were first given general instructions about the duration of the study and were informed that this study was the first of two waves of data collection, with 25% more financial rewards for taking part in the second wave.They then completed a series of psychometric instruments presented in a random order, as described below, and finally provided their demographic data.

Consumer susceptibility to FOP labeling
Participants completed the seven-item CSFL scale developed and described in Study 2 using the same nine-point response format. We averaged the responses to create a CSFL index (α = 0.93, M = 6.76, SD = 1.58).

Green consumption values
Green consumption values were measured using the "GREEN" scale, with this construct being defined as "the tendency to express the value of environmental protection through one's purchases and consumption behaviors" (Haws et al., 2014, p. 337).Participants indicated their agreement with six items (e.g., "I am concerned about wasting the resources of our planet") on a seven-point scale ranging from 1 (Strongly disagree) to 7 (Strongly agree).We averaged responses to create a GREEN index (α = 0.93, M = 4.54, SD = 1.34).Because FOP labels often signal product attributes such as sustainability, we expect green consumption values to be positively associated with CSFL.

Preferences for analytic versus holistic thought
The analytic thinking style, typical of Western cultures, revolves around focusing attention primarily on an object rather than the context in which a given object appears, as well as on logical, rule-based thinking; the holistic thinking style, which is typical of East Asian cultures, is characterized by focusing attention primarily on context rather than on a particular object, and by an aversion to the use of categories and formal logic (Nisbett et al., 2001). We measured preferences for holistic versus analytic thinking styles using the 16-item Holistic Cognition Scale¹ (HCS; Lux et al., 2021). Here, participants indicated their agreement with items such as "It is impossible to understand the pieces without considering the whole picture" using a seven-point scale ranging from 1 (Completely disagree) to 7 (Completely Agree). We averaged responses to the four previously established dimensions (Lux et al., 2021) to create an index of attention (α = 0.35, M = 4.95, SD = 0.74), an index of causality (α = 0.59, M = 3.72, SD = 1.01), an index of contradiction (α = 0.47, M = 5.10, SD = 0.79), and an index of change (α = 0.37, M = 3.39, SD = 0.78). We reverse-coded the items in the causality and change subscales; therefore, higher scores in all HCS subscales correspond to more holistic (as opposed to analytic) thinking.
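Reverse-coding and index construction of this kind can be sketched as follows. The sketch assumes the usual rule for a bounded rating scale (reversed score = lowest option + highest option - observed score); the function names and the example responses are hypothetical, and this is not the analysis code.

```python
def reverse_code(score, low=1, high=7):
    """Mirror a rating around the scale midpoint (1 <-> 7, 2 <-> 6, ...)."""
    return low + high - score


def subscale_index(responses, reverse=False, low=1, high=7):
    """Average one respondent's item scores, optionally reverse-coding them."""
    if reverse:
        responses = [reverse_code(s, low, high) for s in responses]
    return sum(responses) / len(responses)


# Hypothetical respondent: four items of a reverse-coded subscale.
print(subscale_index([2, 3, 1, 2], reverse=True))
```

After reverse-coding, a respondent who mostly disagreed with the original item wording receives a high index value, so all subscales point in the same direction.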
Because FOP labels convey information that is generally independent of context (i.e., other attributes of a product, such as its packaging), we expect all HCS subscales to be negatively associated with CSFL (i.e., CSFL should be associated with an analytic rather than a holistic thinking style).

Anticipated food scarcity
Anticipated food scarcity was captured by the Anticipated Food Scarcity Scale (AFSS) and is formally defined as "the perception of future food resources becoming insufficient in terms of availability and accessibility" (Folwarczny, Li, et al., 2021). Here, participants indicated their agreement with eight items (e.g., "Food shortages will occur more frequently") on a seven-point scale ranging from 1 (Disagree Strongly) to 7 (Agree Strongly). Similar to the procedures above, we averaged these responses to create the AFSS index (α = 0.97, M = 4.96, SD = 1.52). We used AFSS because this scale captures food-related outcomes in domains that CSFL was not intended to cover. Therefore, AFSS was used to test the discriminant validity of CSFL, and we predicted a lack of correlation between these two scales.

Results and discussion
We first performed a psychometric evaluation of the CSFL scale in the same manner as in Study 2. The seven items formed a homogeneous and strong scale, H = 0.68, SE = 0.02 (Mokken, 1971; Van der Ark, 2007). We did not observe major violations of the criteria of monotonicity, local independence, or invariant item ordering. Therefore, the criteria for a double monotonicity model were met again (Mokken, 1971). The internal consistency reliability of this scale was excellent, α = 0.93, ω = 0.94, 95% CI [0.92, 0.95], mean inter-item correlation = 0.65. Item means ranged from 6.01 to 7.42, and item standard deviations ranged from 1.47 to 2.28.

¹ Because of the very low reliability estimates of the original four factors, we performed a factor analysis with oblimin rotation to extract two factors corresponding to the analytic and holistic thinking styles. However, because these two factors also yielded low reliability coefficients below α = 0.60, the results of this scale should be interpreted with appropriate caution.

M. Folwarczny et al.
Next, we calculated Pearson correlation coefficients between the CSFL scale index and the other measures included in this study, which are presented in Table 2.
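For reference, the Pearson coefficient between two index variables can be sketched as below; the same computation applies to a test-retest correlation between two waves. The function name pearson_r and the example scores are hypothetical, the values in Table 2 are not reproduced here, and significance testing is omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical index scores for five respondents on two instruments.
csfl_index = [6.1, 7.4, 5.0, 8.2, 6.9]
other_index = [4.0, 5.5, 3.9, 5.1, 4.8]
print(round(pearson_r(csfl_index, other_index), 2))
```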
The CSFL scale index showed negligible and low correlations with the other instruments.Positive, albeit weak, associations between CSFL and green consumption values are warranted because environmentally conscious consumers should generally make more informed choices than their peers who are less interested in such consumption (Haws et al., 2014), and FOP labels often communicate product attributes such as environmental impact and place of origin.Thinking styles, whether analytic or holistic, were weakly associated with CSFL, with no clear pattern of results.Weak and positive correlations between CSFL and all Big Five personality traits are surprising, although the link between CSFL and conscientiousness is grounded in theory, given previous research connecting this trait with an analytic information processing style (e.g., Gadassi, Gati, & Dayan, 2012).Importantly, the lack of a significant association between CSFL and AFSS, the latter measuring food-related outcomes that were not of interest in the development of the CSFL scale (which is also primarily intended for use in the food industry), is evidence of the discriminant validity of CSFL.

Convergent and discriminant validity (Study 3b)
To prevent extensive participant fatigue and thus increase the chances that participants would be attentive to their assigned tasks, we split the data collection into two waves. We also wanted to investigate the test-retest reliability of the CSFL scale. Therefore, participants who had taken part in Study 3a were invited to take part in Study 3b one week after the completion of the first wave of data collection, with both studies including a roughly equal number of items.

Participants and procedure
In Study 3b, we recruited 293 participants from Study 3a's sample, corresponding to an attrition rate of 16.5% and thus a high response rate of 83.5%. Eight participants, or 2.7% of the total sample, failed the "captcha" question and were thus excluded from further analysis. The final sample consisted of 285 participants (Mean age = 43.6, SD = 14.1, 49.5% females, 48.8% males, 1.8% did not identify with either of the two genders, mean annual income = 45,600 USD).
The procedure in Study 3b was similar to that of the earlier study with two exceptions. First, participants completed a different set of psychometric scales in addition to the CSFL scale. Second, we did not collect demographic data this time, as these data had already been collected in the previous wave of data collection. Below, we describe the measures implemented in Study 3b.

Consumer susceptibility to FOP labeling
Participants completed the CSFL scale again, and their responses were averaged to create a CSFL index (α = 0.94, M = 6.86, SD = 1.64).

Social desirability
We measured social desirability, defined as the tendency to give responses that are considered positive and desirable by others, using the SDS-17 scale (Stöber, 2001). Participants indicated whether each of the items (e.g., "I sometimes litter") was descriptive of them (1 = true) or not (0 = false). Following the scale author's instructions, we did not include the item on illegal drug use (Stöber, 2001). Participants' responses to the remaining 16 items were averaged to create an SDS index (α = 0.85, M = 0.57, SD = 0.26). Considering that consumers sometimes buy food to impress others (Folwarczny, Otterbring, & Ares, 2023) and that FOP labels may signal socially desirable qualities, we expected a positive relationship between CSFL and social desirability.

Impression management
Impression management is a process by which people, consciously or unconsciously, strategically shape the way they are perceived by others (Schlenker, 2012). We used a scale developed by Bolino and Turnley (1999) that captures the five facets of impression management reported in the literature: self-promotion, ingratiation, exemplification, intimidation, and supplication. Participants read 22 items and indicated how often they behaved in the ways described (e.g., "Talk proudly about your experience or education"). They did so on a five-point scale ranging from 1 (never behave this way) to 5 (often behave this way). We averaged items corresponding to self-promotion (α = 0.88, M = 2.31, SD = 0.93), ingratiation (α = 0.86, M = 2.91, SD = 1.00), exemplification (α = 0.68, M = 2.20, SD = 0.81), intimidation (α = 0.88, M = 1.42, SD = 0.61), and supplication (α = 0.90, M = 1.62, SD = 0.73) to create corresponding index variables. We expected a positive association between impression management and CSFL for the same reason as described above, i.e., consumers may be keen to buy products with FOP labels to impress others.

Results and discussion
We reevaluated the psychometric properties of the CSFL scale, with the results confirming good psychometric properties of the 7 items, which formed a homogeneous and strong scale, H = 0.74, SE = 0.02, without major violations of the criteria of monotonicity, local independence, and invariant item ordering (Mokken, 1971; Van der Ark, 2007). The internal consistency reliability of this scale was excellent, α = 0.94, ω = 0.95, 95% CI [0.94, 0.96], mean inter-item correlation = 0.70. The scale's means ranged from 6.12 to 7.50, and standard deviations ranged from 1.38 to 2.28. Next, we calculated the correlations between the scores of waves 1 and 2 to examine the test-retest reliability of the scale. We found a strong positive correlation across both waves, r = 0.77, 95% CI [0.73, 0.82], p < .001. The confidence interval around the Pearson correlation coefficient indicates a substantial correlation between the scores from these two data collections (Cohen, 1988, p. 80). Thus, the CSFL scale is stable over time.
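As an illustration of the two reliability statistics reported above, the following minimal sketch computes Cronbach's α for a set of item responses and the Pearson correlation between index scores from two waves. It is not the authors' analysis code: the function names and the toy responses are hypothetical, the data are invented for demonstration only, and the Mokken statistics and ω are not covered.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    item_columns = list(zip(*responses))  # one tuple of scores per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy data: 5 respondents x 7 items in each wave (NOT the study data).
wave1 = [
    [7, 8, 7, 6, 7, 8, 7],
    [4, 5, 4, 4, 5, 4, 5],
    [9, 9, 8, 9, 9, 8, 9],
    [6, 6, 7, 6, 5, 6, 6],
    [3, 4, 3, 3, 4, 3, 3],
]
wave2 = [
    [7, 7, 7, 6, 8, 8, 7],
    [5, 5, 4, 5, 5, 4, 4],
    [8, 9, 8, 8, 9, 9, 9],
    [6, 7, 6, 6, 6, 5, 6],
    [4, 3, 3, 4, 4, 3, 4],
]

# The scale index is the mean of a respondent's item scores.
index_wave1 = [mean(row) for row in wave1]
index_wave2 = [mean(row) for row in wave2]

alpha_wave1 = cronbach_alpha(wave1)
retest_r = pearson_r(index_wave1, index_wave2)
print(f"alpha (wave 1) = {alpha_wave1:.2f}, test-retest r = {retest_r:.2f}")
```

With real data, each inner list would hold one participant's responses to the seven CSFL items, and the two index lists would be matched by participant across the two waves.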
Finally, we calculated Pearson correlation coefficients between the CSFL scale and the other instruments (see Table 3).
Similar to the previous study, we found only weak and negligible correlations between the CSFL scale and scores on the other scales. Positive associations between the CSFL and social desirability and one of the facets of impression management suggest that consumers may be interested in labeled products to create a desirable image in the eyes of others, consistent with recent related research (Folwarczny et al., 2023; Gasiorowska, Folwarczny, Tan, & Otterbring, 2023). Considering that all of these correlations were weak, Study 3b supports the notion that CSFL is associated with the desire to impress others, although it is a qualitatively different construct.

Criterion validity (Study 4)
Studies 2-3b confirmed good psychometric properties of the CSFL scale. The relatively weak correlations between the instrument and other scales analyzed across Studies 3a-b suggest that the scale captures a new construct that is qualitatively different from numerous potentially related constructs. However, the predictive validity of the CSFL scale is yet to be examined. Therefore, in Study 4, we tested whether the CSFL scale predicts willingness to buy (WTB) products with FOP labeling systems. Because previous studies have shown that consumers generally respond similarly to fictitious and genuine labels (Sigurdsson et al., 2020, 2022, 2023), we specifically sought to investigate whether the scale can predict whether consumers with high (vs. low) susceptibility to FOP labeling are more willing to buy products with genuine third-party certificates rather than products with fictitious labels or products without any certification.
We chose fish fillets as the product type to test the predictive power of our newly developed CSFL scale for several reasons. First, previous research has shown that certain FOP labels on fish products have minimal, if any, influence on consumer preferences (Hallstein & Villas-Boas, 2013; Sigurdsson et al., 2022, 2023). Second, consumers often indicate that they value other seafood attributes, such as country of origin, more than FOP labeling when buying fish products (Sigurdsson et al., 2020). Third, although fish is not the predominant protein source among US participants, three-quarters reported consuming fish in the past 30 days (Jahns et al., 2014). Therefore, measuring the predictive validity of the CSFL scale in the context of fish products provides a conservative test of the scale's criterion validity, as FOP labeling is neither the highest rated attribute nor an effective strategy for increasing willingness to pay or daily frequency of consumption when it comes to fish. Habituation effects can therefore likely be ruled out as confounding factors for our potential results (Wathieu, 2004).

Participants and procedure
Estimating the desired sample size for research designs with repeated measures and interactions, where linear mixed models are the preferred analytic approach, is complex. Previous studies using similar research designs in food-related contexts have achieved satisfactory power to detect similar interaction effects as those examined in Study 4 with approximately 100-120 participants (Folwarczny, Christensen, Li, Sigurdsson, & Otterbring, 2021; Folwarczny, Otterbring, Sigurdsson, & Gasiorowska, 2022). Using a conservative approach, we aimed to increase this sample size by 50%. Consequently, we recruited 175 US participants from Prolific Academic who declared no dietary restrictions, two of whom (1% of the total sample) failed the "captcha" question and were excluded from further analysis. The final sample thus consisted of 173 participants (Mean age = 41.8, SD = 13.5, 46.2% females, 53.2% males, 0.6% did not identify with either of the two genders, mean annual income = 51,900 USD).
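The sample-size reasoning above reduces to simple arithmetic, sketched below. The helper name is hypothetical, and the calculation only restates the figures given in the text; it is not a formal power analysis.

```python
def inflate_range(low, high, increase_pct):
    """Scale a sample-size range upward by a whole-number percentage."""
    factor = 100 + increase_pct
    # Integer arithmetic with ceiling division avoids floating-point rounding.
    return -(-low * factor // 100), -(-high * factor // 100)


# Benchmark from earlier studies (about 100-120 participants), increased by 50%.
target_low, target_high = inflate_range(100, 120, 50)

recruited = 175
excluded = 2  # failed the "captcha" question
final_n = recruited - excluded
exclusion_pct = 100 * excluded / recruited

print(f"target range: {target_low}-{target_high}")
print(f"final N = {final_n}, excluded = {exclusion_pct:.1f}%")
```

The recruited sample of 175 falls inside the inflated range, and the two exclusions amount to roughly 1% of those recruited.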
We used a within-subjects design in which participants were asked to complete a series of tasks. The first task was to indicate WTB nine fish fillets on a tray resembling the fillets offered in supermarkets ("How likely would you be to buy each of the 9 fish fillets if they were available in a grocery store that you usually visit."). Participants responded on a 201-point scale ranging from −100 (Very unlikely) to 100 (Very likely). The fillets were presented in random order and differed in their FOP labeling (i.e., some fillets had genuine certificates, some had fictitious labels, and others were unlabeled). Two fillets had the American Heart Association certificate (a genuine, third-party certificate) in the upper-right corner. Two fillets had a B-Corp mark in the upper-right corner, also a genuine, third-party certificate. Two fillets bore a fictitious label in the upper-right corner designed for this study (this label showed a blue fish with the words "CSAP certified" underneath). Finally, the remaining three fillets did not bear any FOP label.
In addition, participants completed the CSFL scale. We created a CSFL index by averaging responses to the items included in the scale (α = 0.94, M = 6.69, SD = 1.67). To avoid possible confounds from the task in which participants reported their WTB fish fillets, we separated these two tasks (i.e., evaluating fish fillets and filling out the CSFL scale) with a filler task. In this filler task, participants reported their demographic data, completed an unrelated, 17-item questionnaire capturing dominance and prestige orientations when seeking status (sample item: "Others do what I ask of them for fear of consequences"; Körner, Heydasch, & Schütz, 2022), and answered the attention check questions, similar to the previously described studies.
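The stimulus set described above can be sketched as a small data structure that turns one participant's nine ratings into long-format rows, one row per fillet, as repeated-measures analyses typically require. All identifiers (fillet IDs, field names, function names) are hypothetical. Grouping the two genuine certificates into one "genuine" level and using unlabeled fillets as the baseline follows the three label types described in the text, while the 0/1 dummy coding itself is an assumption about how such a factor is commonly represented.

```python
from collections import Counter

# Hypothetical IDs for the nine fillets; None marks an unlabeled fillet.
FILLETS = [
    ("fillet_1", "AHA"), ("fillet_2", "AHA"),
    ("fillet_3", "B-Corp"), ("fillet_4", "B-Corp"),
    ("fillet_5", "CSAP"), ("fillet_6", "CSAP"),
    ("fillet_7", None), ("fillet_8", None), ("fillet_9", None),
]

# Both third-party certificates count as "genuine"; CSAP is the fictitious label.
LABEL_TYPE = {"AHA": "genuine", "B-Corp": "genuine", "CSAP": "fictitious", None: "none"}


def dummy_code(label):
    """Return the label type plus 0/1 indicators, with 'none' as the reference."""
    label_type = LABEL_TYPE[label]
    return {
        "label_type": label_type,
        "genuine": int(label_type == "genuine"),
        "fictitious": int(label_type == "fictitious"),
    }


def to_long(participant_id, csfl_index, ratings):
    """Build one row per fillet from a dict mapping fillet ID to a WTB rating."""
    rows = []
    for fillet_id, label in FILLETS:
        wtb = ratings[fillet_id]
        if not -100 <= wtb <= 100:  # the 201-point response scale
            raise ValueError(f"WTB out of range for {fillet_id}: {wtb}")
        rows.append({
            "participant": participant_id,
            "fillet": fillet_id,
            "csfl": csfl_index,
            "wtb": wtb,
            **dummy_code(label),
        })
    return rows


# Invented example ratings for one participant (not study data).
example_ratings = {fillet_id: 10 * i - 40 for i, (fillet_id, _) in enumerate(FILLETS)}
example_rows = to_long("p001", 6.5, example_ratings)
print(Counter(row["label_type"] for row in example_rows))
```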

Results and discussion
Because our data were nested and thus autocorrelations occurred across measurements, we used linear mixed models to perform the analysis with the lme4 and lmerTest packages for R (Bates, Mächler, Bolker, & Walker, 2015; Kuznetsova, Brockhoff, & Christensen, 2017). We added random intercepts for participants and product types (i.e., measurements). As predictors (fixed effects), we added the CSFL index and a label type (a genuine certificate, a fictitious label, and no label, with the "no label" category as the reference category in all analyses). The model also included the interaction term between the two predictors, with WTB fish fillets as the focal dependent variable.
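One way to write the model described above is given below. The notation is ours rather than the authors': i indexes participants, j indexes fillets, G and F are 0/1 indicators for a genuine certificate and a fictitious label (unlabeled fillets being the reference), and u and v are the random intercepts for participants and products. Normally distributed random effects and residuals are the usual assumption for such models and are assumed here.

```latex
\[
\mathrm{WTB}_{ij} = \beta_0 + \beta_1\,\mathrm{CSFL}_i + \beta_2\,G_j + \beta_3\,F_j
  + \beta_4\,(\mathrm{CSFL}_i \times G_j) + \beta_5\,(\mathrm{CSFL}_i \times F_j)
  + u_i + v_j + \varepsilon_{ij},
\]
\[
u_i \sim \mathcal{N}(0, \sigma_u^2), \qquad
v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\]
```

Under this parameterization, the slope of CSFL is \(\beta_1\) for unlabeled fillets, \(\beta_1 + \beta_4\) for fillets with a genuine certificate, and \(\beta_1 + \beta_5\) for fillets with the fictitious label. In lme4 formula notation, a specification of this kind would typically take the form `wtb ~ csfl * label_type + (1 | participant) + (1 | fillet)`, where the variable names are hypothetical.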
In sum, the results of Study 4 show that the CSFL scale predicts WTB fish fillets with genuine, third-party certificates but not fish fillets with fictitious labels or fillets without any FOP labels, thus attesting to the predictive validity of our novel instrument.
Fig. 1 illustrates these key findings.

General discussion
The results from previous studies suggest that different segments of consumers are differentially prone to rely on FOP labeling when purchasing food (Clonan et al., 2012; Sigurdsson et al., 2020). However, no prior scale has been developed to capture consumers' susceptibility to FOP labeling systems. In this paper, we created such a scale. The CSFL scale captures a unidimensional construct and has good psychometric properties, which we confirmed in three studies. The CSFL scale also predicts WTB fish fillets with genuine, third-party certificates but not unlabeled fillets and those with fictitious FOP labels, thereby underscoring its ability to predict what it is assumed to predict. Moreover, given its narrow set of items, the scale is an appropriate instrument for academic research and retail practice.
Whereas some FOP labeling systems are viable tools to achieve desired consumer outcomes (e.g., Ares et al., 2018, 2021; Sigurdsson et al., 2022), the effectiveness of such systems has been called into question (e.g., Hallstein & Villas-Boas, 2013; Temple, 2020). This discrepancy could be due to the omission of one potential moderator: consumers' susceptibility to FOP labeling systems. Thus, applying the newly developed scale to study consumer behavior could lead to more accurate policymaking regarding FOP labeling campaigns aimed at promoting outcomes such as healthier food choices. Indeed, previous research has found that messages tailored to individual differences (as opposed to one-size-fits-all approaches) considerably increase sales of targeted products (Matz, Kosinski, Nave, & Stillwell, 2017). Therefore, identifying consumers who are less susceptible to FOP labeling systems may encourage the development of alternative interventions that better target this segment.
The newly developed CSFL scale shows only weak correlations with other instruments, such as the GREEN scale, designed to capture consumers' propensity to purchase sustainable products (Haws et al., 2014). This finding suggests that the construct examined herein is not merely a ramification of a broader category of pro-environmental tendencies. Indeed, FOP labels vary considerably in terms of the information they provide, and these are not limited to environmentally friendly attributes (Golan et al., 2001; Hallstein & Villas-Boas, 2013; Kühne et al., 2022). Furthermore, the CSFL scale does not appear to be related to thinking styles when treated as analytic versus holistic, as we did not observe a clear pattern of correlations between the CSFL and thinking styles.
We found weak, albeit significant, relationships between the CSFL scale and social desirability as well as one of the impression management strategies, namely ingratiation (Bolino & Turnley, 1999; Stöber, 2001). Hence, it is plausible that the tendency to purchase products with FOP labeling is an impression management strategy used by a certain segment of consumers, which is consistent with previous studies (Gasiorowska et al., 2023). Importantly, CSFL was not associated with AFSS, which measures another food-related outcome (Folwarczny et al., 2023). This finding provides evidence of the scale's discriminant validity.
Study 4 showed that the CSFL was positively associated with WTB fish fillets with genuine FOP labels. We used two third-party certificates here: B-Corp and American Heart Association labeling. However, we found no such association between the CSFL and WTB fillets that did not have FOP labels. There was also no such association for fish fillets with fictitious labels. This result suggests that the newly developed CSFL scale may not only predict consumer interest in FOP labels but may also discriminate between different labels.

Limitations and future research
Similar to all academic research, the current work is not without limitations. First, surprisingly, we found positive correlations between all Big Five personality traits and the CSFL scale. However, we used a very brief measure of the personality traits (Gosling et al., 2003), thus calling for further research using more extensive personality inventories. As indicated by the results in Tables 2 and 3, most of the correlation coefficients we observed were weak or negligible, although we found moderately strong and theoretically justified correlations between the CSFL scale and, for example, green consumption values (Study 3a) and social desirability (Study 3b), according to current conventions of what constitutes small, moderate, and large effects (e.g., Funder & Ozer, 2019; Gignac & Szodorai, 2016). The instances of weak correlations could be due to the fact that CSFL is a new construct and therefore has not been well explored in the literature. Another possibility is that the scope of the instruments we selected to test the convergent and discriminant validity of the scale is somewhat limited for this task. Future studies with additional literature-based instruments are essential to assess the convergent and discriminant validity of the CSFL scale more fully, considering that our empirical work merely provides an initial indication of these validity aspects.
Second, although the results from Study 4 are in line with our theorizing, we only measured WTB in one category of grocery products: fish fillets. It is critical to investigate whether our results can be generalized to other product categories, as the CSFL scale is intended to be used in various situations. Additionally and relatedly, the current research, particularly Study 4, measured expressed tendencies rather than real, observable behaviors. Until additional studies apply the newly developed instrument, preferably in settings that are natural to consumers, the criterion validity of the scale cannot be considered established (Baumeister, Vohs, & Funder, 2007; Doliński, 2018; Doliński et al., 2017; Otterbring, Viglia, Grazzini, & Das, 2023).
Third, we adopted a recommended multistep approach to ensure data quality across studies (Craft et al., 2022). As a result, in Study 2, we had to exclude nearly a quarter of the sample originally recruited through Qualtrics. In our remaining studies, where Prolific Academic participants were generally very attentive to instructions, exclusions were substantially smaller, supporting research that has demonstrated high quality from this crowdsourcing platform (Eyal et al., 2021). However, it remains essential to test the psychometric properties of the CSFL scale on different samples to rule out interference from possible sampling bias. A related concern about our sample selection is that our US participants likely do not consume fish on a daily basis, as over 80% of Americans do not meet recommended seafood consumption guidelines (Jahns et al., 2014). Consequently, subsequent studies in this area should measure the frequency of fish consumption among participants and strive to diversify their samples. This would allow verification that the effects observed in Study 4 are consistent across populations with different fish consumption habits, which is particularly important given the habitual nature of food purchases combined with cross-cultural differences in food consumption habits (Gatley, Caraher, & Lang, 2014; Guerrero et al., 2009; Machín et al., 2020). In the current research, we did not determine whether our participants were the main grocery shoppers in their households. Consumers who are regular grocery shoppers might respond differently to FOP labels than those who shop infrequently, possibly because of their familiarity and experience with such information. To account for this potential confounding factor, future research should inquire about participants' familiarity with FOP labels and the frequency of food purchases. At a minimum, such studies should determine whether a participant is the primary grocery shopper in their household.

Conclusion
The use of front-of-package (FOP) labeling systems produces mixed results in academic and non-academic settings. In some cases, such labeling schemes lead to desired behavioral changes, whereas in others, these interventions lead to null results. In this work, we propose that consumers vary in terms of how susceptible they are to FOP food labeling systems. We therefore develop a 7-item instrument, which we refer to as the consumer susceptibility to FOP labeling (CSFL) scale. This tool was designed to measure the extent to which consumers rely on FOP labeling in their food choices and information search activities to meet their existing preference patterns. We confirmed good psychometric properties of the CSFL scale and conducted initial criterion validity tests, finding that the scale predicts WTB fish fillets with genuine, third-party FOP labels. We found no such effects for fish fillets with fictitious labels and no associations between the CSFL and WTB unlabeled fish fillets. Therefore, the scale may capture consumers' susceptibility to FOP labeling differently depending on the type of label (authentic vs. inauthentic). The relatively narrow set of items in the CSFL scale makes it ideal for academic and non-academic research in food retailing.

Fig. 1. Willingness-to-buy (WTB) by front-of-package labeling type. Note. WTB (y-axis) denotes willingness to buy. CSFL (x-axis) is an index of consumer susceptibility to front-of-package food labeling.

Table 1
The final set of items in the consumer susceptibility to FOP labeling (CSFL) scale.

Table 2
Pearson correlation coefficients between consumer susceptibility to FOP labeling (CSFL) and other measures in Study 3a.