How well do surveys on adherence to pandemic policies assess actual behaviour: Measurement properties of the Dutch COVID-19 adherence to prevention advice survey (CAPAS)

Background: Survey data on adherence to COVID-19 prevention measures have often been used to inform policy makers and public health professionals. Although behavioural survey data are often considered to suffer from biases, there is a lack of studies critically examining the validity, reliability and responsiveness of population-survey data on behaviour throughout the COVID-19 pandemic. Aim: We studied the measurement properties of the COVID-19 Adherence to Prevention Advice Survey (CAPAS), a novel questionnaire implemented in a repeated cross-sectional (i.e., ‘Trend’) Study and a Cohort Study in the Netherlands during the COVID-19 pandemic. Methods: The CAPAS is a questionnaire developed in March 2020 to assess social activity and adherence to COVID-19 prevention measures. Items were formulated to minimise social desirability and aid memory retrieval. Based on the COSMIN framework, we selected the most suitable test for each behavioural question. We investigated criterion validity of vaccination, testing behaviour and mobility by comparing (aggregate) trends of self-reported behaviour to trends in objective data. Responsiveness of mobility and ventilation behaviour was assessed by studying whether self-reported behaviour changed following contextual (e.g., policy) changes. Test-retest reliability of hygienic behaviour, wearing face masks, ventilation behaviour and social distancing was examined during a period in which the context was stable. Results: Overall, aggregate trends in self-reported behaviour closely corresponded to trends in external objective data. Self-reported behaviours were responsive to contextual changes and test-retest reliabilities were adequate. For infrequent behaviours, reliability improved when measures were dichotomised. We were able to examine national representativeness for vaccination, which suggested a modest overestimation of 3.7% on average.
Conclusions: This study supports the suitability of using carefully designed, self-reported surveys (and the CAPAS specifically) to study changes in protective behaviours in a dynamic context.


Introduction
Since the beginning of the COVID-19 pandemic, surveys have been developed to evaluate public views and adherence to COVID-19 preventative measures in various countries (e.g., in the UK (McBride et al., 2022), US (Bradley et al., 2021), Canada (Levitt et al., 2022), Austria and Germany (Seyd and Bu, 2022)). In the Netherlands, the Corona Behavioural Unit of the Dutch National Institute for Public Health and the Environment (RIVM) conducted a cross-sectional trend study and a dynamic cohort study that included the COVID-19 Adherence to Prevention Advice Survey (CAPAS), which measured various protective behaviours such as testing, hygienic behaviours and distancing. Such surveys have been frequently used to inform policy makers, communication campaigns and national media during the COVID-19 pandemic, in addition to the data being used for scientific articles (Betsch et al., 2020; Bradley et al., 2021). It is thus important that these self-reported behavioural data are valid and reliable (Betsch et al., 2020; Hansen et al., 2021); yet few studies have examined these measurement properties among general population samples.
Some studies investigated whether sets of items on different types of behaviours (e.g., hand washing, social distancing and wearing a face mask) jointly measure a latent construct of 'adherence' but did not study whether the individual items themselves adequately reflected participant behaviour (Kantor and Kantor, 2021; Morales-Vives et al., 2022).
* Corresponding author. E-mail address: carlijn.bussemakers@radboudumc.nl (C. Bussemakers).
Other studies have assessed the reliability of self-reported behaviour by comparing it to objective data, but only for specific behaviours on which external population data are available, such as mobility (Gollwitzer et al., 2021) or vaccination (Bradley et al., 2021), or in specific contexts where behaviour could be directly observed, such as preventive behaviour of healthcare professionals (Davies et al., 2022). We are not aware of any studies on the responsiveness of measures of protective behaviour during the COVID-19 pandemic. Because issues such as recall, social desirability and unrepresentative samples could lead to bias, the use of behavioural survey data to study (changes in) behaviour among the general population has been criticised, and researchers were urged to demonstrate the reliability, validity and responsiveness of their data (Bradley et al., 2021; Davies et al., 2022). This study therefore examined the measurement properties of the CAPAS, following the internationally recognized COSMIN Taxonomy of measurement properties (Mokkink et al., 2010). We investigated to what extent the behavioural measures in the CAPAS were reliable (free from measurement error), valid (reflecting actual behaviour) and responsive (reflecting change in behaviour over time). We employed a longitudinal approach by focusing on changes in average levels of self-reported behaviour and on the stability of individual-level self-reports over time. In this way, we provide insight into the usefulness of longitudinal behavioural surveys for informing pandemic policy makers and public health professionals. Moreover, because the CAPAS was implemented in two surveys conducted among different samples, our study includes a direct replication of the tests of validity and responsiveness of the measures.

Study design
The CAPAS was developed in March 2020 to assess behaviour during the COVID-19 pandemic. It was included in two studies: a dynamic cohort study conducted among a convenience sample and a repeated cross-sectional trend study conducted among representative samples of the Dutch population. Both studies were carried out from 2020 to 2022 by the Corona Behavioural Unit of the RIVM (see Online Supplement 1 for more information). Steps were taken in the design of the survey questions to improve recall and reduce social desirability, arguably the most important causes of misreporting behaviour (Davies et al., 2022; Timmons et al., 2021). To improve recall, questions included specific contexts (e.g., handwashing before eating, or when visiting friends) or only asked about 'the last time you did x' (e.g., distancing during the last visit to the supermarket) (Gagné and Godin, 2005; Stirratt et al., 2015). To reduce socially desirable answers, questions were preceded by a short text introducing the behaviour, followed by several concrete reasons why it might be difficult or not always possible for people to follow the guidelines (framing non-adherence as acceptable behaviour) (Stirratt et al., 2015). The questionnaire can be found on OSF (https://osf.io/kxfqw/).

Analyses and operationalization
We selected the most suitable measurement properties to test for each type of behaviour included in the CAPAS, as indicated in Table 1, considering relevance and data availability. We prioritised tests comparing self-reported behaviour to objective data (criterion validity) (Mokkink et al., 2010), followed by tests of the responsiveness of questionnaire items, assessing whether objective contextual changes led to immediate changes in reported behaviour. Third, we studied test-retest reliability using Cohort Study data. However, it should be noted that there was a time lag of a few weeks between rounds of the Cohort Study, during which the pandemic circumstances could change. As a result, inconsistencies in answers may reflect actual changes in behaviour rather than low reliability of the measures. Therefore, we only conducted this test for habitual behaviours in a relatively stable time period. We have published a pre-defined analysis plan as well as the analysis code of this study on OSF (https://osf.io/n85pd/).

Criterion validity and responsiveness
To assess criterion validity, we visually compared changes over time (trends) in self-reported behaviours to trends in external objective data. Table 2 provides an overview of the operationalization of the measures in the Cohort and Trend Study. The general hypothesis of these tests was that trends in behaviours measured in the Cohort and Trend Study correspond to those in the population as indicated by external data. It should be noted that only for vaccination in the Trend Study, and not for other behaviours, was it possible to also directly compare the share of vaccinated participants with formal registrations. Similarity herein depends on both the validity of the self-reports and the representativeness of the sample. Moreover, the objective data source for COVID-19 self-testing was sales data, a proxy measure that only indirectly reflects self-test behaviour (since people often purchase self-tests in larger quantities for later use and some groups, such as students, receive self-tests for free).
To assess responsiveness, we investigated whether levels of these behaviours increased (or decreased) when external circumstances became more (or less) favourable. An overview of the operationalizations of the behavioural data in the CAPAS and the external data is included in Table 2; see Online Supplement 1 for more details.

Test-retest reliability
To investigate test-retest reliability, we focused on rounds 9 and 10 (winter 2020) because this was a relatively stable period during which the relevant behavioural measures had been in place for a while. For ventilation, we focused on a similar period one year later (rounds 17/18, winter 2021/2022), as the Dutch policy on ventilation was set in place in autumn 2021.
All these measures except ventilation were measured on ordinal scales, so we assessed test-retest reliability using the intraclass correlation (ICC) of participants' answers in the two relevant rounds. Four questions on hand washing behaviour in different contexts were found to form a unidimensional scale using Mokken scale analyses (see Online Supplement 2), so we used participants' average score on these questions in the analyses. We used ICC consistency based on a two-way random effects model (participant and round), with a threshold of 0.7 (de Vet et al., 2011). Ventilation was a dichotomous measure, so we calculated Cohen's kappa to assess agreement between participants' scores in the two rounds, with a threshold of 0.4 (de Vet et al., 2011).
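The two statistics can be computed directly from the ANOVA decomposition; the following is a minimal sketch for two rounds (the function names and toy data are illustrative, not taken from the study's analysis code):

```python
import numpy as np

def icc_consistency(x, y):
    """ICC for consistency under a two-way random effects model, single
    measurement: (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error),
    here with k = 2 rounds (participants are rows, rounds are columns)."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    grand = data.mean()
    ms_rows = k * np.sum((data.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((data.mean(axis=0) - grand) ** 2) / (k - 1)
    ss_err = (np.sum((data - grand) ** 2)
              - (n - 1) * ms_rows - (k - 1) * ms_cols)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

def cohens_kappa(a, b):
    """Cohen's kappa for two dichotomous ratings: (p_o - p_e) / (1 - p_e)."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)                      # observed agreement
    p_e = (a.mean() * b.mean()                 # chance agreement on '1'
           + (1 - a.mean()) * (1 - b.mean()))  # ...and on '0'
    return (p_o - p_e) / (1 - p_e)

# Consistency ignores a uniform shift between rounds (unlike agreement):
print(icc_consistency([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))  # 1.0
print(cohens_kappa([0, 0, 1, 1], [0, 1, 1, 1]))           # 0.5
```

The shifted toy data illustrate why the consistency definition suits ordinal self-reports: a systematic difference between rounds does not lower the coefficient, only reordering of participants does.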

Criterion validity
Fig. 1 shows the results of the criterion validity tests: the self-reported and objective data sources corresponded closely in terms of trends over time. Fig. 1a shows that the trend in vaccination uptake in the Trend Study was the same as in the population, in line with our hypothesis. However, contrary to the hypothesis, the Trend Study had a larger share of vaccinated people than the population, although the difference was small (on average 3.7% higher from December 2021 onwards). To explore this further, we also estimated the trends in vaccination per age group (Online Supplement 3) and found that the younger the age group, the larger the overestimation of the vaccination rate in the Trend Study.
Fig. 1b and c show that in both the Trend and Cohort Study, trends in tests reported by participants corresponded to changes in the number of tests performed at test facilities. For self-tests, shown in Fig. 1d and e, we found that the trends in the number of self-tests people performed diverged from the trends in sales of self-tests. The Cohort Study showed an early increase in the self-tests people reported to have done, occurring before the increase in tests sold. Conversely, participants in the Trend Study reported a relatively high number of tests after the increase in sales, which dropped later.
Fig. 1f shows that the trend in confirmation tests reported by participants in the Cohort Study corresponded to the number of confirmation tests at test facilities. There is one minor exception: in autumn 2021, the share of confirmation tests decreased in the Cohort Study, while the number of confirmation tests performed increased. Finally, Fig. 1g provides the trend in work-related mobility in the two surveys compared to Google data on this type of mobility, also showing high correspondence.

Responsiveness
Fig. 2 provides the results of our tests for responsiveness. Fig. 2a and b show that in both the Cohort and the Trend Study, the number of visits participants reported to bars, restaurants and hotels, and to cultural venues, decreased after restrictions were imposed and increased after policies were alleviated. Two restrictions appeared to have had little impact on self-reported behaviours: the closure of night clubs (a few days after the alleviation of many restrictions) and restricting the maximum number of visitors of cultural venues to two-thirds of their total capacity, both in summer 2021. A likely explanation is that these restrictions were mild compared to other policies: even with these restrictions, the public had many opportunities to visit these venues.
Fig. 2c and d show the trends in participants' reports on wearing a face mask in public indoor settings, which also corresponded to the expected impact of policy restrictions and alleviations in both studies. When face masks were advised (first restriction), a considerable group wore them regularly or always, with more people indicating always wearing a mask when this became mandatory (second and third restriction). When the policy was alleviated, there were large decreases in wearing face masks. Fig. 2e shows that trends in reported ventilation behaviour corresponded to the expected impact of changes in the average outside temperature: with lower temperatures, fewer people ventilated their homes in both studies.

Test-retest reliability
Table 2 (fragment): face mask in public indoor settings. Share of people who wore a face mask in public indoor spaces (of people who visited these places) in the past week, indicated by never, seldom/sometimes, regularly/often/most of the times and always (measured in rounds 8-15 and 17-19 in one study and rounds 2-25 in the other). External data: changes in policy on wearing face masks in public indoor spaces (RIVM).
a Vaccination could not be modelled in the same way as the other measures in the cohort study because it is an event that occurred at a specific time point. This posed additional requirements for the data and analyses which were beyond the scope of this study.
b The Trend and Cohort Study asked participants about testing behaviour during a different timeframe (3 vs. 6 weeks, corresponding to their different measurement frequencies). To compare trends as accurately as possible, we used the same timeframes for the external data.

Table 3 shows that the ICCs of using a paper towel when sneezing and the measures of hand washing behaviour were above the 0.7 threshold, indicating sufficient test-retest reliability. For sneezing or coughing in one's elbow, wearing a face mask in indoor settings and the measures of social distancing, the consistency across the two rounds was not sufficient. A possible reason for this low reliability may be that the questions used 6 or 7 response categories (never-always), while participants may have experienced these situations less frequently. As a result, different response categories may apply to the same behavioural patterns. To explore this, we performed an additional analysis in which we dichotomised the behavioural measures (never/seldom vs. regularly/often/most of the times/always). The results show sufficient test-retest reliability for the dichotomised measures (kappa > 0.4), except for the measure of social distancing outdoors. Table 4 shows that kappa was also sufficient (>0.4) for ventilation.
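The dichotomisation step can be sketched as follows; the numeric coding of the frequency scale (0 = never up to 6 = always) and the cut point are hypothetical assumptions for illustration, not the CAPAS coding:

```python
import numpy as np

def dichotomise(responses, cut=2):
    """Collapse ordinal frequency responses to 0 (below the cut point,
    e.g. never/seldom) vs. 1 (at or above it, e.g. regularly...always)."""
    return (np.asarray(responses) >= cut).astype(int)

def cohens_kappa(a, b):
    """Cohen's kappa for two dichotomous ratings: (p_o - p_e) / (1 - p_e)."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)
    p_e = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_o - p_e) / (1 - p_e)

# Two rounds that disagree on exact categories for an infrequent behaviour
# but agree perfectly once the scale is collapsed:
round1 = [0, 1, 5, 6, 3, 1]
round2 = [1, 0, 6, 5, 4, 0]
print(cohens_kappa(dichotomise(round1), dichotomise(round2)))  # 1.0
```

The toy data mirror the mechanism described above: adjacent-category wobble that depresses an ordinal ICC disappears after dichotomisation, so reliability rises for behaviours participants rarely perform.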

Discussion
Much research and many policy decisions on behaviour during the COVID-19 pandemic relied on self-reported adherence to prevention measures. Despite the risk of bias in large-scale population studies on behaviour (Bradley et al., 2021; Davies et al., 2022; Hansen et al., 2021), relatively few studies have critically examined the measurement properties of these self-reports. We studied these measurement properties for the CAPAS, a survey of adherence to COVID-19 prevention measures implemented by the Dutch Public Health Institute (RIVM). Our results provide sufficient to good evidence for the validity and reliability of the self-reported behavioural measures in the CAPAS. Longitudinal trends in vaccination and in adherence to testing, mobility and ventilation guidelines in two population samples show good criterion validity and responsiveness when compared to objective, external data. Overall, the results are in line with earlier studies indicating that self-reported mobility followed the same pattern as external measures of mobility (Gollwitzer et al., 2021). Only for performing self-tests could this not be established, which might be due to an indirect relationship between sales of self-tests and people actually performing a test. Furthermore, at the individual level, self-reported behaviours show sufficient test-retest reliability (albeit with dichotomised measures for infrequent behaviours), except for social distancing outdoors. Our study thus shows that carefully designed self-reported measures of adherence to prevention policies can offer a sound basis for informing public health research and policy.
Although our analysis provides evidence for the reliability and validity of the measures, vaccine uptake was found to be slightly overestimated in the Trend Study, even though this study employed a representative sampling framework and weighted the data according to sociodemographic characteristics. The small overestimation can be due to inaccurate self-reports or an overrepresentation of vaccinated people in the sample. Although we cannot rule out the first explanation, we think the second played a larger role. Additional analyses indicated that the overestimation was largest among younger age groups, who are most difficult to include in general population samples, while social expectations to get vaccinated might be largest for older, more vulnerable people. An earlier study found a large overestimation of self-reported vaccination in large, unrepresentative surveys in the US (11-20%) (Bradley et al., 2021). Although we found a much smaller discrepancy, these results indicate that survey studies are best suited to examine individual-level changes over time or differences between groups well represented in the data, rather than to estimate levels of behaviour in the general population. If the aim is to infer such population-level statistics, methods other than sociodemographic weighting (such as weighting by vaccination status) have been suggested (Bradley et al., 2021).
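The weighting-by-vaccination-status idea attributed to Bradley et al. (2021) can be sketched as a simple post-stratification step; the function, sample sizes and registry share below are invented for illustration and do not reproduce the study's or that paper's procedure:

```python
def vaccination_weights(self_reported_vaccinated, registry_share):
    """Per-respondent weights that align the sample's vaccinated share
    (from 0/1 self-reports) with the share in a population registry."""
    n = len(self_reported_vaccinated)
    sample_share = sum(self_reported_vaccinated) / n
    w_vax = registry_share / sample_share          # down-weight if overrepresented
    w_unvax = (1 - registry_share) / (1 - sample_share)
    return [w_vax if v else w_unvax for v in self_reported_vaccinated]

# Hypothetical numbers: 90% vaccinated in the sample vs. 86.3% in the
# registry, i.e. a 3.7-point overestimation like the one reported above.
reports = [1] * 90 + [0] * 10
weights = vaccination_weights(reports, 0.863)
weighted_share = sum(w * v for w, v in zip(weights, reports)) / sum(weights)
print(round(weighted_share, 3))  # 0.863
```

After reweighting, estimates of other behaviours computed with these weights would no longer be distorted by the overrepresentation of vaccinated respondents, which is the rationale for the suggestion.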
A limitation of this study is that validity and responsiveness were investigated at the aggregate level, since no objective, individual-level measures of these behaviours were available. Moreover, it is possible that changes in policy may be associated with different patterns in self-reports not only because behaviours changed, but also because they affected the social desirability of adherence behaviours. Furthermore, test-retest reliability was difficult to study in the rapidly changing context of the COVID-19 pandemic, especially considering the relatively long time lag between rounds, which may have resulted in an underestimation of test-retest reliability. On the other hand, the changing pandemic context enabled us to study the responsiveness of the CAPAS measures. Further investigation of the CAPAS could involve theory-informed tests of the associations between behavioural measures and other individual characteristics to assess its construct validity (Mokkink et al., 2010; Morales-Vives et al., 2022). Moreover, we also call for further research into the measurement properties of other behavioural surveys collected throughout the COVID-19 pandemic.
As a measurement tool developed to capture adherence to behavioural recommendations that evolved in a rapidly changing context, CAPAS itself also has its limitations. The survey was initiated after the onset of the pandemic, and the inclusion of behavioural questions closely followed the Dutch government's behavioural policies and advice. To better monitor behaviour, pre-pandemic baseline measures of behaviours such as handwashing and social activity, or pre-policy measures of behaviours such as the use of face masks, would have been useful. In light of pandemic preparedness, measuring these behaviours in the general population in periods when no behavioural recommendations are in place would be valuable. Furthermore, further research could explore alternative response scales on some questions to better capture the frequency of advice-consistent behaviour. In CAPAS, the use of Likert scales for questions on the frequency of recommended hygienic behaviours had the advantage that they could be answered relatively easily by almost all participants. However, these responses only provide indirect information on the number of times participants performed behaviours. Further development of self-report measures, for example via intensive longitudinal designs employing digital capturing technologies, could lead to improved precision. Additional reflection on the development and content of CAPAS can be found in Online Supplement 4.
In conclusion, our results support the validity, at the aggregate level, and reliability of the CAPAS in two large, independent samples. This supports the usefulness of carefully designed behavioural surveys to study (changes in) adherence to prevention guidelines as well as their use to inform health policy and practice. Furthermore, CAPAS could serve as a standard tool for monitoring adherence to prevention behaviours during infectious disease outbreaks.

Table 1
Overview of measurement properties included in the study.

Table 2
Operationalization of variables for criterion validity and responsiveness tests.

Table 3
Test-retest reliability for ordinal measures.

Table 4
Test-retest reliability of ventilation behaviour.