Introduction

Current estimates suggest that just over half the world’s population have access to the Internet via computers and smartphone devices, and that the average daily Internet usage per capita is around 3 h (Statistica, 2021). Access and usage vary by region, with estimates ranging from around 80–94% for first world countries such as Australia, the UK, and USA, to 60–72% for countries with emerging economies such as China, Russia, and Turkey, and much lower rates of < 30% for poorer countries such as Indonesia, Pakistan, and Burkina Faso (Pew Research Centre, 2016). Across regions, it is also apparent that younger adults (18–34 vs. 35+), those with more education and higher income, and those who are male are more likely to report using the Internet at least occasionally (Pew Research Centre). Even so, the global trend is for those with access to use the Internet at least daily (Pew Research Centre). Hence, the sheer volume of people accessible via online means has encouraged development of online platforms, such as Amazon Mechanical Turk (Crowston, 2012) and Prolific (Palan & Schitter, 2018), to capitalize on Internet usage for research purposes.

This Internet-based survey research has become vital to psychology and the social sciences (Gosling & Mason, 2015). The instrumental utility of online questionnaires for research rests upon an assumed correspondence between the self-reported and actual levels of a construct being measured (Collins, 2003; Lietz, 2010). Unfortunately, evidence shows that a participant’s propensity to respond either diligently or perfunctorily is influenced—at least in part—by the burden of confusing or unclearly worded questionnaire items (e.g., Lenzner, 2012; Smyth & Olson, 2018). For instance, Lenzner (2012) demonstrated experimentally that participants who received less comprehensible survey items were more likely to break off from the survey (i.e., exit prior to completion), utilize nonsubstantive response options (e.g., ‘Don’t know’ or simply leave a response blank), and were less consistent over time in response to the same questions than participants who received items that were easier to comprehend. This compromises the quality of survey response data, potentially introducing measurement error that distorts correlations involving affected variables, and reduces confidence in the generalizability of estimates from descriptive statistics (DeCastellarnau, 2018; Maniaci & Rogge, 2014).

Fig. 1

An example of a brief help-request episode with the chatbot. Each of the three panels comprises a conversation history pane (above) and a text entry field for user input (below). Messages from the chatbot are aligned with the left margin, while user-submitted messages are right-aligned and visually delineated by a speech bubble. Text appearing in the user input field at the bottom of each panel has not yet been sent to the chatbot. Conversation flow progresses left-to-right across the three panels, and top-to-bottom within the conversation history pane of each panel. Note the chatbot’s robustness to casual or vague user input, including an incorrect spelling of the item being queried (i.e., joi viv instead of joie de vivre).

While these problems may arise for surveys delivered face-to-face, they are especially salient in online contexts, where investigators are not on hand to address participant confusion. Meanwhile, autonomous conversation agents called chatbots are increasingly being deployed in other online domains (virtual classrooms, technical support, government and business websites) as an effective and scalable means of approximating user support interactions (Androutsopoulou et al., 2019; Georgescu, 2018; Thorne, 2017; Zumstein & Hundertmark, 2017). Although this chatbot technology has the potential to mitigate the influence of questionnaire item confusion on the quality of survey response data, there is scant prior literature on adapting these technologies for that purpose. As such, this pilot study will investigate whether supplementing an online questionnaire with an assistive chatbot might improve the quality of elicited response data, and will explore participants’ adoption of—and experiences with—this help feature.

Questionnaires and data quality

Online questionnaires are versatile tools that circumvent the barrier of geographical distance which separates researchers from a diverse global pool of potential participants. These questionnaires can be administered at large scales without incurring the considerable costs of face-to-face survey delivery (Gosling & Mason, 2015). Further, Internet-delivered instruments largely eliminate the error-prone and labor-intensive requirement for investigators to manually input research data into statistical software (Gosling et al., 2004; Riva et al., 2003). Despite the merits of online questionnaires, their value and utility ultimately rest on the assumption that participants will read, understand, and accurately respond to questionnaire items. These assumptions are not always satisfied (DeCastellarnau, 2018; Krosnick, 1999).

A participant’s response accuracy (correspondence between reported and actual levels of the construct being measured) can be influenced by a range of factors. One such influence is the order of item presentation. For instance, within the context of personality research, it has been demonstrated that presenting general items before more specific ones tends to lead to lower satisfaction ratings than if the specific items are presented first (Kaplin et al., 2013). Similarly, for individuals with elevated health risks, presentation of specific items about health domains prior to global assessments of health status tend to produce worse self-rated health (Garbarski et al., 2015).

The characteristics of a given questionnaire item may also influence response accuracy (DeCastellarnau, 2018; Van Vaerenbergh & Thomas, 2013). This happens when the design of a question interferes with the cognitive processes involved in answering it (Lietz, 2010). Commonly cited cognitive models emphasize that completing each item involves inferring the question’s objective (comprehension), reflecting on cognitions—thoughts, feelings, remembered experiences—that are relevant to gauging subjective levels of the target construct (retrieval), evaluating whether candidate responses satisfy the stimulus query (judgement), then translating the decided answer into an appropriate outcome (response) captured on the questionnaire (Lietz, 2010).

Questionnaire items that incorporate certain features (obscure words, ambiguous phrases) have been shown to significantly increase the time and effort required to comprehend questions, which interferes with the very first step in the response process (Lenzner et al., 2010, 2011). This lack of clarity impedes participants’ ability to quickly interpret what is being asked, making accurate responding difficult (Hamby & Taylor, 2016), even for motivated, conscientious individuals (Anson, 2018; Behrend et al., 2011; Hauser & Schwarz, 2016). Moreover, the confusion can induce respondents to engage in suboptimal behaviors, such as skipping items and utilizing nonsubstantive response options (such as ‘Don’t know’ or ‘Not applicable’) (Lenzner, 2012).

Confusion-induced responding poses a considerable threat to the integrity of online survey research by artificially inflating or deflating observed means, score variability, and the strength of correlations, as well as eroding the statistical power of analyses (Baumgartner & Steenkamp, 2001; Maniaci & Rogge, 2014; Van Vaerenbergh & Thomas, 2013). It is possible to lessen the impact of such errors to some extent by incorporating attention-checking items into questionnaires to detect unsound response data (Abbey & Meloy, 2017; Curran, 2016; Niessen et al., 2016; Oppenheimer et al., 2009; Paas et al., 2018). However, the effective prevention of item confusion—and its erosive effect on data quality—is hindered considerably by the sheer scale and remote nature of the online context, where the survey setting is far removed from researchers who could quickly provide needed clarity.

Past efforts to reduce the burden of responding have targeted the structure of questionnaire items, such as by separating the elements of grid items into discrete questions (Roßmann et al., 2018). Others have instead attempted to dissuade speeded, inaccurate responses through the use of implied observers and admonitory survey instructions (Ward & Pond, 2015). Yet there has been no research assessing the effectiveness of empowering participants to improve their understanding by directly seeking clarification. One possible approach to enhancing item clarity might thus be the provision of resources that enable survey respondents to inquire about the meanings of confusing questionnaire items, thereby facilitating greater response accuracy. Although this is typically actioned via an attendant researcher in phone-based or face-to-face settings, such a resource could potentially be approximated in online surveys through the use of a chatbot.

Scalable intelligent support

A nascent area of research inquiry is the use of technology-enabled approaches to facilitate research conduct and dissemination. Such approaches have been applied (1) at the back-end, to identify and remove poor-quality data after survey completion but prior to the intended analyses, and (2) at the front-end, as a substitute or complement to human-driven interactions, to cost-effectively dispense psychoeducation and treatment resources and to help match the appropriate treatment resource to the individual.

Several statistical approaches and automated programs have been developed to identify patterns in collected data that may signal careless responding. Meade and Craig (2012) defined 12 possible indices to detect poor quality data, including time taken to complete the survey, total score across a number of sham items used to detect careless responding, correlations among item pairs tapping into the same construct, responses to attention items, and the number of identical responses in a row. Subsequent research has automated the detection of response characteristics such as those defined by Meade and Craig. For instance, Buchanan and Scofield (2018) developed a Google Chrome plug-in to automatically flag respondent data that were suspicious in terms of time taken to complete each page of the online survey, performance on manipulation check items, the distribution of participant responses and number of response options used, and click counts indicating engagement with the items on each page of the survey. Using a cut-off of at least two indicators of untrustworthy data (out of a possible five), the authors found that their algorithm had excellent sensitivity and specificity, detecting 100% of automated data designed to be problematic and 99% of low-effort data, while falsely flagging only 2% of the high-effort data. They argued that this may be an effective and quick way to filter out data after collection but prior to analysis.
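To make this screening logic concrete, the following minimal R sketch (not Buchanan and Scofield’s plug-in code; the column names duration_seconds and attention_check_passed and the cut-off values are hypothetical) flags a case as suspect when at least two of three indicators co-occur, analogous to the two-indicator rule described above.

  library(dplyr)

  # Longest run of identical answers across a set of item columns ("longstring")
  longest_run <- function(x) max(rle(as.numeric(x))$lengths)

  screen_careless <- function(df, item_cols,
                              min_seconds = 120,      # assumed minimum plausible completion time
                              max_identical = 10) {   # assumed longstring threshold
    df %>%
      rowwise() %>%
      mutate(
        flag_speed      = duration_seconds < min_seconds,
        flag_longstring = longest_run(c_across(all_of(item_cols))) >= max_identical,
        flag_attention  = !attention_check_passed,
        n_flags         = sum(flag_speed, flag_longstring, flag_attention),
        suspect         = n_flags >= 2                # two or more indicators => flag for review
      ) %>%
      ungroup()
  }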

Chatbots have been employed at the front-end to interact directly with participants as they engage with a product (whether an intervention, healthcare service, or survey). Chatbots are autonomous conversational agents underpinned by algorithms that enable computers to learn human languages through examples and experience (Natural Language Processing [NLP]; Bengio et al., 2003; Cambria & White, 2014; Chowdhury, 2003; Nadkarni et al., 2011). These agents then use their acquired capabilities (learned patterns of linguistic usage; Tait & Wilks, 2019) to autonomously engage in conversational exchanges; a chatbot is effectively a computer-driven technology that approximates human interaction—you communicate with it and it responds. As robust and highly scalable intelligent technologies, chatbots are becoming ubiquitous on the Internet in a variety of conversation-based roles (virtual assistants, customer service agents; Chakrabarti & Luger, 2015; Hasler et al., 2013; Keeling et al., 2010; Radziwill & Benton, 2017).

A recent review of the literature underscores both the potential for wide-ranging psychological applications of chatbots and the current paucity of empirical evaluations of their utility (n = 10 studies) (Vaidyam et al., 2019). Three studies demonstrated the efficacy of chatbot-disseminated treatment content for alleviating depressive and anxiety symptoms (Fitzpatrick et al., 2017), reducing stress-related alcohol consumption and improving fruit consumption (Gardiner et al., 2017), and enhancing psychological well-being and reducing perceived stress (Ly et al., 2017). However, each of these studies used non-clinical samples, and the comparison conditions were self-guided information resources or a wait-list control period.

Evaluations among clinical populations have demonstrated the utility of chatbots for identifying clinical symptoms of depression (Philip et al., 2017) and posttraumatic stress disorder (Lucas et al., 2017), dispensing discharge information to patients following a hospital stay (Bickmore et al., 2010a, b), and promoting adherence to prescribed medication among patients with schizophrenia (Bickmore et al., 2010b). A chatbot used with military service members with potential PTSD symptoms elicited more symptom information than face-to-face interviews (Lucas et al., 2017), suggesting that under some conditions (preference for anonymity, anxiety around people, etc.) chatbots may be particularly advantageous. Across the reviewed studies, acceptability and reported ease of use of the chatbots were generally high. Less is currently known about the patient safety implications of chatbot use, though Vaidyam et al.’s review highlights very limited adverse reactions (two complaints out of ~800 cases) from the few studies that have documented this.

Autonomous conversational agents offer several potential benefits to researchers, from reductions in research costs (e.g., personnel remuneration) to the enhancement of available support for remotely located participants (Zumstein & Hundertmark, 2017). Interacting autonomously and intelligently with human survey respondents over the Internet is precisely the type of problem that chatbots have been designed, and demonstrated, to solve (Araujo, 2018; Ciechanowski et al., 2019; Clément & Guitton, 2015; Inkster et al., 2018; Pereira & Díaz, 2019). Indeed, interest in the potential applications of intelligent technologies within research contexts is growing, such as the employment of textual or embodied (i.e., graphically simulated) agents as virtual research personnel, and for approximating the administration of face-to-face conversational interviews (Conrad et al., 2015; Hasler et al., 2013). Despite these investigations, however, the use of assistive text-based chatbots for mitigating questionnaire item confusion to improve data quality in online survey research remains largely unexplored.

The present study

Thus, the aim of the present study is to explore the data quality, feature adoption, and participant experience associated with using a chatbot to assist completion of an online survey. The investigation seeks to answer three research questions (RQ): (1) will participants access the help feature if available? (2) do chatbot users perceive the feature as helpful? and (3) does chatbot use improve data quality?

Propensity to use the chatbot (RQ1) will be measured as the proportion of participants in the chatbot condition who choose to initiate a valid query (i.e., one pertaining to the present questionnaire) via the provided help feature. Meanwhile, participant satisfaction with the chatbot (RQ2) will be gauged via optional user feedback; this exploratory research question is primarily aimed at informing future research and practice. Finally, the influence of chatbot use on data quality (RQ3) will be determined through between-group comparisons of response accuracy, as assessed independently for data from the chatbot and control groups.

Response accuracy is operationalized as the strength of positive correlation between a validated measure with known properties (target variable) and a tailored test item (challenge item) that has been formulated to correlate strongly with that target (mirroring its scale and meaning) while deliberately incorporating item characteristics that induce item confusion and predispose satisficing behaviors (e.g., obscure words, ambiguous phrases; Lenzner et al., 2010). Postulating that confusing items increase measurement error and weaken correlations, the relationship between a challenge item and its target variable should differ in strength between individuals who gain assistance to resolve their confusion and those who receive no support. A group disparity in data quality is therefore defined as a significant difference between two corresponding Pearson’s correlations observed independently in the data from each condition (chatbot, control) for an identical pairing of a given challenge item and target variable (employing Fisher’s r-to-z transformation to enable comparisons in standardized units with a normal distribution; Fisher, 1915).
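For illustration, this comparison can be computed in a few lines of base R; the correlations and group sizes shown are placeholder values, not study results.

  # Compare two independent Pearson correlations via Fisher's r-to-z transformation
  compare_correlations <- function(r1, n1, r2, n2) {
    z1 <- atanh(r1)                              # Fisher's z for the chatbot group
    z2 <- atanh(r2)                              # Fisher's z for the control group
    se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # standard error of the difference in z
    z  <- (z1 - z2) / se
    c(delta_r = r1 - r2, z = z, p = 2 * pnorm(-abs(z)))  # two-tailed p value
  }

  # e.g., a challenge-target correlation of .45 (n = 149) versus .30 (n = 151)
  compare_correlations(r1 = .45, n1 = 149, r2 = .30, n2 = 151)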

Participants were randomized to either a chatbot or self-guided online survey completion condition. In both instances, a subset of the items was intentionally worded in a vague and confusing manner to impact data quality (control condition) and encourage chatbot use (chatbot condition). It was predicted that individuals in the chatbot condition would:

  1. Use the chatbot feature specifically for these ambiguous items;

  2. Report satisfaction with use of the chatbot; and

  3. Demonstrate improved data quality relative to the control condition.

Method

Participants

A sample of 300 English-speaking adults representing the general Australian population was recruited into the present study. This sample size was based on power calculations, with power set at .80 and alpha at .05 (two-tailed), showing that a small but non-trivial standardized mean difference (d > .32) could be detected with 150 participants per group. This was deemed sufficient, balancing the desire for a sample well powered to detect meaningful effects against unnecessary oversampling and the likelihood of trivial effects reaching significance in an over-powered study.
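The reported calculation can be reproduced with, for example, the pwr package in R (the software actually used is not reported), solving for the smallest effect detectable with 150 participants per group:

  library(pwr)

  # Independent-samples t-test, n = 150 per group, alpha = .05 (two-tailed), power = .80
  pwr.t.test(n = 150, sig.level = .05, power = .80,
             type = "two.sample", alternative = "two.sided")
  # Returns d of approximately .32, the smallest detectable standardized mean difference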

Table 1 displays the distribution of participant demographics across experimental conditions, along with results from chi-square analyses conducted to assess the significance of demographic differences between the chatbot and control groups. Fisher’s exact tests (two-sided) indicated that the experimental conditions were demographically comparable, revealing no significant between-groups differences with respect to age group (p = .062), gender identity (p = .325), or educational attainment (p = .928).

Table 1 Distribution of participant demographics across experimental conditions

Comparing chatbot users and abstainers

Despite 1:1 random allocation across chatbot and control conditions, 60 of the respondents allocated to the chatbot condition subsequently opted to abstain from engaging with the experimental manipulation. A within-group chi-square analysis using Fisher’s exact test (two-sided) was conducted to assess whether individuals who used the chatbot differed from the abstainers on demographic factors. Those who engaged with the chatbot differed significantly from those who did not on age group (χ2 = 12.19, p = .02, V = .29), but not on gender identity (χ2 = 2.01, p = .36, V = .13) or educational attainment (χ2 = 10.40, p = .13, V = .26). Seventy of the 89 valid cases (79%) in the chatbot-user group were from the 18–25 (n = 32) and 26–35 (n = 38) age groups, whereas these age groups accounted for a smaller share of the abstaining group (55% overall; n = 11 for 18–25 and n = 22 for 26–35). Thus, chatbot abstainers tended to be older (see Table S1 in the supplementary file). The two groups also did not differ in terms of overall well-being score (Musers = 61.86 and Mabstainers = 61.17; t(147) = .225, p = .823, Cohen’s d = .04), extraversion (Musers = 2.92 and Mabstainers = 3.16; t(147) = .988, p = .325, Cohen’s d = .16), or conscientiousness (Musers = 4.72 and Mabstainers = 4.85; t(147) = .538, p = .591, Cohen’s d = .09).

Limiting demographic comparisons to those chatbot participants who used the feature (n = 89) and a propensity score matched subsample of the control group (n = 89), the pattern of non-significant demographic differences for age, gender identity, and educational attainment reported in Table 1 for the whole sample replicated for this subsample (see Table S2 in the supplementary file for further details).

Apparatus

Chatbot

An autonomous conversational agent was specifically designed and developed for this study by the first author using the IBM Watson Assistant service (see Fig. 1 for an example help episode). The chatbot’s knowledge domain was constrained to a subset of the present questionnaire (see measures overview below), along with anticipated queries related to its use (e.g., privacy, meaning of words, etc.). Primary design and performance criteria included (i) absolute quarantine from participant data; (ii) sensitivity in entity identification (the specific word or item being queried); (iii) specificity in user intent discrimination (seeking definition, clarification, information); (iv) accuracy in response selection; (v) consistency in response delivery; and (vi) curtailment of irrelevant digressions during conversations. The preclusion of behaviors that might make the chatbot an unanticipated source of measurement error (varying explanations between users, allowing or enabling distraction from the questionnaire) was a particular focus during the iterative training and testing process.

We were deliberate in our decisions about how to incorporate the chatbot into the questionnaire. When not in use, both conditions saw identical page layouts except for an unobtrusive button for summoning the chatbot. When summoned, the chatbot took up the whole screen until dismissed. This was done to ensure: (1) a uniform chatbot user experience regardless of screen size, and (2) that the presentation of and interactions with the questionnaire itself would be identical for every participant across both conditions.

Questionnaire

An online questionnaire (Qualtrics) was configured to integrate the chatbot with one of two survey participation conditions. The chatbot condition provided optional access to chatbot assistance during questionnaire completion, while the control condition was unassisted. Aside from chatbot access and feedback items specific to chatbot use, the questionnaire content and layout were identical across conditions.

We could not guarantee that participants would complete the study on a computer rather than a mobile device. Several steps were taken to mitigate the risk that participant experience might differ according to device used to access the survey. The questionnaire was developed in accordance with Qualtrics’ instructions for mobile optimization. Custom code was added to implement an intuitive but uncluttered user interface that looked and worked the same way on both small and large screens. Testing (n = 5 end users) confirmed the functional equivalence between mobile and desktop in terms of design, user experience, and integration of the chatbot.

Measures overview

In order to establish the potential influence of chatbot use on data quality, the present study required measures that were expected to correlate based on a body of prior studies. For this purpose, validated constructs of subjective well-being and life satisfaction were chosen from the well-being literature (Personal Well-being Index [PWI]; Cummins et al., 2003), and have been shown to correlate strongly with each other (Cummins et al., 2018). Personality correlates of well-being—as demonstrated in a recent meta-analysis (Anglim et al., 2020) —were also included as an additional check on response accuracy (Ten-Item Personality Inventory [TIPI]; Gosling et al., 2003). The conscientiousness construct in particular provides further opportunity to evaluate whether individuals differ in their use of the chatbot, with the expectation that more conscientious individuals will be more compliant with study instructions (Bowling et al., 2016). Further, challenge items devised to impact data quality were worded to mirror the well-being constructs such that, absent item confusion, these items would be expected to correlate with the target variables. Measures unrelated to data quality were based on chatbot session logs and user feedback.

Challenge items

Two challenge items were crafted to correlate strongly with the PWI variables by approximating their underlying meaning and scales, while deliberately incorporating syntactic and semantic features that increase difficulty in responding (Lenzner et al., 2011; Lietz, 2010). One item (ITEMAMBIG) employed ambiguous phrase structure by conflating several disparate life domains (“I am satisfied with my work, home life, and relationships”). In constructing the item, we followed the example of double-barreled questions, asking about several distinct things while offering the participant only a single response. We chose three domains (work, home life, and relationships) that most respondents can readily evaluate, but with which a participant may hold differing levels of satisfaction.

The second item (ITEMOBSCURE) invoked obscure terminology by roughly paraphrasing a statement of joyous life satisfaction using foreign words (“I feel a sense of joie de vivre when thinking about my life”). We chose an uncommon phrase to increase the challenge of the item, with the expectation that lack of familiarity with the phrase, and wording that did not provide context to guess the meaning, would produce the confusion we sought to elicit. Google Books Ngram Viewer confirmed the infrequency of this term in books in its library (< 0.00002% frequency).

The items were scored on a scale identical to that of the PWI (cf. Target variables below) except for textual anchors modified to fit the item wording (“Completely Disagree”, “Completely Agree”). Higher scores indicate greater satisfaction.

Target variables

PWI

The PWI is a short inventory that measures perceived satisfaction across each of seven life domains (living standard, health, achievement, relationships, safety, community, security; Cummins et al., 2003) along an 11-point scale (0-to-10, delimited by anchors “Not Satisfied”, “Completely Satisfied”). These core domains yield a single score (PWISWB) representing an individual’s subjective well-being (α = .70) (International Wellbeing Group, 2013). An auxiliary item (“How satisfied are you with your life as a whole”) provides an additional—and correlated—measure of general life satisfaction (PWILIFE). Higher scores indicate greater levels of subjective satisfaction with the life domain being measured. The validity of this measure has been established previously, with support for a single factor structure and demonstrated predictive utility of the PWI for assessing global life satisfaction in a large Australian sample of 45,192 adults (Richardson et al., 2016).
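For reference, scoring can be sketched in R as follows, assuming the composite is the mean of the seven 0–10 domain ratings (some applications rescale this to 0–100) held in hypothetical columns pwi_1 to pwi_7, with pwi_life holding the general life-satisfaction item.

  library(dplyr)
  library(psych)

  scored <- survey_data %>%                           # survey_data is a hypothetical data frame
    mutate(PWI_SWB  = rowMeans(across(pwi_1:pwi_7)),  # subjective well-being composite
           PWI_LIFE = pwi_life)                       # single-item general life satisfaction

  psych::alpha(select(survey_data, pwi_1:pwi_7))      # internal consistency of the seven domains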

TIPI

A four-item subset of the Ten-Item Personality Inventory (Gosling et al., 2003) was used to measure levels of trait extraversion (TIPIEXTRV) and conscientiousness (TIPICONSC). Each item used a seven-point scale (“Disagree Strongly”, “Agree Strongly”) to record the extent of a participant’s self-identification with trait-specific attributes (e.g., “I see myself as reserved, quiet”). Higher scores indicate greater identification with the trait attribute being measured. The TIPI is optimized for content validity rather than internal consistency, and because it has only two items per scale, reliability estimates appear low for both the Extraversion (α = .68) and Conscientiousness (α = .50) scales (Gosling et al., 2003). The initial validation study for the TIPI demonstrated acceptable model fit for the proposed factor structure, strong correlations with the same constructs measured by a longer, previously validated personality inventory, strong test–retest reliability over 2 weeks, and external validity in terms of correlations with constructs previously linked to personality (e.g., self-esteem and depressive symptoms) (Gosling et al., 2003). Factor structure and convergent validity have been confirmed subsequently in new samples (e.g., Ehrhart et al., 2009; Myszkowski et al., 2019).

Feedback

Optional feedback items at the end of the questionnaire gauged participant satisfaction with the chatbot. Items measured ease of use (“I found it easy to use the help feature”), perceived utility (“The help feature made it easier to understand and answer the questions”), and attitudes toward wider chatbot availability in online surveys (“I would like more questionnaires to provide this kind of help feature”). Participant responses were captured on a non-numerical scale delimited by textual anchors indicating either negative or positive feedback (“Strongly Disagree”, “Strongly Agree”) and bisected by a neutral response option (“Neither agree nor disagree”). Scores for the feedback items were used separately to represent three dimensions of participant satisfaction (ease of use, utility, feature acceptance).

Usage

A count of chatbot interactions with unique users was automatically generated by the IBM Watson Assistant service that powers the chatbot, quantifying the questionnaire sessions during which a participant submitted at least one query to the chatbot. Chatbot activity logs record the particular questionnaire items targeted by valid queries (entities that the chatbot’s underlying neural network is trained to recognize), along with the form of assistance requested (the user’s intent—defining words, clarifying phrases).
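The usage metrics reported in the Results could be tallied from such logs along the following lines. This is a sketch only: chat_log and its columns (session_id, entity, intent) represent a hypothetical flattened export with one row per user message, not the native IBM Watson Assistant log format.

  library(dplyr)

  chat_log %>%
    summarise(sessions          = n_distinct(session_id),   # unique query sessions
              messages          = n(),                      # total messages received
              messages_per_user = n() / n_distinct(session_id))

  chat_log %>%
    count(entity, intent, sort = TRUE)   # which items were queried, and what kind of help was sought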

Procedure

Participants were recruited via Prolific Academic (www.prolific.ac), a crowdsourcing platform providing access to pre-screened participants curated to satisfy the ethical and methodological requirements of academic researchers (Palan & Schitter, 2018; Peer et al., 2017). Survey invitations were distributed via Prolific. Interested respondents were redirected to Qualtrics, whereupon consenting participants were randomized across chatbot and control conditions. Both conditions completed a questionnaire comprising demographics (age, gender, education) and the measures (PWI, TIPI, challenge items) in the same order. Aside from chatbot-specific instructions and feedback questions, both conditions employed identical content.

While the control group progressed unassisted from one section to the next until done, the chatbot group was notified about the help feature before entering the measures section and could repeatedly summon and dismiss it as needed via an on-screen button. Chatbot use was entirely optional, and the help feature was only accessible for items within the measures section. Participation concluded upon questionnaire completion. In accordance with the minimum pro rata hourly compensation stipulated by Prolific to preclude exploitation, participants were reimbursed with a nominal payment unlikely to induce coerced or risky behavior. Involvement in the study remained entirely anonymous and voluntary, and respondents were free to discontinue at any time. Ethics approval was obtained prior to conducting the study. Informed consent was obtained in advance, and the privacy rights of all participants were observed. While ethics approval does not permit public availability of participant data, the coding of the chatbot may be requested from the corresponding author.

Data analysis

Prior to analysis, data underwent assumption checking. There were no missing values for survey items. The data distributions for all variables after cleaning were sufficiently normal, based on cut-offs for absolute skew and kurtosis (Mishra et al., 2019). Several outliers were identified, but these were low in number (< 1% of overall sample), and scores were within possible scale ranges. As such, these cases were retained.

A series of analyses was conducted to address the three key research hypotheses. Evaluation of chatbot usage (Hypothesis 1) and user satisfaction (Hypothesis 2) relied on descriptive statistics from those in the chatbot condition who utilized the chatbot feature. Because a substantial portion of individuals within the chatbot condition (60 out of 149) did not utilize the chatbot function, two strategies were used to evaluate group differences between chatbot and control participants in correlation strength (Hypothesis 3). First, an intention-to-treat (ITT) approach was utilized, in which correlations were compared across groups using the whole sample. This retains randomization, but likely underestimates group differences since it includes people in the chatbot group who did not use this feature. A second approach, to augment these ITT results, was to limit the chatbot group to those who used the chatbot feature (n = 89) and use propensity score matching in an attempt to balance the chatbot and control groups. We used the R package MatchIt (Ho et al., 2011) for propensity score matching, with 1:1 nearest neighbor matching across groups based on the demographic variables available within our dataset (age, gender, educational attainment). As shown in Table S2 (supplementary file), matching resulted in non-significant differences between groups for these demographics, mirroring the non-significant group differences for the whole sample produced by randomization (see Table 1).
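The matching step can be sketched as follows; the data frame and variable names are assumptions for illustration, not the authors’ code.

  library(MatchIt)

  # Chatbot users (n = 89) plus all control participants; 'group' is the treatment indicator
  m <- matchit(group ~ age_group + gender + education,
               data   = users_and_controls,
               method = "nearest",   # 1:1 nearest-neighbor matching on the propensity score
               ratio  = 1)

  summary(m)                # balance diagnostics for the demographic covariates
  matched <- match.data(m)  # matched subsample used for the correlation comparisons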

Results

Hypothesis testing

H1: Chatbot usage

On average, individuals in the chatbot condition took longer to complete the survey (M = 5.83 mins, SD = 4.85) relative to the control group (M = 3.92 mins, SD = 4.85); t(238) = 2.94, p = .004, Cohen’s d = 0.39. Eighty-nine participants (60% of initial chatbot group) chose to utilize the chatbot. Query validity was verified via aggregated usage data from the IBM Watson Assistant service. Overall, there were 251 individual messages received by the chatbot across 89 unique query sessions (M = 2.82 messages). Approximately 87% of queries targeted challenge items (ITEMOBSCURE = 69 queries [78%]; ITEMAMBIG = 8 queries [9%]). Meanwhile, only about 4% of queries addressed PWI items, and another 9% were irrelevant.

H2: User satisfaction

Table 2 summarizes feedback from chatbot users, revealing the distribution of negative, neutral, and positive sentiments relating to their user experience. The aspect of chatbot use that received the highest frequency of negative responses was ease of use, but this accounted for less than one-sixth of users. Chatbot usefulness and the desirability for wider chatbot availability each attracted fewer unfavorable responses. Although a considerable number of participants responded neutrally on each feedback item, the majority of feedback across every user experience dimension was positive.

Table 2 Summary of Participants’ Feedback about their User Experience with the Chatbot

H3: Data quality

Bivariate correlation analyses were conducted separately for ITEMAMBIG and ITEMOBSCURE to assess relationships between each challenge item and the target variables (PWISWB, PWILIFE, TIPIEXTRV, TIPICONSC). Correlations within each of the two analyses were assessed separately for the chatbot and control groups. The results are displayed in Table 3 (full sample) and Table 4 (propensity matched sample), including confidence intervals around each value of Pearson’s r and the difference in corresponding r values between experimental conditions (Δr).

Table 3 Correlations and confidence intervals by condition between challenge and target measures for intention-to-treat sample (n = 300)
Table 4 Correlations and confidence intervals by condition between challenge and target measures for propensity score matched subsample (n = 178)

Full sample results

Following Cohen (1988), absolute differences in Pearson’s r between the two groups of Δr > |.1| were taken to indicate small but meaningful effects. For the full sample, such differences were observed between ITEMOBSCURE and TIPIEXTRV, as well as between ITEMOBSCURE and TIPICONSC. In both instances, the larger value for Pearson’s r was observed in the data for the chatbot condition.

Both challenge items were significantly related to every target variable within data from both the chatbot and control conditions. Positive values for Δr were observed across all inter-correlation comparisons except for the relationship between ITEMOBSCURE and PWILIFE, revealing a reasonably consistent trend toward stronger underlying correlations for the chatbot group. However, none of the observed differences reached significance in comparisons based on Fisher’s z-transformation of correlations.

Propensity matched sample results

When comparing those who used the chatbot feature against a propensity matched control group, three results showed small but meaningful differences in magnitude of correlations across the groups (Δr > |.1|): correlations between (1) ITEMAMBIG and TIPIEXTRV, (2) ITEMOBSCURE and TIPIEXTRV, and (3) ITEMOBSCURE and TIPICONSC. Across all correlation pairs, magnitude of correlation was greater for the chatbot group, though only the difference for ITEMOBSCURE and TIPIEXTRV was significant.

Discussion

Online questionnaires are indispensable tools for contemporary social and psychology research (Gosling & Mason, 2015), but their ubiquity belies fundamental methodological caveats. Questionnaire items that are ambiguous, obscure, or otherwise confusing are thought to impose increased cognitive demands on survey respondents (Lenzner et al., 2010, 2011), predisposing them to engage in compensatory behaviors theorized to alleviate the heightened burden (Krosnick et al., 1996). Such behaviors can compromise the accuracy of responses captured in the survey data, undermining subsequent statistical analyses and the research findings they inform (Van Vaerenbergh & Thomas, 2013). Predicated on the established applicability of autonomous conversational agents to analogous problem domains (Radziwill & Benton, 2017; Zumstein & Hundertmark, 2017), this pilot study thus set out to explore the potential utility of user support chatbots for bolstering the integrity of Internet-based survey research. The present findings broadly support the chatbot help feature, showing that it was used, perceived as useful, and associated with modest improvements in data quality.

Key findings

The majority of participants with access to assistance did indeed use the chatbot, doing so primarily for help with the confusing challenge items (Lenzner et al., 2010). Feedback from chatbot users was largely positive, with participants reporting that the chatbot made survey completion easier. This might be explained by the chatbot’s overt role in resolving item confusion, thereby pre-empting the perceived response burden associated with satisficing (Barge & Gehlbach, 2012). Usability was rated positively, and most chatbot users endorsed the prospect of wider availability of similar assistance in online surveys, mirroring prior findings of chatbot acceptance (Clément & Guitton, 2015).

Of the two challenge items that were included to evoke confusion-induced suboptimal responding (Lenzner et al., 2010), ITEMOBSCURE generated the majority of requests for chatbot assistance. This disparity in chatbot utilization revealed an interesting relationship between chatbot usage and data quality; data quality was improved (relative to controls) where participants actually used the chatbot to resolve item confusion, but the greatest gains in correlation magnitude occurred for those relationships that were found to be smaller in both groups (i.e., those involving the personality target measures rather than the well-being target measures). Thus, the benefits of chatbot functionality may be most pronounced where the population correlation values are small, and subtle effects due to item confusion may make the difference between a significant and non-significant finding in one’s study.

It is also noteworthy that two small yet meaningful improvements in correlation magnitude for the chatbot group were found for the ITEMOBSCURE item, compared to one for the ITEMAMBIG item. In light of the rarity of chatbot queries targeting ITEMAMBIG, this finding was unsurprising. One possible reason for the unanticipated ineffectiveness of this challenge item in eliciting requests for assistance might be the generally healthy sample recruited into this study (scoring near or above the Australian well-being average; Cummins et al., 2003). It is conceivable that individuals who are broadly satisfied with each of the life domains that are conflated by ITEMAMBIG might perceive no conflict when rating their satisfaction across those domains collectively.

Finally, an age effect was found for use of the chatbot feature among those given access to it. Whereas 79% of participants in the chatbot condition who used the feature were in the 18–25 and 26–35 age groups, only 55% of chatbot abstainers were within these age groups, suggesting that older participants were less likely to use the chatbot feature. We propose several plausible explanations for these age-related effects. First, age may act as a proxy for acquired knowledge. Younger participants may be less likely to have been exposed to the phrase joie de vivre (the challenge item most often prompting chatbot use in the current study), and hence the need for the chatbot function may skew towards younger participants. Second, there may be age-related differences in help-seeking approach. There is some evidence to suggest that older individuals may have less interest in using technology (Ellis & Allaire, 1999). Further, older individuals may prefer to interact with a human to find out information (Nadarzynski et al., 2019; Van der Groot & Pilgrim, 2020).

Study limitations

In designing this study, the authors chose to base the challenge items (used to impact data quality) on the item characteristics most often linked to confusion and response errors (obscure or ambiguous wording; Lenzner et al., 2010, 2011; Lietz, 2010), but there are other ways to manipulate item difficulty that might lead to different usage patterns (e.g., response scales; DeCastellarnau, 2018). The use of only two challenge items in an otherwise brief and uncomplicated questionnaire is not reflective of long, onerous surveys. This likely led to an underestimation of the protective influence of chatbot use on data quality in the present findings, particularly in light of the lack of item confusion linked to ITEMAMBIG.

A second limitation is the choice of items for manipulating participant confusion. In testing the utility of a chatbot feature, the present study design required a trade-off between experimental control (as optimized in the present study) and ecological validity. Our design was chosen under the assumptions that (i) uncommon words would elicit confusion, and (ii) this confusion would prompt chatbot utilization. The chatbot use logs suggest we were successful in generating this confusion for the joie de vivre item. In contrast, for many existing scales participant confusion is likely to be harder to predict, since scale developers will often generate items that minimize jargon and vague or uncommon terminology. Thus, there would be considerable uncertainty in how many participants would be needed to evaluate chatbot utility for existing scales, where the items are understood by the majority of participants, rendering this approach less practical for research. Even so, we recognize that the level of confusion may determine whether an individual seeks help, whether from a human or a chatbot. It is presently unclear what level of confusion is needed to elicit help-seeking behavior; further research is warranted to address this.

A third limitation of this study was the high level of educational attainment in the sample (over half had graduate-level qualifications). This likely counteracted the impact of the challenge items, enhancing response accuracy for the control group and limiting the scope for detecting group differences in data quality. For example, participants might have resolved the seeming ambiguity of ITEMAMBIG by drawing inferences about the item’s likely objective based on its similarity to the preceding well-being questions (Tourangeau et al., 2000). However, the consistently weaker relationships among challenge and target measures in the control group data—in terms of absolute strength, including where no significant group differences were observed—suggest that a sample with more representative educational attainment might widen the potential difference between unassisted and supported response accuracy, improving the ability to detect an effect for chatbot use.

Finally, as a pilot study, it was unclear how many individuals would engage with the chatbot if assigned to this treatment group. The high number of participants assigned to the chatbot condition who did not utilize this feature led to a reduction in power that may have contributed to several null findings. The present findings may thus help future studies to more accurately calculate required sample sizes. This non-compliance also threatens validity by potentially disrupting the balance achieved through randomization, and it is unclear what reasons account for failure to use the chatbot (e.g., simply failing to see the feature versus refusing to use it). It is reassuring that abstaining participants in the chatbot group did not differ from those who used the feature in terms of demographics and psychological constructs, with the exception of age. The lack of a group difference in conscientiousness scores, in particular, argues against a response bias whereby more conscientious participants comply more fully with study instructions. Even so, more expansive testing of individual difference factors in future studies may enhance understanding of who is more likely to avail themselves of chatbot support during survey completion.

Future directions

Both the pattern of observed findings and the above-mentioned study limitations provide directions for further research. First, as a pilot study, tests of the chatbot’s functionality were constrained to two types of confusing items. Evaluating the provision of chatbot functionality across a wider range of manipulated survey design features (e.g., DeCastellarnau, 2018) in future studies could help to elucidate the contexts in which such functionality is likely to be adopted and have the greatest impact on data quality. A key challenge in such studies is anticipating the types of queries that participants are likely to put to the chatbot.

Second, the present study evaluated demographic and personality factors that may be associated with acceptance and utilization of the chatbot feature. Age-related effects in chatbot usage were evident, but require further examination. Factors such as level of motivation for completing the survey, preference for anonymous online communication versus human-to-human interactions, and comfort level with technology may serve as moderators of chatbot utilization and benefit. Level of access to and frequency of use of the Internet may also influence uptake of the chatbot. For instance, individuals who regularly use the Internet may have previously encountered chatbots, and may thus have prior experience and formed expectations that guide their interactions with chatbots in future.

Relatedly, an individual’s reading level, cognitive ability, and mental state may also determine the value they may derive from a chatbot. While chatbot functionality could plausibly help respondents to provide more accurate data in clinical contexts (e.g., to inform diagnosis and treatment allocation; Vaidyam et al., 2019), the text-based elements of a chatbot assume a level of reading ability (and motivation) that may preclude some potential end-users. Co-design principles may be useful for determining the appropriate amount of text-based content and the word comprehension levels that would make the product accessible to a broader audience.

Conclusions

In summary, findings showed that chatbot assistance, when utilized, may make modest contributions to enhancing data quality. Participants were inclined to seek help from a chatbot when given the option, generally found the feature to be beneficial, and broadly endorsed its wider adoption in online surveys. However, parameter constraints due to the exploratory nature of this pilot study are believed to have led to an under-estimated effect for chatbot use. Several noted limitations warrant further study, including expanding the types and volume of challenge items used, sampling a more diverse group (especially with respect to educational attainment), and a more targeted focus on reasons for non-compliance among participants who fail to utilize the chatbot feature. These lines of inquiry would enhance understanding of the potential utility of chatbots for survey completion and data quality.

Open Practices Statements

None of the data or materials for the experiments reported here is available, and none of the experiments was preregistered.