
Human-AI Collaboration in Thematic Analysis using ChatGPT: A User Study and Design Recommendations

Published: 11 May 2024 (CHI EA '24, Work in Progress)

Abstract

Generative artificial intelligence (GenAI) offers promising potential for advancing human-AI collaboration in qualitative research. However, existing work has focused on conventional machine-learning and pattern-based AI systems, and little is known about how researchers interact with GenAI in qualitative research. This work delves into researchers’ perceptions of their collaboration with GenAI, specifically ChatGPT. Through a user study involving ten qualitative researchers, we found ChatGPT to be a valuable collaborator for thematic analysis, enhancing coding efficiency, aiding initial data exploration, offering granular quantitative insights, and assisting comprehension for non-native speakers and non-experts. Yet, concerns persist about its trustworthiness and accuracy, reliability and consistency, limited contextual understanding, and broader acceptance within the research community. We contribute five actionable design recommendations to foster effective human-AI collaboration: incorporating transparent explanatory mechanisms, enhancing interface and integration capabilities, prioritising contextual understanding and customisation, embedding human-AI feedback loops and iterative functionality, and strengthening trust through validation mechanisms.


1 INTRODUCTION

The increasing integration of human-AI collaboration in qualitative research within the Human-Computer Interaction (HCI) community reflects a shift towards a more symbiotic relationship between humans and AI [2, 20, 24, 39]. This collaboration, particularly in thematic analysis, leverages AI to augment research efficiency and analytical depth [26]. However, traditional AI systems, although effective in certain aspects, have struggled with scalability and diversity in linguistic features [9, 18, 31]. Recent advancements in generative AI, such as ChatGPT and other large language models, have shown potential in processing and analyzing unstructured text with minimal human input [17, 27, 36, 38, 40]. These models, with their extensive pre-training, can identify complex data patterns, offering a new dimension to qualitative analysis. However, the integration of such AI into the iterative, human-centric thematic analysis process, and the optimization of this collaboration, remain underexplored [29, 32].

The contribution of our research includes five actionable design recommendations synthesised from the benefits and challenges that qualitative researchers experienced when collaborating with ChatGPT for thematic analysis. These recommendations can offer guidance to HCI scholars in designing and developing applications that optimise the process of human-AI collaboration in thematic analysis and fortify user trust and confidence in the technology.


2 RELATED WORK

Thematic analysis is a qualitative research method that involves uncovering and analyzing themes or patterns within qualitative data, such as interview transcripts and open-ended responses [8]. As conceptualized by Braun and Clarke [9], and depending on the epistemological assumptions, this multifaceted method can be undertaken deductively, where researchers analyze and interpret data through the lens of existing research and theory, or inductively, where the analysis is grounded in the data. Regardless of the approach, it is an iterative method: researchers need to go over the data multiple times to gain an in-depth understanding and to articulate, refine, and confirm the themes or patterns they have identified [7]. Consequently, thematic analysis is a time-demanding and skill-intensive task that may prove challenging to scale to a large corpus of qualitative data [11, 28].

Research on human-AI collaboration in thematic analysis has produced several prototype systems to support automated or semi-automated coding, including text classification, topic modelling, and annotation of qualitative data [18, 19, 26, 31]. These systems have leveraged a mix of techniques, ranging from machine learning for text classification [6, 19, 31, 37] to interactive topic modelling algorithms [3, 5, 21, 25, 26] and adaptive pattern-based rules [13, 18, 23, 28, 31]. Systems based on machine-learning models like logistic regression [31] and support vector machines [37] have demonstrated potential in automating large-scale text data annotation. Topic modelling algorithms, especially variations of Latent Dirichlet Allocation, have shown performance comparable to open coding in identifying underlying topics in qualitative data [3, 21, 26]. These techniques, however, raise concerns about the transparency of their decision-making, which may undermine the primary objective of thematic analysis: gaining an in-depth understanding of qualitative data [11, 28]. In contrast, adaptive pattern-based systems such as Cody [31] and PaTAT [18] adopted user-interpretable patterns to enhance user understanding and trust. Yet, their reliance on pattern-based rules might constrain their analysis of data with varied linguistic features [18]. These systems may also be unsustainable given the nuances in the linguistic and semantic features of different languages [15].
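To make the contrast concrete, the sketch below shows the kind of LDA-based topic exploration that these prototype systems build on. It is a minimal illustration using scikit-learn with toy utterances of our own, not the implementation of any cited system; note how the output is a bag of top words per topic, with no rationale attached, which is precisely the transparency concern raised above.

```python
# Minimal LDA sketch of the topic-modelling approach discussed above.
# Illustrative only: toy data, not any cited system's implementation.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

utterances = [
    "I really enjoyed the group discussion today",
    "The assignment deadline felt far too tight",
    "Working with my team made the task much easier",
]  # stand-in for interview utterances

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(utterances)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Top words per topic become candidate "codes" for a researcher to review,
# but the model offers no explanation of why they cluster together.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```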

GenAI has witnessed significant progress in recent years, providing a foundation for developing solutions capable of generating human-like content [32]. The core idea behind GenAI is to train models on vast amounts of data, enabling them to produce original content that closely mirrors human language patterns [29, 36]. Among the notable advancements in this field are large language models like ChatGPT, which have been trained on diverse textual data sources, allowing them to comprehend and generate a wide range of language structures [32, 38]. Beyond text generation, these models are also capable of understanding context, offering explanations, and assisting in tasks such as thematic analysis [1, 40]. The vast knowledge encoded in these models, coupled with their ability to generalize from training data, positions them as potential collaborative partners in qualitative research [16]. Their ability to capture nuanced language structures and provide contextually relevant content has the potential to support qualitative researchers in the labour-intensive process of thematic analysis [40]. However, while GenAI exhibits promise, it also raises questions about its suitability and reliability in qualitative research contexts [4, 30]. For instance, the reliability of the decisions made by these models and their trustworthiness remain topics of discussion [36], especially in thematic analysis, where understanding and transparency are paramount [9, 18]. Thus, the interplay between GenAI technologies and qualitative researchers requires further investigation, following a human-centred AI approach, to uncover collaborative potential and areas of concern [34, 35].

The current user study aims to address the aforementioned gaps by delving into qualitative researchers’ perceptions of their collaboration with GenAI, specifically ChatGPT, and highlighting actionable design recommendations to foster effective human-AI collaboration with this novel technology.


3 METHODS

We conducted a user study to investigate the experience of qualitative researchers when collaborating with ChatGPT to conduct thematic analysis, with a specific focus on uncovering the opportunities and challenges that they perceived during this human-AI collaboration process. We recruited participants through criterion-based sampling via a faculty mailing list and a personal mailing list. Participants were doctoral students or full-time academics with at least one year of experience in qualitative research who had conducted thematic analysis, either inductively or deductively, themselves. Three full-time academics (P1–P3) and seven doctoral students (P4–P10) with varying years of experience in thematic analysis (M = 4.0, SD = 3.3; four female) participated in this study. Participants reported having some experience with using ChatGPT for research purposes (M = 3.7, SD = 1.2), as indicated by their responses on a five-point Likert scale (where 1 signifies strongly disagree and 5 signifies strongly agree). Ethics approval was obtained from Monash University (Project ID: 38196).

The study design followed a contextual inquiry approach, emphasizing user-centeredness to gain insights into participants’ genuine experiences in their usual environments [22]. This method is effective in HCI research for understanding specific task-related user needs [31]. Each session, lasting about an hour, consisted of two parts. Thematic Analysis Exercise (20 mins): Participants first engaged in inductive thematic analysis on a short transcript of 15 utterances, following Braun and Clarke’s guidelines for an open and organic coding process without a pre-set framework [9]. This task aimed to familiarize participants with the transcript content so they could evaluate ChatGPT’s output in the subsequent task. Collaboration with ChatGPT (40 mins): Participants were introduced to ChatGPT (5 mins) and then worked on an extended transcript (30 utterances in total) using ChatGPT (GPT-4, Default mode). During this task, participants performed thematic analysis in collaboration with ChatGPT, employing the think-aloud method for in-situ evaluation of outputs (15 mins). A semi-structured interview (20 mins) followed, assessing their experience with ChatGPT in terms of accuracy, trustworthiness, and helpfulness for thematic analysis. Additionally, participants provided feedback on ChatGPT’s interface and functionality for thematic analysis needs. Sessions were conducted remotely via Zoom, with audio, screen visuals, and ChatGPT chat history captured.

Using Otter AI (otter.ai), we transcribed each session’s audio recording. The primary author conducted a reflexive thematic analysis [9] of these transcripts in conjunction with each participant’s ChatGPT chat history. This initial analysis was augmented by discussions with a co-author, facilitating the iterative refinement of emergent themes related to participants’ perceived opportunities and challenges when collaborating with ChatGPT to conduct thematic analysis. We commenced our analysis on an individual participant basis, extracting preliminary codes. Subsequently, we aggregated these preliminary codes into overarching themes. The goal of this analysis was to comprehend the nature of participants’ interactions with ChatGPT and inform design recommendations for future HCI studies.


4 FINDINGS AND DISCUSSION

4.1 Human-AI Collaboration Opportunities

4.1.1 Efficiency in Processing and Analysis.

The efficiency of ChatGPT in processing large datasets and distilling themes was a salient observation. Such proficiency offers a promising avenue for the swift analysis of unstructured qualitative data. P1 articulated: "ChatGPT provides an efficient method to identify key themes from qualitative data." This sentiment was echoed by P7, who observed: "For this task, I think, yeah, it definitely can save my time. Yeah, to identify emerging themes." P10 went a step further, emphasizing the technology’s superiority in certain aspects of analysis: "I think it’s good at doing this kind of summarization task. Even better than human." The advantages of ChatGPT become even more pronounced when handling voluminous datasets. P8 remarked: "Especially if I have a lot of data and I really don’t have time commitment or to analyze it by myself." Beyond its efficiency, ChatGPT’s capabilities in transforming unstructured datasets into structured formats were also recognized. P2 commented: "I was impressed at its ability to, at least for some of the data, format it in a way that I asked it to right with, like the table and stuff."

4.1.2 Facilitation of Initial Exploration.

Participants recognized the instrumental value of ChatGPT during the preliminary stages of both inductive and deductive analytical processes. They posited that ChatGPT could offer a foundational coding framework that would be particularly beneficial during the initial stages of code generation, exploration, and ideation. P2, for instance, highlighted the tool’s capacity for summarization and ideation by stating: "I have identified that ChatGPT is very useful to summarize general topics, so I will use it for inspiration, and to identify codes in the process." In the context of the inductive coding approach, P8 underscored ChatGPT’s potential for exploration: "So, for instance, we were talking about the inductive way of coding, I will ask ChatGPT, can you give me the list of the most prominent themes which are presented in the piece of text." Extending this, P8 also suggested that the themes generated by ChatGPT could set the stage for developing an initial coding framework for deductive analysis: "When we’re doing deductive analysis, it [ChatGPT] will give the list of themes, and if we ask ChatGPT, okay, can you try to find instances [utterance] of those themes. It [ChatGPT] also will work fine." This potential for initiating the coding process in deductive analysis was also evidenced by P9, who highlighted the potential for coding interview transcripts: "I will have multiple interview transcripts, and I will code maybe one or two transcripts, and then I will probably pass it to ChatGPT, and maybe it will learn the patterns and help me to code the transcripts."
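P9’s envisioned workflow maps naturally onto few-shot prompting. The sketch below illustrates one way it could look; the OpenAI Python SDK call, model name, prompt wording, and example codes are our assumptions for illustration, since participants in this study interacted with ChatGPT only through its web interface.

```python
# Sketch of the workflow P9 describes: human-coded examples seed the prompt,
# and the model is asked to code the remaining utterances. All names and
# wording here are illustrative assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

coded_examples = [
    ("I never know which button submits the form.", "navigation confusion"),
    ("The tutorial videos saved me hours.", "learning support"),
]
uncoded = ["I gave up because the page kept freezing."]

examples = "\n".join(f'Utterance: "{u}"\nCode: {c}' for u, c in coded_examples)
query = "\n".join(f'Utterance: "{u}"\nCode:' for u in uncoded)
prompt = (
    "You are assisting with thematic analysis. Apply codes consistent with "
    "these human-coded examples.\n\n" + examples + "\n\n" + query
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # suggestions for the researcher to verify
```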

4.1.3 Quantitative Insights and Detailed Metrics.

Beyond assisting with the initial exploration stages, participants appreciated ChatGPT’s ability to offer quantitative insights and detailed metrics such as coverage. For example, P1 highlighted its strength in providing a detailed breakdown of theme coverage through prompting: "I think that’s really useful getting all these metrics of coverage." Such metrics can help researchers understand the extent to which a particular theme or topic is represented within the dataset. P2 echoed this sentiment, emphasizing ChatGPT’s capacity to shed light on "the frequency or popularity of certain themes or topics." This capability can be instrumental in discerning recurrent themes, thereby paving the way for deeper qualitative investigations, such as identifying sub-themes. Meanwhile, P7 also saw the potential of ChatGPT in analyzing coded data: "But after coding, we still need to do some analysis, right? We are not only going to do some frequency tables. If it can help me with analyzing coded data to identify some patterns or some preliminary analysis, it will be good."
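A coverage metric of the kind P1 valued is straightforward to compute once utterances have been mapped to themes, which also makes it easy to verify against the model’s claims. The sketch below uses hypothetical theme assignments.

```python
# Sketch of a theme-coverage metric: the share of utterances each theme
# accounts for. The assignments here are hypothetical.
from collections import Counter

# utterance index -> themes assigned (an utterance may carry several themes)
assignments = {
    0: ["workload"], 1: ["workload", "collaboration"],
    2: ["collaboration"], 3: [], 4: ["workload"],
}

n_utterances = len(assignments)
counts = Counter(t for themes in assignments.values() for t in themes)

for theme, count in counts.most_common():
    print(f"{theme}: {count}/{n_utterances} utterances "
          f"({count / n_utterances:.0%} coverage)")
```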

4.1.4 Support Language Comprehension.

Participants also appreciated ChatGPT’s assistance in language comprehension, especially for non-native English speakers. For example, P4 reflected that: "Not only about time-wise, but also sometimes because, for example, for me, English is not my first language. Sometimes the interviewees may say something too casual [or] too verbal that I couldn’t understand, [but] I think ChatGPT could." P10 pointed out ChatGPT’s value in understanding language usage in complex or unfamiliar domains: "I think a human has some limitation of knowledge. For example, maybe I don’t understand when talking about politics, but ChatGPT can understand it better."

4.1.5 Discussion.

The findings on the efficiency and efficacy of ChatGPT in supporting the initial exploration of qualitative data resonated with prior works [18, 19, 31, 40], illustrating a collaborative synergy in which researchers and AI both contribute to the process, instead of the task being outsourced entirely to AI. The appreciation for quantitative insights and detailed metrics further strengthens such collaboration, as researchers can leverage these additional insights offered by AI to conduct further in-depth analysis. This finding also reinforces the need for human-centred AI, where the goal of well-designed AI systems should be to support humans instead of replacing them [34, 35]. Beyond these findings, we also identified a unique benefit of GenAI: supporting the language comprehension of non-native speakers and non-experts. This novel finding is absent from prior work on machine-learning and pattern-based AI systems [3, 18, 23, 31]. It further highlights the augmentation of human capabilities that human-AI collaboration enables, especially in skill-intensive tasks like thematic analysis, where experience and expertise are essential for ensuring quality results [7, 9].

4.2 Human-AI Collaboration Challenges

4.2.1 Trustworthiness and Accuracy Concerns.

Participants consistently expressed concerns about the trustworthiness and accuracy of ChatGPT’s outputs. They were sceptical of the results produced and often felt the need to verify them manually. P2 articulated this scepticism: "If I notice something is not very well performed, I will stop trusting the technology, and I will maybe have to go on my own and check manually." This sentiment was echoed by P4, who expressed a balanced stance: "I have mutual experience, a neutral perception on the extent I trust ChatGPT." Moreover, P6 underscored the indispensable role of human oversight: "So it needs to be checked by a human before it can be used for any research." Interestingly, while participants generally acknowledged ChatGPT’s proficiency in certain tasks like summarization, doubts emerged when it came to numerical data. As P5 observed: "It has a good side on the summarizing and like integrating codes like that. I trust that part. But about the coverage and like any number, basically the numbers, I do not trust it." A point of contention arose around the potential for ChatGPT to generate hallucinated results. P8 cautioned: "First of all is that, for instance, it might be inaccurate in the sense that the codes might be made up." Yet, P10 offered a more nuanced view, suggesting that while ChatGPT might occasionally deviate from accuracy in some contexts, thematic analysis might not be one of them: "When it hallucinates, it’s usually under the context when I ask for some specific literature, but in this task [thematic analysis] we give it [ChatGPT] the material and let it summarise, seems like an easier task." These varied perspectives accentuated P9’s call for validation work: "So one thing, we need some literature to show the ChatGPT can actually be used for thematic analysis."

4.2.2 Reliability and Consistency Issues.

Beyond trust in the technology itself, the challenge of obtaining consistent results from ChatGPT was another common concern. For example, P3 expressed concerns about the temporal consistency of the system: "I’m wondering how consistent ChatGPT is over time, you know, with the same stuff." This sentiment hints at potential intra-reliability issues, as P3 further explained: "It [ChatGPT] might need some refinement with itself. There are intra-reliability problems that might be happening." This concern was echoed by P6, who observed variances in the system’s outputs: "I know sometimes the answers can be quite random and inconsistent, no matter how you prompt the request." Intriguingly, P6 also noted potential differences in results even when using identical prompts: "Yeah, sometimes with the same prompt. If you use it for the first time, you get great results. But when you use it for even the same transcript, it can produce different results." Highlighting the broader implications of this inconsistency, P8 underscored users’ inclination towards more deterministic systems: "You’re more inclined to serve the trust to the software, which is more deterministic."

4.2.3 Limited Data Capacity and Contextual Understanding.

Several participants pointed out ChatGPT’s limitations in terms of data processing capacity and its understanding of context. P3 mentioned: "It [ChatGPT] has some limitations to the amount of data you can give it." P9 resonated with this concern: "From my previous experience with ChatGPT, it has a problem with the length of the token [input text]." This capacity issue may limit the length of the transcript that ChatGPT can analyze, as P9 elucidated: "The transcripts for thematic analysis [are] usually very long. And in that case, we can only input one session to make it [ChatGPT] learn." This segmented input could compromise the quality of outcomes, as P9 noted: "If it [transcripts] cannot be inputted as one message, from my previous experience, the learning efficiency [of ChatGPT] will be harmed." Another concern is ChatGPT’s ability to infer deeper contexts. P4 explained: "Sometimes ChatGPT only provides information based on the text, and sometimes the text may not be able to 100% reflect on the key things that the [participant] wants to express." Drawing from personal experience in domain-specific research, P4 underscored the necessity of a contextualized understanding, especially in thematic analysis. Coming from the context of law education, P4 elaborated: "I found that if you had experience learning or teaching experience in law, you would have a better understanding about what the [study participants] want to express." Meanwhile, P7 highlighted ChatGPT’s lack of academic focus: "I think it only can identify some emerging themes based on the transcripts, but it’s not really academic oriented." The imperative for deep comprehension of existing academic literature was also highlighted by P7: "I think it [ChatGPT] needs to understand the literature. So I probably need to upload my literature review or at least what articles that I used to create the coding scheme?" Without such an enriched contextual foundation, there could be a misalignment between humans and AI. As P10 summarized, "I feel like it’s about the weight of importance that ChatGPT has maybe misaligned with researchers."
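The capacity issue P3 and P9 describe can be made precise by counting tokens before submission. The sketch below, assuming OpenAI’s tiktoken tokenizer and an arbitrary headroom figure, groups whole utterances into chunks that fit the context window; as P9 notes, this segmentation is exactly what risks harming the quality of the analysis.

```python
# Sketch of checking a transcript against a context limit and chunking it by
# utterance. The 8192-token figure matches the GPT-4 limit cited in this
# paper; the headroom reserved for prompt and reply is an assumption.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
MAX_TOKENS = 8192 - 1024  # reserve headroom for the prompt and the reply

def chunk_transcript(utterances: list[str]) -> list[list[str]]:
    """Group whole utterances into chunks that fit the token budget."""
    chunks, current, used = [], [], 0
    for utt in utterances:
        n = len(enc.encode(utt))
        if current and used + n > MAX_TOKENS:
            chunks.append(current)
            current, used = [], 0
        current.append(utt)
        used += n
    if current:
        chunks.append(current)
    return chunks
```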

4.2.4 Interface and Integration Challenges.

Participants also discussed the difficulties and inefficiencies in interacting with the current interface of ChatGPT for thematic analysis. P3 highlighted the suboptimal experience: "And obviously, it’s not ideal to be interacting with it in the ChatGPT window." Expanding on this, P3 underscored the system’s inefficiencies when adopting an iterative methodology: "I will probably do [the analysis using ChatGPT] like 2 or 3 more times, but at least in the way that the interface is set up, it’s not conducive to me doing it quickly and easily." On the same note, P9 emphasized the need for better integration and formatting tools: "Maybe we can also format the input as a table; maybe we can also import the transcription table directly as an input to the AI."

4.2.5 Acceptance within the Research Community.

Participants expressed concerns about the adoption and disclosure of human-AI collaboration in qualitative research. A predominant challenge arises from potential scrutiny by reviewers regarding the tool’s reliability and the validation process. This is coupled with the broader acceptance of such technologies within the scientific community. P6 highlighted the dilemma: "We use ChatGPT. We want to disclose it. However, we can imagine there could be challenges from reviewers about reliability and accuracy. And how do you validate the process?" This participant further commented on the societal perceptions around the use of GenAI technologies like ChatGPT, especially in academic contexts. As P6 explained: "I think when I publish something, I’ll tell everyone. Okay. I used ChatGPT to polish my articles. That just sounds sort of weird." P1 echoed the sentiment of uncertainty but also expressed optimism about the future adoption of these technologies in research: "But if it is about the use of AI in research, I think that’s a big question, and we don’t know what needs to be done at this point." P1 further articulated hope for a future where the use of GenAI is normalized: "I guess it will be more accepted as well, and it will be just another method being used like any other methods. So it’s not about disclosing if you’re using AI or not, but just simply describing the method and the tool that was used." However, not all views were optimistic. P9 highlighted potential biases within the reviewer community: "I know some reviewers actually, very hate the AI in doing stuff."

4.2.6 Discussion.

The concerns regarding the trustworthiness, accuracy, reliability, and consistency of AI-generated results resonated with previous findings [18, 21, 26, 31], indicating the need to include trust-assuring mechanisms when designing human-AI collaboration. In contrast, limited data capacity and contextual understanding appear to be challenges specific to GenAI. While data capacity issues might be resolvable through model advancements and algorithmic innovations [12], contextual understanding can only be achieved through effective human-AI collaboration, where humans actively provide and refine the AI’s contextual knowledge during an iterative cycle [7]. Such a synergy would require designing an interface that prioritises continued and mutual communication between humans and AI [33, 35]. Lastly, an ethical dilemma exists between adopting a human-AI collaborative approach and adequately reporting such usage in academic publications. While this issue remains an ongoing debate, there is an increasing consensus on the need for transparency, disclosure, and clear guidelines to ensure that AI contributions are acknowledged without undermining the authenticity and integrity of the research [14, 36]. Beyond guidelines, in the context of thematic analysis, empirical evidence is urgently needed to evaluate the validity of AI-generated results.

4.3 Design Recommendations

Building on our findings, we have synthesised five actionable design recommendations to support future HCI studies in designing and developing AI solutions that streamline the process of human-AI collaboration in thematic analysis while also ensuring user trust and confidence in the technology.

4.3.1 Incorporate Transparent Explanatory Mechanisms.

To increase user confidence and understanding of the technology, it is crucial to design human-AI collaboration solutions with transparent explanatory mechanisms. The AI system should offer detailed rationales behind the generation of specific themes. For instance, by providing relevant data metrics (P2: "it would be good to see the actual coverage and see it everywhere") or highlighting illustrative examples for each theme (P5: "having quotes... yeah, like for each line, this is the code. And this is the quote"), users can discern the AI’s decision-making process. Incorporating features that allow the AI to show its "thought process" or its basis for conclusions can demystify its operations, ensure that generated themes align with data nuances, and potentially improve users’ trust in the system.
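One way to operationalise this recommendation is to require every generated theme to arrive with its rationale, supporting quotes, and coverage, so the researcher can audit it against the transcript. The schema below is our illustrative assumption, not a prescribed format.

```python
# A possible structured output for a transparent theme: rationale, verbatim
# quotes, and coverage travel with the theme itself. The schema and the
# example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ExplainedTheme:
    name: str
    rationale: str       # why the model proposes this theme
    quotes: list[str]    # verbatim utterances supporting it
    coverage: float      # share of utterances the theme accounts for

theme = ExplainedTheme(
    name="workload pressure",
    rationale="Repeated references to deadlines and insufficient time.",
    quotes=["The assignment deadline felt far too tight."],
    coverage=0.4,
)
```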

4.3.2 Enhance Interface and Integration Capabilities.

A human-centred interface redesign is critical. The user interface should be intuitive, allowing for easy navigation and operations such as data uploads and theme suggestions (P3: "Imagine a step like, I upload my data, and then I click a button and to say like, what do you think the themes are in this?"). Integration features, like spreadsheet compatibility (P1: "Maybe a way to submit this spreadsheet... and then the output to be the same spreadsheet"), can streamline the thematic analysis process, making it more efficient for users. Moreover, considering compatibility with other research tools and platforms can elevate the user experience (P6: "I wish it could be an all-in-one system that, for example, after this interview... can automatically summarize what we said on Zoom."), ensuring that human-AI collaboration seamlessly fits into the researcher’s workflow.
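The spreadsheet round-trip P1 asks for could be as simple as the sketch below; the file and column names are hypothetical, and code_utterance() is a placeholder for whatever model call supplies the suggestions.

```python
# Sketch of P1's spreadsheet round-trip: read utterances from a sheet,
# attach AI-suggested codes, and write the same sheet back out. File name,
# column names, and code_utterance() are hypothetical.
import pandas as pd

def code_utterance(text: str) -> str:
    """Placeholder for a model call that returns a suggested code."""
    return "TODO"

df = pd.read_excel("transcript.xlsx")  # expects an "utterance" column
df["suggested_code"] = df["utterance"].apply(code_utterance)
df.to_excel("transcript_coded.xlsx", index=False)
```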

4.3.3 Prioritize Contextual Understanding and Customization.

The design of human-AI collaboration solutions should prioritize a deep understanding of the data’s unique context. Features that allow users to provide contextual cues or specific research backgrounds can enhance the AI’s sensitivity to nuances (P4: "I prefer to have some kind of tools that I can provide more contextual information to when generating the automatic analysis. Now it’s just a transcript."). Furthermore, customization options, where users can adjust parameters or guide the AI’s theme generation based on their specific needs (P5: "I have the next step, which is to introduce some of my understanding or my knowledge."), can ensure that the generated themes are rooted in the data’s specific context.
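In a chat-based workflow, the contextual cues P4 asks for can be supplied alongside the transcript, for example as a system message; the wording and the law-education framing below are illustrative assumptions.

```python
# Sketch of supplying research context with the transcript, per P4's request.
# The system-message wording is an illustrative assumption; these messages
# would feed the same chat-completion call as the earlier sketches.
transcript = open("transcript.txt").read()

messages = [
    {"role": "system", "content": (
        "Context: interviews about teaching and learning in law education. "
        "Interpret domain terms and casual phrasing in that setting, and "
        "weight themes by their relevance to the research focus."
    )},
    {"role": "user", "content": "Identify themes in this transcript:\n" + transcript},
]
```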

4.3.4 Embed Feedback Loops and Iterative Functionality.

A dynamic and collaborative relationship between the user and AI can be fostered by embedding feedback loops within the system. By allowing users to provide feedback on generated themes, and then having the AI adjust its outputs accordingly in real time, the tool can evolve and refine its understanding (P2: "If I have identified something which is not correct from the tool. If I have some chance to provide additional feedback to the tool to recalculate and to improve."). Such iterative functionality underscores the idea that thematic analysis is a continuous process (P3: "Obviously, this could be an iterative process."), which can benefit immensely from ongoing human-AI collaboration.
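A minimal version of such a loop, sketched below under the same illustrative assumptions as the earlier snippets (OpenAI SDK, model name, prompt wording), simply appends each researcher correction to the running conversation and asks the model to revise.

```python
# Sketch of the feedback loop described above: the researcher's corrections
# are appended to the conversation and the model revises its themes.
from openai import OpenAI

client = OpenAI()
transcript = open("transcript.txt").read()
messages = [{"role": "user",
             "content": "Identify themes in this transcript:\n" + transcript}]

for _ in range(3):  # a few refinement rounds
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    themes = reply.choices[0].message.content
    print(themes)
    messages.append({"role": "assistant", "content": themes})
    feedback = input("Corrections (blank to accept): ")
    if not feedback:
        break
    messages.append({"role": "user",
                     "content": "Revise the themes given this feedback: " + feedback})
```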

4.3.5 Strengthen Trust through Validation Mechanisms.

Building user trust is foundational for effective human-AI collaboration. Designing AI systems with robust validation mechanisms can address concerns regarding their reliability and accuracy. While the aforementioned transparent explanatory mechanisms, such as coverage metrics, can enable such validation, more rigorous and empirically oriented mechanisms are required. For example, features that allow side-by-side comparisons of AI-generated themes with human analysis, or that offer manual coding checks, can enhance user confidence (P10: "I need to compare the results generated by ChatGPT and human."). Such mechanisms could also contribute the empirical evidence required to convey the validity and reliability of this human-AI collaborative approach to the academic community and external reviewers.
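Side-by-side comparison can also be quantified: treating the AI as a second coder allows standard inter-rater agreement statistics to serve as one such validation mechanism. The sketch below uses scikit-learn’s Cohen’s kappa with hypothetical labels.

```python
# Sketch of one validation mechanism: compute inter-rater agreement between
# human and AI codes for the same utterances. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

human   = ["workload", "collaboration", "workload", "other", "collaboration"]
chatgpt = ["workload", "collaboration", "other",    "other", "collaboration"]

kappa = cohen_kappa_score(human, chatgpt)
print(f"Cohen's kappa (human vs. ChatGPT): {kappa:.2f}")
```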


5 LIMITATIONS AND FUTURE WORKS

This study has several limitations. First, ChatGPT’s interface is not optimized for thematic analysis, yet interactions with this GenAI provide insights for future interface design. Second, the interview transcript used has 30 utterances (1,685 tokens) to avoid context length issues (e.g., GPT-4’s 8,192-token limit), and thus may not reflect the challenges of larger datasets; only two participants (P3, P9) with prior ChatGPT experience highlighted this issue. Third, the absence of pre-defined research questions may have limited insights into the challenges of deductive thematic analysis. Lastly, we identified opportunities and challenges but did not explore the specific processes researchers used during thematic analysis. Future work should design a system informed by this study’s findings to support thematic analysis and human-AI collaboration. Evaluating such a system against Cody [31] and PaTAT [18] could reveal GenAI’s advantages over prior systems. System architectures that address context length limitations, such as chunking strategies [10], also require further investigation into their efficacy.


6 CONCLUSION

This study investigated the collaborative experience of qualitative researchers using the leading GenAI application, ChatGPT, for thematic analysis. Our findings indicate that researchers value ChatGPT’s potential to enhance efficiency and deepen qualitative data comprehension. However, concerns exist regarding trust assurance mechanisms and the practicality of the current interface for interactive analysis. Drawing from these insights, we synthesized five design recommendations that aim to inform the development of future GenAI systems that not only facilitate seamless human-AI collaboration in qualitative research but also enhance user trust and confidence in the technology.


ACKNOWLEDGMENTS

Research of Roberto Martinez Maldonado, Lixiang Yan and Dragan Gasevic is partly funded by the Digital Health Cooperative Research Centre. Research of Roberto Martinez Maldonado, Zachari Swiecki, Lixiang Yan, Vanessa Echeverria, Linxuan Zhao and Dragan Gasevic is partly funded by the Australian Research Council (DP210100060 and DP240100069). Research of Zachari Swiecki, Dragan Gasevic, Gloria Fernandez Nieto, and Lixiang Yan is partly funded by the Defense Advanced Research Projects Agency (DARPA) under agreement number HR0011-22-2-0047. The U.S. Government is authorised to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Supplemental Material

Talk video: 3613905.3650732-talk-video.mp4 (mp4, 10.8 MB)

References

1. Ishari Amarasinghe, Francielle Marques, Ariel Ortiz-Beltrán, and Davinia Hernández-Leo. 2023. Generative Pre-trained Transformers for Coding Text Data? An Analysis with Classroom Orchestration Data. In European Conference on Technology Enhanced Learning. Springer, 32–43. https://doi.org/10.1007/978-3-031-42682-7_3
2. Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300233
3. Aneesha Bakharia, Peter Bruza, Jim Watters, Bhuva Narayan, and Laurianne Sitbon. 2016. Interactive Topic Modeling for Aiding Qualitative Content Analysis. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval (Carrboro, North Carolina, USA) (CHIIR '16). Association for Computing Machinery, New York, NY, USA, 213–222. https://doi.org/10.1145/2854946.2854960
4. Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023 (2023). https://doi.org/10.48550/arXiv.2302.04023
5. Eric P. S. Baumer, David Mimno, Shion Guha, Emily Quan, and Geri K. Gay. 2017. Comparing Grounded Theory and Topic Modeling: Extreme Divergence or Unlikely Convergence? J. Assoc. Inf. Sci. Technol. 68, 6 (June 2017), 1397–1410. https://doi.org/10.1002/asi.23786
6. Alan Blackwell, L. Church, M. Jones, R. Jones, Matthew Mahmoudi, M. Marasoiu, S. Makins, D. Nauck, K. Prince, A. Semrov, et al. 2018. Computer says 'don't know': Interacting visually with incomplete AI models. In Workshop on Designing Technologies to Support Human Problem Solving (VL/HCC). 5–14.
7. Richard E. Boyatzis. 1998. Transforming Qualitative Information: Thematic Analysis and Code Development. Sage.
8. Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101.
9. Virginia Braun and Victoria Clarke. 2021. One size fits all? What counts as quality practice in (reflexive) thematic analysis? Qualitative Research in Psychology 18, 3 (2021), 328–352. https://doi.org/10.1080/14780887.2020.1769238
10. Huan-Yuan Chen and Hong Yu. 2023. Intent-Based Web Page Summarization with Structure-Aware Chunking and Generative Language Models. In Companion Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW '23 Companion). Association for Computing Machinery, New York, NY, USA, 310–313. https://doi.org/10.1145/3543873.3587372
11. Nan-Chen Chen, Margaret Drouhard, Rafal Kocielnik, Jina Suh, and Cecilia R. Aragon. 2018. Using Machine Learning to Support Qualitative Coding in Social Science: Shifting the Focus to Ambiguity. ACM Trans. Interact. Intell. Syst. 8, 2, Article 9 (June 2018), 20 pages. https://doi.org/10.1145/3185515
12. Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595 (2023). https://doi.org/10.48550/arXiv.2306.15595
13. Kevin Crowston, Xiaozhong Liu, and Eileen E. Allen. 2010. Machine learning and rule-based automated coding of qualitative data. Proceedings of the American Society for Information Science and Technology 47, 1 (2010), 1–2. https://doi.org/10.1002/meet.14504701328
14. Yogesh K. Dwivedi, Nir Kshetri, Laurie Hughes, Emma Louise Slade, Anand Jeyaraj, Arpan Kumar Kar, Abdullah M. Baabdullah, Alex Koohang, Vishnupriya Raghavan, Manju Ahuja, et al. 2023. "So what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management 71 (2023), 102642. https://doi.org/10.1016/j.ijinfomgt.2023.102642
15. Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kadras, Sylvain Gugger, and Jeremy Howard. 2019. MultiFiT: Efficient Multi-lingual Language Model Fine-tuning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5702–5707. https://doi.org/10.18653/v1/D19-1572
16. Jie Gao, Kenny Tsu Wei Choo, Junming Cao, Roy Ka-Wei Lee, and Simon Perrault. 2023. CoAIcoder: Examining the Effectiveness of AI-Assisted Human-to-Human Collaboration in Qualitative Analysis. ACM Trans. Comput.-Hum. Interact. (Aug 2023). Just Accepted. https://doi.org/10.1145/3617362
17. Jie Gao, Yuchen Guo, Gionnieve Lim, Tianqin Zhan, Zheng Zhang, Toby Jia-Jun Li, and Simon Tangi Perrault. 2023. CollabCoder: A GPT-Powered Workflow for Collaborative Qualitative Analysis. arXiv preprint arXiv:2304.07366 (2023). https://doi.org/10.48550/arXiv.2304.07366
18. Simret Araya Gebreegziabher, Zheng Zhang, Xiaohang Tang, Yihao Meng, Elena L. Glassman, and Toby Jia-Jun Li. 2023. PaTAT: Human-AI Collaborative Qualitative Coding with Explainable Interactive Rule Synthesis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 362, 19 pages. https://doi.org/10.1145/3544548.3581352
19. Mohamed Goudjil, Mouloud Koudil, Mouldi Bedda, and Noureddine Ghoggali. 2018. A novel active learning method using SVM for text classification. International Journal of Automation and Computing 15 (2018), 290–298. https://doi.org/10.1007/s11633-015-0912-z
20. Richard H. R. Harper. 2019. The Role of HCI in the Age of AI. International Journal of Human–Computer Interaction 35, 15 (2019), 1331–1344. https://doi.org/10.1080/10447318.2019.1631527
21. Matt-Heun Hong, Lauren A. Marsh, Jessica L. Feuston, Janet Ruppert, Jed R. Brubaker, and Danielle Albers Szafir. 2022. Scholastic: Graphical Human-AI Collaboration for Inductive and Interpretive Text Analysis. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (Bend, OR, USA) (UIST '22). Association for Computing Machinery, New York, NY, USA, Article 30, 12 pages. https://doi.org/10.1145/3526113.3545681
22. Karen Holtzblatt and Sandra Jones. 2017. Contextual inquiry: A participatory technique for system design. In Participatory Design. CRC Press, 177–210. https://www.taylorfrancis.com/chapters/edit/10.1201/9780203744338-9/contextual-inquiry-participatory-technique-system-design-holtzblatt-karen-jones-sandra
23. Andreas Kaufmann, Ann Barcomb, and Dirk Riehle. 2020. Supporting Interview Analysis with Autocoding. In HICSS. 1–10. https://hdl.handle.net/10125/63833
24. Vivian Lai, Samuel Carton, Rajat Bhatnagar, Q. Vera Liao, Yunfeng Zhang, and Chenhao Tan. 2022. Human-AI Collaboration via Conditional Delegation: A Case Study of Content Moderation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI '22). Association for Computing Machinery, New York, NY, USA, Article 54, 18 pages. https://doi.org/10.1145/3491102.3501999
25. William Leeson, Adam Resnick, Daniel Alexander, and John Rovers. 2019. Natural language processing (NLP) in qualitative public health research: A proof of concept study. International Journal of Qualitative Methods 18 (2019), 1609406919887021. https://doi.org/10.1177/1609406919887021
26. Robert P. Lennon, Robbie Fraleigh, Lauren J. Van Scoy, Aparna Keshaviah, Xindi C. Hu, Bethany L. Snyder, Erin L. Miller, William A. Calo, Aleksandra E. Zgierska, and Christopher Griffin. 2021. Developing and testing an automated qualitative assistant (AQUA) to support qualitative analysis. Family Medicine and Community Health 9, Suppl 1 (2021). https://doi.org/10.1136/fmch-2021-001287
27. Yuheng Li, Lele Sha, Lixiang Yan, Jionghao Lin, Mladen Raković, Kirsten Galbraith, Kayley Lyons, Dragan Gašević, and Guanliang Chen. 2023. Can large language models write reflectively. Computers and Education: Artificial Intelligence 4 (2023), 100140. https://doi.org/10.1016/j.caeai.2023.100140
28. Megh Marathe and Kentaro Toyama. 2018. Semi-Automated Coding for Qualitative Research: A User-Centered Inquiry and Initial Prototypes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI '18). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3173922
29. John V. Pavlik. 2023. Collaborating with ChatGPT: Considering the implications of generative artificial intelligence for journalism and media education. Journalism & Mass Communication Educator 78, 1 (2023), 84–93. https://doi.org/10.1177/10776958221149577
30. Michael V. Reiss. 2023. Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085 (2023). https://doi.org/10.48550/arXiv.2304.11085
31. Tim Rietz and Alexander Maedche. 2021. Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 394, 14 pages. https://doi.org/10.1145/3411764.3445591
32. Malik Sallam. 2023. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare 11, 6 (2023). https://doi.org/10.3390/healthcare11060887
33. Albrecht Schmidt. 2020. Interactive Human Centered Artificial Intelligence: A Definition and Research Challenges. In Proceedings of the International Conference on Advanced Visual Interfaces (Salerno, Italy) (AVI '20). Association for Computing Machinery, New York, NY, USA, Article 3, 4 pages. https://doi.org/10.1145/3399715.3400873
34. Ben Shneiderman. 2020. Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-Centered AI Systems. ACM Trans. Interact. Intell. Syst. 10, 4, Article 26 (Oct 2020), 31 pages. https://doi.org/10.1145/3419764
35. Ben Shneiderman. 2020. Human-centered artificial intelligence: Three fresh ideas. AIS Transactions on Human-Computer Interaction 12, 3 (2020), 109–124. https://doi.org/10.1080/10447318.2020.1741118
36. Eva A. M. van Dis, Johan Bollen, Willem Zuidema, Robert van Rooij, and Claudi L. Bockting. 2023. ChatGPT: five priorities for research. Nature 614, 7947 (2023), 224–226. https://doi.org/10.1038/d41586-023-00288-7
37. Jasy Liew Suet Yan, Nancy McCracken, and Kevin Crowston. 2014. Semi-automatic content analysis of qualitative data. iConference 2014 Proceedings (2014). https://doi.org/10.9776/14399
38. Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. 2024. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology 55, 1 (2024), 90–112.
39. Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020. Re-Examining Whether, Why, and How Human-AI Interaction Is Uniquely Difficult to Design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376301
40. Andres Felipe Zambrano, Xiner Liu, Amanda Barany, Ryan Shaun Baker, Juhan Kim, and Nidhi Nasiar. 2023. From nCoder to ChatGPT: From Automated Coding to Refining Human Coding. (2023). https://doi.org/10.35542/osf.io/grmzh
