Sentiment overflow in the testing stack: Analyzing software testing posts on Stack Overflow

Software testing is an integral part of modern software engineering practice. Past research has not only underlined its significance, but also revealed its multi-faceted nature. The practice of software testing and its adoption are influenced by many factors that go beyond tools or technology. This paper sets out to investigate the context of software testing from the practitioners’ point of view by mining and analyzing sentimental posts on the widely used question and answer website Stack Overflow. By qualitatively analyzing sentimental expressions of practitioners, which we extract from the Stack Overflow dataset using sentiment analysis tools, we discern factors that help us to better understand the lived experience of software engineers with regard to software testing. Grounded in the data that we have analyzed, we argue that sentiments like insecurity, despair and aspiration have an impact on practitioners’ attitude towards testing. We suggest that they are connected to concrete factors like the level of complexity of projects in which software testing is practiced. Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board. © 2023 The Author


Introduction
We have known for over 40 years that software testing is one of the most pragmatic mechanisms by which we can ensure the quality of the software artefacts that we engineer [15,23,44,60]. In light of the unquestionably growing impact that software and software-supported devices have on our daily lives, the role of software testing becomes ever more important. Just consider the year 2017, which has been dubbed "The Year That Software Bugs Ate the World" because of the astonishing software failures that cost the economy $1.7 trillion in that year alone [39]. Crucially, Ko et al. [32] report on software failures that can be directly linked to the loss of 1,500 human lives. However, to this day there is a schism between widespread recommendations for software engineering practice and our knowledge of how software testing actually happens. The urgency to resolve this conflict was also signalled by others with a call to arms to better understand the testing process [13,38].
Email addresses: m.swillus@tudelft.nl (Mark Swillus), a.e.zaidman@tudelft.nl (Andy Zaidman)
We have recently seen studies emerge that observe how software developers test. Beller et al. [12] investigated when and how developers write test cases in their Integrated Development Environment. They observed that around 50% of the studied projects do not employ automated testing methods at all. They also found that in almost all cases testing happens far less frequently than developers estimate. If testing is truly considered a last line of defense against software defects, we need to understand why developers do or do not engineer and execute test cases.
We have already seen glimpses of this in the literature. Studies have shown that company culture or time pressure leads to cognitive biases during testing [4,14,52], that estimations of the time it takes to write tests are often inaccurate [12,30], that availability of documentation shapes the development of tests [3], and that the cost/benefit of testing is often unclear [11]. Additionally, Kasurinen et al. [30], Runeson [49], and Daka and Fraser [17] highlight issues with motivating developers to test software: only half of them have positive feelings about testing, and approachability of tools is a major factor. Like Prado and Vincenzi [48], who studied the perspective of developers during the review process of unit tests to build tools that encourage testing, we put the human at the center of attention. This paper sets out to investigate the circumstances that influence software engineers when engineering tests, going beyond technical aspects of the discipline. Like Sharp et al. [55], we believe that in order to improve the discipline, it is essential to understand the socio-technical world in which software engineering is practiced. Software development practices which form social circumstances, like pair programming, are for example very likely to have an impact on testing. To gain a broad overview of what these circumstances are, we take negative and positive sentiments on the process of automated testing as a proxy. To gather documents which describe the experience of software developers from their point of view, we mine the most popular question and answer platform for software engineers, namely Stack Overflow [7].

RQ1
How do software engineers express sentiment about testing on Stack Overflow?
On the Q&A platform Stack Overflow, on which social interaction plays a key role, practitioners ask questions about software development which are answered by a global community of software developers [43]. Others have used the Stack Overflow dataset to investigate technical and non-technical aspects of software engineering. Lopez et al. [37], for example, have analyzed security questions on Stack Overflow and provide an overview of the most discussed topics, but also discuss the way in which authors discuss security questions. Our goal is to identify factors that affect practitioners and influence adoption of, or attitude towards, testing. We hypothesize that an analysis of sentimental content on Stack Overflow reveals not only technical factors that can influence adoption or attitudes, but also descriptions of the social or human context of practice. To identify those socio-technical factors, we examine 200 testing-related questions on Stack Overflow deeply instead of analyzing the whole dataset quantitatively. We not only scrutinize the question asked by the practitioner, but also incorporate the answers to the question, comments and the edit history of questions into our analysis. Going beyond an analysis of questions about technical issues, we focus on the broader context that causes sentiment in practitioners. For the remainder of this paper, we therefore use the term post instead of question to refer to the documents we analyzed.

RQ2
Which factors affect sentiment of software engineers towards testing practices?
From research done by other authors we know that only a small fraction of posts on Stack Overflow contains strong opinions and emotional statements as they mostly discuss how to use a piece of technology [34,53]. This motivates us to create an emotionally rich subset by filtering the dataset using a semi-automated approach that employs sentiment analysis tools.
To answer both research questions we apply strategies of Hoda's basic stage for socio-technical grounded theory (STGT) [25] with a constructivist stance as suggested by Charmaz [16]. STGT provides us with a framework to venture into a broad analysis of testing practice, seen not only as a technical phenomenon, but as a phenomenon in which social factors play an essential role. We focus our analysis on the socio-technical dimension of posts on Stack Overflow and show that such an analysis indeed reveals descriptions of social aspects. Our analysis informs about issues that contribute to problems and attitudes towards software testing. More concretely, we analyze the dataset, which consists of 200 posts, using initial and focused coding and techniques for systematic comparison of posts, codes and memos like diagramming and clustering. Concluding this paper with a presentation of preliminary categories and a preliminary interpretive theory, we motivate consecutive targeted data collection (theoretical sampling) to extend, test and develop our analysis and conclusions. Grounded in the data we analyze, this paper makes the following contributions:
• We discuss preliminary hypotheses which explore stimuli and inhibitors to testing at a socio-technical level
• We present a computer-aided approach for qualitative analysis of sentimental expressions in big datasets
• We motivate a research agenda that includes concrete ideas for targeted data collection (theoretical sampling) to develop a mature theory of stimuli and inhibitors of software testing that go beyond tools and technology

Background

Sentiment Analysis
Sentiment analysis is the computational study of opinions, sentiments and emotions expressed in text. It essentially tries to infer people's sentiments based on their language expressions. Sentiment classification is a widely studied research topic of sentiment analysis that focuses on the classification of opinionated documents as expressing positive or negative opinion [26]. Automatic classification of sentiment has been applied in various fields of research over the past 20 years as vast amounts of written text about various topics have become available through the internet. As early as 1999, Wiebe et al. [58] worked on a dataset for automatic classification of news articles to identify whether information is being presented as fact or opinion. While sentiment analysis is still being used to analyse media platforms like those of news agencies [5,46], its application today also includes platforms on which a wide variety of people contribute content, such as social media or internet forums. Here sentiment analysis has been used recently to identify personal attacks or obscene behavior of users [50].
Techniques for sentiment analysis have also been applied in the context of software engineering. Mäntylä et al. [45] analysed sentiment in comments on the Jira issue tracker to detect burnout among software developers. They calculated sentiment scores for each sentence using a dictionary that contains ratings for the affective meaning of 13,915 English words. Despite their positive results, they also raised the issue, echoed by others [29], that general-purpose sentiment analysis tools lack precision when applied to the domain of software engineering. Lin et al. [34] even question the validity of all quantitative studies in software engineering that are based on sentiment analysis tools, as they demonstrate how hard it is to reproduce results. For example, they judge that there is still a long way to go before researchers and practitioners can use state-of-the-art sentiment analysis tools to identify the sentiment expressed in Stack Overflow discussions. Motivated by this wave of criticism, others then tried to develop tools tailored to the domain of software engineering. Islam et al., for example, developed the dictionary-based tool DEVA [28] and a machine-learning-based tool called MarValous that focuses on emotion detection [27]. In that same period the SentiStrength tool, which already existed as a general-purpose tool for sentiment classification, was tweaked for an application in the domain of software engineering by Ahmed et al. [1], who created the tool SentiCR. Finally, Zhang et al. addressed the issue again in 2020, comparing the accuracy of this new generation of tailor-made sentiment analysis tools for software engineering with the accuracy that deep neural network architectures, namely transformer models, achieve [62]. They suggest that transformer models like RoBERTa are indeed one big step forward on the long way towards reliable results in sentiment classification for software engineering [35].

Grounded Theory
Grounded theory (GT) is an analytic approach used to construct ethnographic knowledge [18]. Its framework is made up of data-gathering techniques and strategies to analyse data. What distinguishes GT from other approaches is its iterative nature. While theory development progresses, the GT approach alternates between data collection and analysis to sustain a high level of involvement with the data [16].
GT was suggested as an approach for qualitative research by Glaser and Strauss [21] and has been reinterpreted by different scholars, resulting in the development of different flavours of GT. Flavours of GT differ in details on how to execute techniques and how tightly strategies need to be followed. Crucially, they also rest on different epistemological stances. Where the original Glaserian GT takes an objective, positivist stance, Constructivist GT proposed by Charmaz, for example, acknowledges the researchers' subjective perspective. Constructivist GT moves away from positivism, incorporating the beliefs and preconceptions of the researcher into analysis. Situating the GT approach in the field of software engineering research, Hoda has recently proposed another flavour of GT. She designed Socio-technical GT (STGT) to ease application of GT in her field, where researchers often struggle to understand and apply it [25]. With STGT, Hoda proposes to divide GT into distinct phases. Embracing the iterative nature of GT, STGT encourages exploration in a Basic Stage and helps the socio-technical researcher to transition into an Advanced Stage of theory development. The separation into those two stages, which are accompanied by lean and focused literature reviews, helps the socio-technical researcher to cover epistemological blind spots. All flavours of GT use comparative methods (e.g., clustering, diagramming) and analytical methods (e.g., coding, memo writing) that are accompanied by a continuous collection of new data samples (theoretical sampling) to saturate emerging categories that describe data and to enable the development of mature theories which transparently emerge from the data. Regarding the analysis of documents, which we set out to do in this paper, Charmaz states that GT of documents is able to address not only content but also their audience, production and presentation.
Analysis of documents can reveal what and whom they affect, as they do not only serve as records but explore, explain, justify and/or foretell actions [16, p. 46].
In this paper we follow Hoda's STGT and present the results of the Basic Stage of our STGT study. Publication of emerging results of this exploratory phase is encouraged by both Hoda and Charmaz. GT guidelines describe steps and a path through a long research process. Depending on the task and project at hand, GT invites researchers to use those steps flexibly to raise the analysis to the desired level of theory construction [16]. Within the framework of STGT we use strategies and epistemology from Charmaz's Constructivist GT, raising the analysis of the dataset that we take from Stack Overflow to a preliminary theory. We present our work following the recommendation of Hoda, who states that publication even of partial results is important to receive feedback from both practitioners and the research community to assess relevance and improve rigour [25].

Stack Overflow
Stack Overflow is the most popular question-and-answer website for software developers [7]. The website has become an important resource that often complements official documentation of software libraries and tools. Its strong presence on search engines, where a link to the website is very often shown on the first page of results when searching for software development related topics, indicates a reach that goes far beyond the 17 million registered users [9]. Studies that often use the official and open Stack Overflow dataset have underlined the prominence of Stack Overflow by showing, for example, that 11% of open source software projects on GitHub that were analysed in a large-scale field study contain source code snippets that were copy-pasted from Stack Overflow [6]. Since its launch in 2008, users have posted over 22 million questions, which often contain such code snippets, on a wide range of topics related to software engineering. Apart from contributions in the form of questions and answers, users are also encouraged to take part in moderation efforts. Up- and down-voting, tagging and editing of questions and answers are rewarded with badges, medals and reputation points. Questions on Stack Overflow generate living documents, which are edited by their authors and moderators, updated, and extended with comments, sometimes even a decade after they were asked. Questions and their answers can thus take the character of knowledge base articles. Barzilay et al. even argue that the moderation and reward system has transformed Stack Overflow from a mere Q&A site into a community project that gives users a sense of belonging, which generates not only high-quality knowledge but also trust in the content that is accumulated [9]. To emphasize these traits of the content on Stack Overflow that go beyond questions, we refer to the content on Stack Overflow not as questions but as posts.
Before taking part in the community by asking a question for the first time, users can take a virtual tour that explicates the goals of Stack Overflow. The tour explains that Stack Overflow is "all about getting answers. It's not a discussion forum. There is no chit-chat". Furthermore, it asks users to avoid questions that are primarily opinion-based, or that are likely to generate discussion. The platform's focus on avoiding chit-chat is also reflected in what Vadlamani and Baysal [57], and Zagalsky et al. [61] identified as the primary drivers behind contributions. Going beyond a meta-analysis of the platform, scholars have used Stack Overflow to investigate various aspects of software engineering, including for example the analysis of trends [8,59] or developers' interests [33]. Similar to our aim, the Stack Overflow dataset has also been used to investigate challenges of software developers. Based on the assumption that questions and answers on Stack Overflow cover a wide range of issues, Alshangiti et al. [2] analyzed questions in a mixed-method study to identify challenges of software engineers when developing machine learning applications.

Method
To investigate the lived experience of practitioners on Stack Overflow we take a qualitative approach that aligns with Hoda's Socio-Technical Grounded Theory (STGT) [25]. Acknowledging its iterative nature, we focus on what Hoda defined as STGT's Basic Stage for data collection and analysis. We take our initial sample from the Stack Overflow data dump, which we analyze using initial and focused coding while we write memos to constantly compare documents, codes and emerging categories. We then present preliminary hypotheses and an interpretive theory that summarizes our findings. In presenting our findings, we motivate the next iteration of our STGT study, which leads to the collection of more data (theoretical sampling) to extend, test, and develop our findings. As Hoda [25] suggests, we publish our initial findings to assess the relevance of our work and to receive feedback from the research community. Successive rounds of data collection and analysis in future work can then lead to the development of more mature theories that are valuable for the field.
Our stance with regard to our research questions is that the reality of testing practices and the experience of practitioners in a complex socio-technical environment is highly individual and not reflected by a Stack Overflow post in its entirety. Within the framework of Hoda's STGT we adopt a subjective, constructivist epistemology. Therefore, we follow Charmaz's version of constructivist Grounded Theory [16] to provide our interpretation of these complex matters. Despite our awareness of the limitations that an analysis of nonreactive documents has as they can only provide thin descriptions that lack contextual cues [24], we hypothesize that observation and thorough investigation of attitudes and sentiments expressed by practitioners in posts on Stack Overflow can yield valuable insights into practice. Furthermore, we claim that our analysis contributes to a better understanding of socio-technical dynamics in the context of software testing.
To analyze the Stack Overflow dataset for our specific purpose of investigating the sentiment associated with software testing, we first retrieve Stack Overflow posts related to testing. We then use sentiment analysis tools to identify posts that contain negative and positive sentimental expressions.
The whole process, which starts with this filtering of the Stack Overflow data dump (step 0) and ends with the construction of preliminary hypotheses and an interpretive theory (step 22), is visualized in Figure 1. Grounded theory studies usually undergo a phase of piloting and study preparation as a means to verify that the chosen tools like questionnaires or interview questions are appropriately configured and comprehensible to the study's subjects. As our study only involves the analysis of non-interactive documents, such a verification process is not applicable. Study preparation in our case is thus limited to the extraction of a subset of posts that we take from the Stack Overflow data dump and the configuration of the sentiment analysis tools that we use (steps 1 to 5).
Figure 1: Filtering and annotating Stack Overflow posts using a semi-automated approach, followed by a systematic qualitative data analysis process that leads to the construction of preliminary hypotheses and an interpretive theory.

Filtering by tags
The Stack Overflow dataset contained 53,086,328 posts concerning all domains of software development when we obtained it in August 2021 (archive.org/details/stackexchange). To extract a subset with a size that is appropriate for manual analysis, we filter all posts using a 2-step process that is outlined in this section. As illustrated in Figure 1, we begin with the full Stack Overflow post dataset (step 0) on the left side and end this process with importing post documents into a CAQDA software (step 7) on the right side (CAQDA = Computer-Assisted Qualitative Data Analysis software; we have mostly used ATLAS.TI, see: https://atlasti.com). To extract posts related to automated software testing, we first filter the dataset using tags. One or more tags are assigned to every post by its author. The list of tags is then often edited by moderators to facilitate categorization. Tags represent categories that include, among others, general concepts or methods (e.g., testing, tdd), technologies like programming languages (e.g., java, python), or specific frameworks and tools (e.g., codecov, mockito, reactjs). Posts are usually tagged with multiple, complementary tags (e.g., post 878848 is tagged with 5 tags: java, unit-testing, ibdc, mocking and resultset). Similar to Yang et al. [59], we utilized a two-step process to extract posts by searching for tags which represent general concepts and methods related to software testing. We first select all posts from the dataset that are assigned a tag that contains the word testing, which produces a set of 134,109 posts (step 1). We choose the term testing as it is used as a suggestion on the Stack Overflow platform whenever the tag test is used and because the tags testing and unit-testing are the two most prominent tags when searching for test using the tag-search.
We then manually analyze the list of 13,006 tags that were assigned to those posts and remove tags that were used fewer than 6 times, or that do not directly refer to general concepts of automated software testing (step 2). The tag codecov, for example, was removed from the list because it only occurred 5 times, and reactjs was removed as it relates to a programming framework that is not directly related to automated software testing. We also exclude tags that are related to testing but focus on a particular technology or tool (e.g., mockito), as we try to remain agnostic of testing tools and development languages. Following this procedure, we have produced a list of 30 tags that all refer to conceptual aspects of automated software testing, like unit-test, mocking, or tdd. Using this list we again extracted posts from the original dataset. We extract all posts that contain at least one tag that is present on the tag list and obtain a set of 147,833 posts (step 3). Post 878848, which is tagged with java, unit-testing, ibdc, mocking and resultset, was for example selected because of the presence of the tags mocking and unit-testing. We provide the source code of the program that we used to filter posts and the filtered dataset in our replication package [56, filter-by-tags.zip].
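The tag-based filtering described above can be illustrated with a minimal Python sketch. It is not the program from the replication package: the in-memory `POSTS` list, the function names, and the three-tag `CURATED_TAGS` set are hypothetical stand-ins for the data dump, the actual filter, and the manually curated list of 30 tags.

```python
from collections import Counter

# Hypothetical stand-in for the data dump: one dict per post.
POSTS = [
    {"id": 878848, "tags": ["java", "unit-testing", "ibdc", "mocking", "resultset"]},
    {"id": 1, "tags": ["reactjs", "testing", "tdd"]},
    {"id": 2, "tags": ["python", "pandas"]},
    {"id": 3, "tags": ["python", "tdd"]},
]

def posts_with_tag_containing(posts, word):
    """Step 1: keep posts that carry at least one tag containing `word`."""
    return [p for p in posts if any(word in tag for tag in p["tags"])]

def frequent_tags(posts, min_count=6):
    """Collect the tags of the given posts that occur at least `min_count` times."""
    counts = Counter(tag for p in posts for tag in p["tags"])
    return {tag for tag, n in counts.items() if n >= min_count}

def posts_with_any_tag(posts, allowed):
    """Step 3: keep posts that carry at least one tag from the curated list."""
    allowed = set(allowed)
    return [p for p in posts if allowed.intersection(p["tags"])]

seed_posts = posts_with_tag_containing(POSTS, "testing")
# Toy threshold of 1 because the example data is tiny; the paper uses 6.
candidates = frequent_tags(seed_posts, min_count=1)
# The manual review (step 2) cannot be automated; this set is illustrative only.
CURATED_TAGS = {"unit-testing", "mocking", "tdd"}
selected = posts_with_any_tag(POSTS, CURATED_TAGS)
print([p["id"] for p in selected])
```

In this toy data, post 3 is selected in the final step even though none of its tags contains the word testing, which shows how the second extraction can yield more posts than the first.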

Filtering by sentiment
We aimed to examine posts deeply instead of quantitatively, which limits our investigation to an analysis of a small subset of the 147,833 posts. From research done by other authors we know that only a small fraction of content posted on Stack Overflow contains strong opinions and emotional statements, as posts mostly discuss how to use a piece of technology [34]. Sengupta et al. report that only every 10th comment on Stack Overflow expresses some standalone form of emotion [53]. This motivated us to create an emotionally rich subset by filtering the dataset using a semi-automated approach that employs sentiment analysis to select posts that contain sentimental expressions. Following the advice of Zhang et al. [62] to not rely on a single tool, we used the transformer model RoBERTa [35] in combination with the SentiCR tool [1]. We trained both tools with a labeled dataset of Stack Overflow provided by Lin et al. [34] (step 4). Their dataset contains 1,500 sentences from Stack Overflow posts discussing Java libraries, which were manually labeled by the authors with the sentiment polarities positive, negative and neutral [34]. We then used the trained tools to automatically annotate every paragraph of every post of our tag-filtered dataset with a sentiment polarity (step 5). From this annotated dataset we then randomly extracted posts from 5 categories, using a simple condition for each category (step 6).
Positive: both tools classified at least one paragraph as positive and none as negative
Negative: both tools classified at least one paragraph as negative and none as positive
Both: both tools classified at least one paragraph as positive and at least one as negative
Neutral: both tools classified all paragraphs as neutral
Random: randomly selected independent of classification
Especially because of concerns raised by Lin et al. [34] and Jongeling et al. [29], who state that sentiment analysis tools often do not provide good results for software engineering texts, we used the last two categories Neutral and Random in a later stage of our analysis to validate our semi-automated filtering approach. We evaluate whether filtering posts with the tools RoBERTa and SentiCR provides a dataset with more sentimental posts than a random selection. We choose paragraphs instead of a finer-grained sentence-level separation because we hypothesise that a paragraph is more likely to hold a comprehensive and conclusive thought than short sentences that are taken out of context. We argue that sentiment classification done on that level better supports our goal to group posts into categories of positive and negative posts. In a finer-grained sentence-level classification, one short and slightly negative remark in an otherwise very positive paragraph is much more likely to produce a wrong result, contrary to what we want to achieve. The sentiment analysis tools we used in this study both support the approach of classifying text with multiple sentences. The posts obtained by our semi-automated filtering approach were imported into a CAQDA software (step 7) that was used to aid all further steps of the data analysis. Initially we analyzed 25 posts from each category (Random, Neutral, Positive, Negative and Both).
We then added another 25 posts from each sentimental category (Positive, Negative and Both), to reach a point at which the analysis of additional posts did not provide new insights or perspectives in the form of new codes. After adding the second batch of 75 posts, and before reaching the 200th post, we reached saturation. Posts no longer provided content that did not fit into the categories which had already emerged at this point. We therefore analysed a total of 200 posts. Figure 2 shows the 20 most frequently occurring tags that were assigned by authors and moderators to those 200 posts. When creating our dataset and selecting the posts, we looked for sentimental discussions about testing without selecting or excluding specific technologies. We do not focus on how practitioners sentimentally evaluate specific tools, e.g., the Java unit testing library junit. We instead take a broader, tool-agnostic perspective. Nevertheless, to provide context to our dataset, it is interesting to observe which tags (both tool-agnostic and tool-specific) are assigned to the questions that are included in our dataset. In particular, these tags indicate that our dataset transcends a particular programming language or technology stack. The replication package we provide contains the source code of our implementation of the sentiment analysis pipeline [56, filter-by-sentiment.zip].
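The five selection conditions can be made concrete with a small Python sketch. It does not reproduce the pipeline from the replication package. It assumes that each tool returns one label per paragraph (positive, negative or neutral), and it reads "both tools" as meaning that each tool must satisfy the condition on its own; the function names and the `annotated` example data are hypothetical.

```python
import random

def categorize(labels_a, labels_b):
    """Map the per-paragraph labels of two tools to a post category.

    Returns 'Positive', 'Negative', 'Both', 'Neutral', or None when the
    two tools disagree and the post fits none of the four conditions.
    """
    pos_a, neg_a = "positive" in labels_a, "negative" in labels_a
    pos_b, neg_b = "positive" in labels_b, "negative" in labels_b
    if pos_a and pos_b and neg_a and neg_b:
        return "Both"
    if pos_a and pos_b and not (neg_a or neg_b):
        return "Positive"
    if neg_a and neg_b and not (pos_a or pos_b):
        return "Negative"
    if not (pos_a or pos_b or neg_a or neg_b):
        return "Neutral"
    return None

def draw_sample(annotated, category, k, seed=0):
    """Randomly draw up to k post ids from a category ('Random' ignores labels)."""
    pool = sorted(
        post_id
        for post_id, (labels_a, labels_b) in annotated.items()
        if category == "Random" or categorize(labels_a, labels_b) == category
    )
    return random.Random(seed).sample(pool, min(k, len(pool)))

# Hypothetical annotations: post id -> (labels of tool A, labels of tool B).
annotated = {
    101: (["positive", "neutral"], ["neutral", "positive"]),
    102: (["negative", "neutral"], ["negative", "negative"]),
    103: (["positive", "negative"], ["negative", "positive"]),
    104: (["neutral", "neutral"], ["neutral", "neutral"]),
    105: (["positive", "neutral"], ["neutral", "neutral"]),
}
print(draw_sample(annotated, "Positive", 25))
```

Post 105 falls into none of the four sentiment categories because only one tool found a positive paragraph; under the stated assumption such a post can only enter the sample through the Random category.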

Data Analysis
We employed strategies from grounded theory as recommended by Hoda [25] and Charmaz [16] to analyse the filtered Stack Overflow dataset. To begin the iterative process of constructing abstract analytic categories out of which we formulated preliminary hypotheses as illustrated in Figure 1, we use initial coding (step 8), applying codes to the dataset line by line in three rounds. We started without any preliminary codes, remaining open to all possible theoretical directions especially during the first coding cycle. In addition to coding posts with gerunds (e.g., describing instead of description), we use In-Vivo codes, which are quotations of what the author of a post wrote in their own language. In-Vivo codes are put in between quotation marks and used whenever authors express themselves in a strong and emotionally rich way (e.g., "is my code just bad?") (step 9). In order to provide basic statistical information about the occurrences of negative and positive sentiment in the dataset, we use magnitude coding as suggested by Saldaña [51], adding the symbols + and - to codes where applicable (step 10). Negative expressions are coded with a minus (e.g., -Reflecting unclean approach), and positive sentiments with a plus (e.g., +Embracing change). We write memos during all stages of our data analysis (step 11), which we use at a later stage to develop preliminary hypotheses (step 12).
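As an illustration of how magnitude-coded codes yield basic counts, the following Python sketch tallies codes by their leading symbol. It is only a sketch under the assumption that codes are available as plain strings exported from the CAQDA software; the function names are hypothetical, and the example codes are the ones quoted above.

```python
from collections import Counter

def magnitude(code):
    """Classify a code by its magnitude prefix: '+' positive, '-' negative."""
    if code.startswith("+"):
        return "positive"
    if code.startswith("-"):
        return "negative"
    return "unmarked"

def tally(codes):
    """Count how many codes carry each magnitude."""
    return Counter(magnitude(code) for code in codes)

codes = [
    "-Reflecting unclean approach",
    "+Embracing change",
    '"is my code just bad?"',  # In-Vivo code without a magnitude symbol
]
print(tally(codes))
```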
After three rounds of initial coding, we reassess the significance of all codes to decide which ones contribute most to an incisive and complete categorization. As Charmaz suggests, we use this technique to condense the work of the initial coding phase, to advance the theoretical direction of the work, and to begin with a second cycle of focused coding (step 13) [16]. During focused coding cycles we develop initial codes into focused codes (step 14) and categorize documents while we construct and continuously refine a codebook (step 15). In our codebook we spell out details like inclusion and exclusion criteria, descriptions, and examples for each focused code. Because of suggestions made by Lopez et al. [36], who have shown that comments on Stack Overflow can reveal expressions of pride and emotional involvement, we also incorporate comments made on Stack Overflow into our analysis. Other additional information obtainable via the Stack Overflow website, like the history of changes made by the original author or a moderator, is also considered during focused coding. We understand a post as a potential entryway into a deeper and richer context of an author's question. This context includes details such as the sentimental activity in comments, the editing history of a post both by the author and moderators, the reasons for a moderator to close a post, and the time it took the community to answer the question or the fact that it was never answered. Where a post offers these details (not all of them do), we capture the information by writing analytical memos. One memo about post 55357595 with the title Fruitless pursuit, written by one of the authors, for example reads:

Memo: Fruitless Pursuit
The author of this post did not receive any feedback from the community. But almost a month after posting this question, the author just comments: "Ended up setting up a webpack from the ground up" Which I think indicates that this person has gone through quite some torment. However, they do not express this explicitly.
During the process of focused coding, we also assign a sentiment of positive, negative, both or neutral to each post. Here the assigned sentiment represents the overall attitude of the author towards testing practices 16. We use both the coding of sentiment 10 and the assignment of the overall sentiment 16 to determine the accuracy of the sentiment analysis pipeline 17 and to evaluate its use in our filtering process 18. During the focused coding cycles, preliminary analytic categories became visible to us 19. A large number of negative posts containing expressions of desperation, for example, developed into the category Discouragement early on. We refine categories that become visible through the process of coding, using a diagramming technique described by Saldaña [51] 20. Starting with a code like Expressing desperation or a post that creates ambiguity when assigned to a category, we sketch a network of connections to other posts, categories or codes on paper to explore detailed features of the coded dataset from different angles. We then use the clustering strategy as described by Charmaz [16], grouping posts together and writing memos, concentrating on commonalities and differences among those groups of posts 21. Taking a different perspective each time, we find different explanations for the meaning and context of sentiment expressed by practitioners in posts. We continue the process of analyzing the dataset using these strategies until they no longer yield new perspectives and we are able to formulate preliminary hypotheses and an interpretive theory that emerges from the process 22.

Constructing Interpretive Theory
Synthesizing the insights and hypotheses we obtained by engaging with the data through the whole data analysis process described above, we formulate an interpretive theory. Interpretive theory aims to offer accounts for what is happening and how it arises, and to explain why it happens [16, p. 230]. In this work we approach interpretive theory and its construction from a pragmatist viewpoint. We recognize that our statements can only correlate our interpretation of the experience of individuals with our own experience, and the body of knowledge from the field that is available and known to us [41]. Taking this viewpoint, we emphasize practice and action rather than trying to explain the empirical phenomena described in the analysed data by providing laws that are testable by empirical objective observation. Concretely, interpretive theory in this paper concerns what authors of posts assume about what they describe, how these assumptions or views might have been constructed, and how the authors seem to act on their views. By taking this approach to theory construction, we want to make phenomena and relationships between them visible in order to open up new vantage points for our own and the future work of others. We understand theorizing as an ongoing activity that can be continued through this future work [16].

Results
In this section we describe our findings and offer an interpretation of the data we analyse to answer the research questions. We first discuss the result of applying sentiment analysis tools to create a dataset that is rich in sentimental expression. We then present the results of our qualitative data analysis of this dataset to first show how software engineers express sentiment about testing and which underlying factors contribute to their sentiment. We then present a preliminary interpretive theory that synthesizes our findings. The data in which this preliminary theory is grounded, and all artefacts that are discussed in this section are contained in our replication package [56, coded-dataset.qdpx].

[Figure 3: flow diagram with the columns Buckets, Expressions, and Overall; one of the category labels is Unrelated]
Readers can follow our analysis by using the online content on Stack Overflow. We enable this by providing a link to the original post on the Stack Overflow website that can be followed by clicking on the ID next to the quotation of a post. Example quotation: "This is all working as I would expect" (3340677)

Sentiment analysis for qualitative research
Our sentiment analysis pipeline takes a Stack Overflow post as its input, classifies each paragraph of the post independently using two different sentiment analysis tools and takes the result of both tools into account to indicate if a post is likely to be positive, negative, neutral, or mixed in sentiment. Using this pipeline we created buckets of positive, negative and mixed sentiment posts, containing 50 documents each and added an additional 25 neutral and 25 randomly selected posts to our analysis in order to validate our method. Our motivation to filter the dataset using sentiment analysis tools stems from research by Sengupta and Haythornthwaite [53], which indicates that randomly selecting posts from the Stack Overflow dataset will only provide few sentimental posts, as the majority of posts is objective or focused on technical issues. Our approach relies on multiple sentiment analysis tools to address a problem that was identified by Lin et al. [34] demonstrating that sentiment analysis can introduce a strong bias when relying on a single tool. In Figure 3 we compare the classification of our sentiment analysis pipeline (left column) with the sentiment that we actually identified in posts during initial coding (center and right column). We differentiate between occurrences of sentimental expressions in documents (center column) and the overall sentiment of a document (right column). Using the metrics which are visualized in Figure 3 we evaluate how suitable our method is to create a dataset that can be used to find answers for our research questions and if it is applicable for other qualitative studies on Stack Overflow.
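The text does not spell out how the results of the two tools are combined, so the following sketch should be read as one possible reading rather than as the pipeline itself. It assumes that a paragraph only counts as sentimental when both tools agree, and that a post is labelled mixed when it contains at least one positive and one negative paragraph. The two keyword-based classifiers (`toy_tool_a`, `toy_tool_b`) are hypothetical stand-ins for real sentiment analysis tools.

```python
from typing import Callable, List

# A classifier maps a paragraph to "positive", "negative" or "neutral".
Classifier = Callable[[str], str]


def toy_tool_a(paragraph: str) -> str:
    """Hypothetical stand-in for the first sentiment analysis tool."""
    text = paragraph.lower()
    if "love" in text or "great" in text:
        return "positive"
    if "waste" in text or "stuck" in text:
        return "negative"
    return "neutral"


def toy_tool_b(paragraph: str) -> str:
    """Hypothetical stand-in for the second sentiment analysis tool."""
    text = paragraph.lower()
    if "love" in text or "good" in text:
        return "positive"
    if "stuck" in text or "bad" in text:
        return "negative"
    return "neutral"


def classify_paragraph(paragraph: str, tools: List[Classifier]) -> str:
    """Assumed rule: keep a label only if all tools agree, else neutral."""
    labels = {tool(paragraph) for tool in tools}
    return labels.pop() if len(labels) == 1 else "neutral"


def classify_post(post: str, tools: List[Classifier]) -> str:
    """Classify each paragraph independently, then derive a post label."""
    paragraphs = [p for p in post.split("\n\n") if p.strip()]
    labels = {classify_paragraph(p, tools) for p in paragraphs}
    has_positive = "positive" in labels
    has_negative = "negative" in labels
    if has_positive and has_negative:
        return "mixed"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"


post = "I love the crisp syntax.\n\nNow I am stuck with the setup."
print(classify_post(post, [toy_tool_a, toy_tool_b]))
```

Requiring agreement is a conservative choice that trades recall for fewer single-tool false positives; other combination rules are equally conceivable.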

Occurrences of sentimental expressions in posts
Occurrences of sentimental expressions in posts were identified and annotated during the first coding cycle when posts were coded line by line. The line "I understand that using aunit can be a time-saver"(3412892 ) was classified as positive for example, but the same post also contains the expression "I looked at the aunit manual and I didn't find easy examples to start with", which was classified as negative. Post 3412892 , which we took from the positive bucket, was therefore assigned the category of both sentiments at the level of expressions. The flow from the first to the second column in Figure 3 shows this relation, presenting which posts from each of the sample buckets contained expressions of the respective sentiment. 20 posts from the bucket of positive posts for example indeed contained one or more positive sentimental expressions and no negative ones. In Figure 3 this relation is represented by the flow from positive in column one to positive in column two, highlighted in green. However, 2 of the 50 posts from the same bucket did not contain a positive expression but at least one negative expression (flow from positive to negative), 7 posts contained at least one expression of each sentiment (flow from positive to both) and 21 posts from the positive bucket did not contain any sentimental expressions (flow from positive to neutral). Flows from the negative and positive buckets to the neutral category in column two indicate that a lot of posts identified as positive or negative by our pipeline in fact did not contain any sentimental expressions. Comparing this lack of accuracy with the results for documents that we obtained from the random bucket suggests however that our sentiment analysis pipeline indeed managed to select more sentimental posts than a random selection would have. Crucially, we did not find a single positive expression in the set of 25 randomly selected posts. 
Additionally, comparing the remaining flows between column one and two in Figure 3, we see that the majority of posts that turned out to contain sentimental expressions were indeed extracted from the respective bucket. The findings of this first analysis of the accuracy of the sentiment analysis pipeline therefore support our hypothesis that a semi-automated approach proves beneficial when used to create and analyse a subset of Stack Overflow posts with both negative and positive sentiment.
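The expression-level categorization described in this subsection can be summarized in a few lines. The sketch below is illustrative only: the function name `expression_level_category` is an assumption, while the four flow counts for the positive bucket are the ones reported in the text.

```python
from collections import Counter


def expression_level_category(expression_sentiments):
    """Derive a post's expression-level category from its coded expressions."""
    found = set(expression_sentiments)
    has_positive = "positive" in found
    has_negative = "negative" in found
    if has_positive and has_negative:
        return "both"
    if has_positive:
        return "positive"
    if has_negative:
        return "negative"
    return "neutral"


# Post 3412892 contains one positive and one negative expression.
print(expression_level_category(["positive", "negative"]))

# Flows from the positive bucket (column one) to column two, as reported.
positive_bucket_flows = Counter(positive=20, negative=2, both=7, neutral=21)
total = sum(positive_bucket_flows.values())
sentimental = total - positive_bucket_flows["neutral"]
print(total, sentimental)
```

The four flows add up to the 50 posts of the positive bucket, of which 29 contained at least one sentimental expression.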

Overall sentiment of posts
In Figure 3 the last column shows the conformity or difference of the overall sentiment of posts determined by us in comparison with our pipeline. We determined the overall sentiment during the second, focused coding cycle. During this analysis, we realized that 39 posts were not usable for further inquiry. The majority of those posts were too short (34); one author simply asks "Which is the best framework for automatic testing in octave? Why?" (2073244). The other five of those unusable posts were identified as unrelated to our work, like a post in which a practitioner asks "How to use Jquery Ajax Cache" (2398092), mentioning testing but referring to something that is unrelated to automated testing. The dark green and dark red flows in Figure 3, from column one via column two to column three, show that posts from the positive and negative buckets that contain expressions with that sentiment were mostly leaning in that direction overall as well. There are only a few outliers: posts that were, for example, classified as negative by our pipeline and indeed only contained negative expressions, but were found to express an overall positive sentiment. One such post contains the negative expression that "[it] is copy-paste code, which I thought was generally not recommended" (9271925), not mentioning anything positive or negative apart from that. However, the overall sentiment of the post was interpreted as positive, as the author shows a constructive willingness to improve while being open and conscious of their own mistakes. In total, there were only 12 such cases where the sentiment classification of the pipeline completely diverged from our classification. Documents from the both bucket of our dataset, even when they indeed contained expressions of both sentiments, were in most cases negative overall. The analysis also shows that the both bucket contributed the most sentimental posts to our dataset.
Our analysis of the overall sentiment of posts indicates that subtle remarks and the context of a sentimental expression make the overall classification of posts difficult. Subtracting unrelated (5) posts, randomly selected posts (25) and those that were too short for analysis (34), we can report that the sentiment prediction was correct for 46% of all documents (65 of 141). Overall our approach yielded a dataset in which approximately half of all documents were sentimental (108 of 200). We provide an annotation file with our replication package that contains sentiment annotations for each post that we analyzed on both the level of expression and overall, including the source code to generate graphs and statistics from that annotation file [56, data/annotations.json].
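For readers who want to retrace the reported ratios, the following sketch shows how agreement between pipeline labels and manually assigned overall sentiment could be computed. It does not read the annotation file, whose format is not described here; the function name `agreement` and the four sample pairs are hypothetical, and only the counts 65 of 141 and 108 of 200 come from the text.

```python
def agreement(pairs):
    """Share of posts whose pipeline label equals the manually assigned one."""
    matches = sum(1 for predicted, assigned in pairs if predicted == assigned)
    return matches / len(pairs)


# Hypothetical (pipeline label, manually assigned overall sentiment) pairs.
sample = [
    ("positive", "positive"),
    ("negative", "positive"),
    ("negative", "negative"),
    ("positive", "neutral"),
]
print(agreement(sample))

# Ratios reported in the text, rounded to whole percentages.
correct_share = round(100 * 65 / 141)
sentimental_share = round(100 * 108 / 200)
print(correct_share, sentimental_share)
```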

Sentiments that affect attitudes
Before describing and comparing occurrences of sentimental expressions which we identify in the dataset by presenting focused codes and analytical categories, we provide examples which demonstrate how we moved from the data, through codes, towards a more abstract interpretive theory. Document 878848 was first coded line by line and was assigned, among others, the initial code -Expecting a lot of Work From Mocking. The code with the prefix "-", which indicates that the expression reflects negative sentiment, was assigned to the following line: "Use EasyMock, write looooong mocking sequence. VERY BAD solution: hard to add initial data, hard to change data, big test debugging promices.". During the second and third initial coding cycle the code was then changed to -Expecting Mocking to be Bad Solution. Other posts hold similar notions and were coded with the same code (e.g., "There is no point in mocking out a whole ngrx entity store, so I would just like the selector to return exactly that object and be done with it." (58840818 )). During focused coding, the code changed once again and became more abstract and analytical: "Judging subjectively". The comparison of posts with similar codes revealed that expectations which are expressed sentimentally, like the examples above, are not based on objective observations but on subjective perceptions often connected to personal experience. The intention (or action) of the author here does not seem to be the objective revelation of their expectations, but the subjective judgement in order to position themselves. In one memo titled Experienced ambiguity this notion of subjective judgement and ambiguity was noted by one of the authors during focused coding.

Memo: Experienced ambiguity
The practitioner is struggling with adopting a new framework. Some things are easy and some are challenging. The practitioner is faced with a situation in which there is no easy or obvious way forward. They are stuck and forced to make an uncomfortable decision. However the willingness to resolve the ambiguity here still reflects a very positive attitude. The practitioner already has some clues and they are reasoning from experience. Looking at the comments, I realized that the post was closed quite quickly. It only took about 10 hours and the issue was solved by a maintainer of the framework project which is mentioned in the post. The fact that the author of the post reacts very enthusiastically supports my hunch that their attitude was actually quite positive all along.
The memo was originally created when analyzing another post (823276), but was then connected to post 878848 as well. Later, during a diagramming session, the aforementioned memo, some related focused codes and both posts (823276 and 878848) were assigned to a collection labeled Confidence, which generated new memos and more abstract perspectives. Both this collection and the memo mentioned above also contributed to the forming of the categories Aspiration and Exploration. Post 878848, which ultimately ended up in the category Aspiration and was categorized to reflect both positive and negative sentiment, further revealed what might be the conditions for aspiration to arise in the context of software testing. We compared the post with others of the same category and identified that knowledge and experience seem to enable practitioners to stay positive despite being stuck in situations where there is no obvious way forward. Concretely, we hypothesize that the notion of explicitly comparing capabilities of approaches, not only in terms of features, but also in terms of maintainability, indicates confidence and experience of the author on Stack Overflow. Ultimately, memos written about those considerations and others enabled us to construct the preliminary interpretive theory which we present at the end of this section. Specifically, the aforementioned post 878848 supports the hypothesis that experience and knowledge can give practitioners an extra degree of trust and confidence, from which an aspirational attitude towards testing seems to emerge.

Focused codes
Using focused coding techniques as recommended by Charmaz [16], we identified 22 codes that were assigned to a total of almost 700 different text sections of the 200 posts that we analyzed. Table 1 lists all codes, a description for each, and a diagram that indicates how many posts that contained the code were identified to be either positive, negative, neutral or of both sentiments. The full codebook that we provide as part of the replication package of this paper contains inclusion and exclusion criteria, and examples for each code [56, codebook.ods].
Reassuring the Reader (F.2) Making a statement to restore confidence. Like a claim that a manual has been read, or a tutorial has been followed.
Pursuing Ambition (F.3) Constructive attitude to achieve a goal. The implementation of something, extension of knowledge or something else that goes beyond just getting the job done.
Willing to Improve (F.4) 14 4 11 13 (42) Author indicates that they have an ambition to change and improve something.
Facing Uncertainties (F.5) Expression of insecurity through description of ambivalence or doubt.
Expressing Desperation (F.6) 31 7 (38) Author expresses their desperation directly, either by asking a question or by indicating that they are clueless.
Judging Subjectively (F.7) Explicit subjective valuation of the apparent characteristics, behaviour or value of something.
Admitting Lack of Knowledge (F.8)
Searching for a New Path (F.9) The goal or approach has been thought through but the author has a hunch that there is another, better way.
Contemplating Complexity (F.10) 7 5 10 9 (31) Author is describing something that has to do with the complexity of a setup or use-case. Complexity is either highlighted or reflected implicitly.
Missing Capability (F.11) 2 4 13 11 (30) Description of issues, circumstances, hurdles or other discomforts that stop one from reaching a goal. Capabilities can be the capabilities of a software, its limitations, but also the own capabilities to solve an issue.
Referring to External Information (F.12) Reference is made to a resource that is accessible to the author. Documentation, blog posts, books etc.
Contemplating Failure / Difficulties (F.13) Author shares their opinion about what they find difficult or failure they are facing.
Looking for Starting Point (F.14) Request for a starting point to tackle something that is unknown or unclear.
Facing an Obstacle (F.15) 11 14 3 (19) An obstacle makes it impossible to continue with a task. The author is stuck because of the obstacle.
Reflecting Experience (F.16) Positive or negative reflection which is related to past experience.
Struggling to Understand (F.17) Author is struggling to grasp the meaning of a faced problem or a concept they want to learn. Like admitting that they are not able to comprehend something or that something is hindering them to learn something.
Seeing Own Mistakes (F.18) Realization of an error or a misconception. Revelation of having done something in the wrong way or in a way that can be improved.
Comparing Different Approaches (F.19) Description of multiple angles to solve an issue or a task.
Trial and Error (F.20) Describing different attempts to get to a solution which are all unsuccessful.
Aiming at a workaround (F.21) Practitioner identifies that a situation can be solved by using some workaround which is probably not the ideal solution.
Excluding Solution (F.22) There is a solution for a problem but the author does not want or cannot use it.
Comparing the codes and corresponding posts with each other reveals underlying sentiment of practitioners that relate to testing practice. The codes reveal patterns that affect attitude and testing practices of software engineers and allow us to propose answers to RQ1.

RQ1
How do software engineers express sentiment about testing on Stack Overflow?
In total, the dataset that we have analyzed contains 108 sentimental posts. In 32 posts, practitioners expressed positive sentiments, 63 posts were negative, and 13 contain both sentiments. To highlight some of the patterns which show how sentiment is expressed, we elaborate on the eight most frequently occurring codes from Table 1.
Judging Subjectively (F.7). About one third of sentimental posts (32 of 108) contained an explicit subjective statement about apparent characteristics or value. Subjective expressions like that of one practitioner who "fell in love with the crisp syntax [of a framework] immediately" (1072952) underline the attitude of the author. Negative attitudes connected to judgement, like one practitioner reflecting on a specific practice which "seems like a waste of time" (29894788), were rarer in the dataset than positive attitudes. One practitioner for example reflects positively "that [running tests concurrently] will force [them] to refactor some code to make it thread-safe, but [they] consider that to be a good thing :-)" (4970907). In total, more than one third of all positive posts (14 of 32) contained a subjective judgement, compared to only every fifth negative post (12 of 63).
Admitting Lack of Knowledge (F.8), Facing Uncertainties (F.5) and Reassuring the Reader (F.2). Outlining the limits or lack of their own knowledge and abilities, by stating for example that they are "a newbie" (29894788), or indirectly pointing out that they are "stuck trying to [...] test an extremely simple project" (62177256), occurs both in positive and negative posts, in around a quarter (27 of 108) of all sentimental posts. In addition to describing their own limits by admitting a lack of knowledge, we identified descriptions of ambivalence ("Which is the correct way?" (41262775)), doubt ("Has anyone done anything similar before or is this crazy?" (7213917)), or uncertainty ("It seems to me that, I maybe should be creating a Fake MaterialRepository, rather than mocking it?" (23534123)) expressing insecurity in around a third (35 of 108) of all sentimental posts. We also found statements indicating that the author is trying to maintain or restore their confidence by reassuring the reader in more than a third of sentimental posts (43 of 108). One author for example is stuck in a situation where they observe something unexpected and they "want to understand why that is like this" (39592949), wondering if "there is a better way", even being afraid that their "code is just bad", but still holding on to their approach as they reassure the audience that "When [they] change [something,] everything works fine".
Pursuing Ambition (F.3) and Willing to Improve (F.4). Uncertainties and a lack of knowledge were found equally frequently in negative and positive posts, but descriptions of constructive attitudes to achieve a goal that goes beyond just getting the job done were mostly found in positive posts, or posts that contain both sentiments. We identified direct expressions of ambition by practitioners, for example "to create a support library that could be used by all test projects" (18399610), or mentions of the context of a challenge that underline its ambitious nature, like "writing acceptance [...]
Expressing Desperation (F.6) and Unexpected Behaviour (F.1). Contrary to ambitions, we also found expressions of despair by practitioners who are stuck, saying that they for example "googled wide and far, but did not get any answer" (58840818), or remain completely helpless, begging for support like one practitioner who asks: "Can somebody please, please, please for Pete's sake [...] fix this bug that thousands are having?" (44762082). We did not observe expressions of desperation in positive posts or posts with both sentiments, but we did find them in almost half (31 of 63) of negative posts. Additionally, we identify descriptions of unexpected behavior in more than half of negative posts (41 of 63). Covering a big fraction of the dataset, unexpected behavior is experienced by practitioners in many different contexts, referring to testing practices or the development environment ("When I test it in browser, everything is OK, because App\User exists, but when I test my plugin, App\User doesn't exists" (52760148)), or referring to something that is not directly related to testing but discovered through it, like facing a floating point precision error for the first time, noticing that "When I'm running the tests it's broken because 0.1 is not equal to 10%" (63886733).

From Codes to Categories
We use codes to compare posts with each other in a structured way. Codes enable us to scrutinize the dataset from different perspectives. Co-occurrences of codes within posts for example reveal patterns in the data that can be indicators for categories. We identified four major factors that describe the non-technical, situational context of sentimental posts with which we can categorize the posts. In this section we present each category and their characteristics, highlighting key insights that emerged from the data during our analysis when categories were outlined. The categories reveal underlying currents that affect the testing practices of software engineers. Categories which highlight what influences their attitude and motivation are the basis of what we propose as answers to RQ2.
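Co-occurrence of codes within posts, as mentioned above, can be counted mechanically before it is interpreted. The following minimal sketch is an illustration and not the analysis procedure used here; the post identifiers and the assignment of focused codes to them are hypothetical, and only the code IDs (F.2, F.3, F.4, F.12, F.13) come from Table 1.

```python
from collections import Counter
from itertools import combinations

# Hypothetical assignment of focused codes to posts (placeholder post IDs).
coded_posts = {
    "post-a": {"F.13", "F.2", "F.12"},
    "post-b": {"F.13", "F.2"},
    "post-c": {"F.3", "F.4"},
}


def co_occurrences(posts):
    """Count how often each unordered pair of codes appears in the same post."""
    counts = Counter()
    for codes in posts.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(codes), 2):
            counts[pair] += 1
    return counts


counts = co_occurrences(coded_posts)
print(counts.most_common(1))
```

Frequent pairs are only indicators; whether they point to a category still has to be decided by comparing the underlying posts and memos.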

RQ2
Which factors affect sentiment of software engineers towards testing practices?
Discouragement (C.1) 42 10 (52)
We found that attitude in negative sentimental posts often (42 out of 63) expresses discouragement (C.1) from testing. When authors sentimentally express discouraging setbacks in their testing efforts by contemplating difficulties or failure (F.13), they are at the same time often reassuring the reader (F.2), implying that the problem cannot be blamed on them [63795587, 14942409, 19490583, 18083834, 19799393, 25264248, 26370705]. Statements that a tutorial or documentation (F.12) was followed and thoroughly read, or reports of elaborate debugging (F.20) [...]

Discouraging sentiment about testing is provoked in complex development environments. This includes company policies or unique infrastructure configuration. When such factors combine with technical issues, experienced by the practitioner as unexpected behavior, they create obstacles that discourage practitioners from testing. A complex environment makes the usage of a standard testing tool chain unexpectedly challenging, especially when practitioners lack experience in testing. Documentation or other external resources do not help in these cases and long fruitless pursuits of trial and error are reported.

Exploration (C.2)
Exploratory sentiment to discover and learn is expressed both positively and negatively by practitioners. Trust in a method or technology, based on experience or inspiring external impulses, arouses positive attitudes. When exploration serves clarification in situations of uncertainty, it is the experience of unexpected behaviour of technology that causes negativity, especially when practitioners lack experience.

Reflection (C.3)
Application of testing practices can lead to ambiguity. Applying the right method in a particular situation, for example, can be challenging. Awareness of blind spots and knowledge of the great variety of tools and methods is a factor that allows practitioners to keep a positive attitude. Variety and ambiguity can then even be appreciated. When failure or complications cause ambiguity, however, sentimental reflection is negative.

Aspiration (C.4) 11 3 5 (19)
Opposite to posts from the category of discouragement, we identify aspiration in posts which express almost exclusively positive attitudes towards testing. Specifically, aspiration reflects a degree of freedom that allows exploration and discovery in a constructive way. In particular, the motivation is not to find a workaround or to overcome an obstacle, nor do authors elaborate on extensive debugging or trial and error. Instead, authors pursue ambitions (F.3) that go beyond a particular situation [...]

Understanding of long-term goals and the value of testing arouses aspirational sentiment. Not being trapped in a problematic or complicated situation and not having to deal with an immediate obstacle creates the space that is required for this aspirational attitude. It allows practitioners to build essential knowledge before their ignorance produces problems.

Factors that arouse sentiment
To answer RQ2 (Which factors affect sentiment of software engineers towards testing practices?), we summarize key insights we gained by developing the above categories. We identify that practitioners on Stack Overflow express sentiments when they are either discouraged (C.1) from pursuing their goal, aspiring (C.4) towards something that goes beyond their usual practice, reflecting (C.3) on their testing experience and knowledge, or exploring (C.2) what is still unfamiliar to them. Posts which indicate aspiration (C.4) are positive in sentiment, and posts that describe notions of discouragement (C.1) from testing mostly reflect negative sentiment. Common factors can be identified even among those two almost inverse categories. Concretely, we identify that the experience of unexpected behavior is an important factor that leads to negative sentiment expressed through discouragement. Even when exploring (C.2) or reflecting on (C.3) testing practices to learn and gain knowledge, practitioners express negative sentiments when they face unexpected behavior that causes ambiguity. Additionally, the data suggests that an absence of those unexpected setbacks enables conditions for practitioners to aspire. Through reflection and exploration, these conditions allow them to build knowledge and experience, which is likely to prevent those unexpected setbacks in the future. Trust in testing practices that is established through these experiences contributes to positive sentiments when new practices are explored. We find the same to be the case for an awareness of blind spots. Reflection (C.3) on testing practices that expresses an awareness of blind spots reflects positive sentiment and attitude. Uncertainty in those cases inspires practitioners instead of discouraging them.

Trust, Complexity and Testing - Preliminary Theory
We set out to discover what makes practitioners sentimental about testing by looking at how they express sentiment on Stack Overflow. We want to know which factors and situations contribute to sentiment. By analyzing, categorizing, and comparing the dataset, we got a glimpse of what the experience of practitioners who ask questions on Stack Overflow must be like. Codes and categories described in the previous section enabled us to analyze the dataset systematically using techniques like clustering and diagramming. In this section we present a preliminary interpretive theory that describes what became visible from our perspective, which is grounded in the analyzed dataset. To let the data speak for itself, we provide references to the original posts on Stack Overflow immediately in the text. With each quotation from posts, we also provide a reference to the code that was assigned to the respective text section where applicable. Figure 8 illustrates our preliminary theory as an interplay of the most crucial factors which we identified to have an influence on sentiment towards testing on Stack Overflow. We first elaborate on the right side of the figure, which shows discouragement (C.1) in the context of software testing, and how the negative sentiment around it is aroused in situations where complexity plays a central role. We then turn to the left side of the figure, elaborating which role exploration (C.2), reflection (C.3) and aspiration (C.4) play in the context of testing.
"I was starting to break as much as I was fixing. So I decided I'll start from scratch, with TDD this time" (29894788) (F.3).
Testing practices and approaches are multifaceted. Even in cases where practitioners are just "having a play with testing" (28129825) (F.4) to improve their code base, or just to "understand the essence of it" (44202672) (F.3), they are quickly faced with multiple tools and have to make difficult choices regarding the technique or tools to adopt for a use-case. The dataset that we analysed demonstrates that testing software is not a single-tool or single-method practice. We observe that the big landscape of software testing tools and the resulting diversity of possibilities to practice testing amplify ambivalence when practitioners lack experience and knowledge [878848, 1006189, 12950163, 601973, 17320143]. The question whether or not "I [am] missing something in my pursuit of cool and trendy stuff [...] ditching the old proven [ways]" (2894608) (F.5) expresses the lingering insecurities of practitioners who are plunging into a world where many and often unexpected aspects of software engineering suddenly come together [823276, 43435227, 1454949]. As software projects get more complex, the ambition "to fully automate testing [...] in the most simple way possible" (16938742) (F.3) using advanced practices that are able to tackle this increased complexity grows as well. Our investigation indicates that this clash between a lack of experience in testing on the one hand and complicated challenges on the other hand drives attitudes around software testing [4991264, 43435227]. As shown in Figure 8 as a circular pattern, we identify that a growth in complexity of either the development environment or the software project itself makes practitioners ambitious to learn (more) about software testing [1072952, 16938742, 1006189].
But a high level of complexity of production code (top of Figure 8) also requires complex testing code which in turn requires more than basic knowledge of testing (bottom of Figure 8). The interplay of growing ambition, a complex environment, and a lack of knowledge is reflected in a question about an easy way to write a unit test. Unfortunately, practitioners only start to face their ambiguities and insecurities around testing when they are "starting a new project, that promises to be much bigger and more involved than anything [they] have done in the past"(6684337 ) (F.4). In other words: instead of learning testing practices by starting with simple, comprehensible setups and then iteratively building knowledge as the complexity of test suites and source code under test grow simultaneously, practitioners throw themselves into cold water when it is too late for simple, approachable solutions [19490583 , 6475042 , 53657417 , 4659714 ]. When the silver bullet is not found, they are discouraged from pursuing their ambition [878848 , 63795587 , 14942409 , 7960832 ]. Our data analysis suggests that discouragement (C.1) is often connected to this phenomenon, as expressions of desperation (F.6) indicate strong negative sentiment when practitioners are stuck (F.15), sometimes after they already "googled wide and far"(58840818 )(F.6), "searching for days to find an answer"(43435227 )(F.6). Unhelpful gathered information (F.12), which is often referenced in Stack Overflow posts, only increases negative sentiment, and sometimes leads practitioners to identify unexpected behavior (F.1) of testing tools and libraries as weird or "strange behavior, because documentation says [that something should work. But:] Well, this is not happening."(63795587 ) (F.7) [19490583 , 26370705 ].
An explanation for this could be that documentation of testing tools and tutorials for beginners are more likely to focus on simple and standard use-cases [57609818 , 6475042 , 13309278 , 37527179 , 34889215 , 14701609 ]. Based on our anecdotal experience as software engineers using testing practices, we hypothesize that a divergence from best practices in both software design and development environment requires practitioners to rely on testing experience. In the context of highly inventive or original approaches, simple tutorials for testing are not applicable. It is very likely that more than one testing library is required in those complex non-standard software environments.

Complexity in Testing Practice
Before we set out to investigate what lies behind sentiment around software testing on Stack Overflow, we assumed that it would mostly be connected to tool failures or bugs. We expected to find sentimental complaints about specific (missing) features in a specific version of a library, for example. Our analysis shows, however, that negative sentiment is more likely caused by the struggle to overcome overwhelming complexity with methods or combinations of tools with which practitioners are not experienced enough.
Testing software can confront practitioners with misconceptions or flaws of their software projects. One practitioner asks: "Is this a valid unit test? If not, is it because I have bad design [...]? Because currently, I see absolutely no benefit in writing this test"(44202672 ) (F.5). Even though the majority of sentimental posts that we analyzed reveal discouragement and negativity as described in the preceding paragraphs, some authors maintain a constructive and even aspirational attitude (C.4), even when they are facing difficulties (F.15). We observe that positive posts rarely contain descriptions of unexpected behavior or expressions of desperation. Instead, even in difficult situations, practitioners express hope [59729159 , 1072952 , 34657563 , 53376098 , 41135403 ]. In a post by a practitioner looking for a way to test a WebAPI, they contemplate that "Back when WCF was the coolest thing, I did tests like this [...]. All programatically. It worked like a charm" (25325133 ) (F.16). Even though they experience difficulties (F.13), explaining that "for some reason [it] is REALLY hard to get to work (as in, I haven't succeeded yet)" (F.13), they do not seem to be discouraged and eventually find a solution that works for them. Another practitioner, searching for a way (F.9) to make their testing code cleaner, mentions that "in Katalon [there] is a very nice way to parameterize the selectors for GUI elements"(52539907 ) (F.16). Yet another practitioner judges enthusiastically (F.7) that "[validating the correctness of every component in their system is] obviously going to be quite a lot of work! It could take years, but for this kind of project it's worth it"(1006189 ) (F.7), also emphasizing that they already "have a very comprehensive unit-test suite" (F.7) and going so far as to define what they believe to be meaningful tests (F.10).
We find that a commonality of positive posts is a sign of confidence of practitioners, or a trust in tools or methods that is grounded in positive experience (F.16) [67709670 , 46177956 , 14961412 , 1072952 ]. We also identify that ambition (F.3) and aspiration (C.4) in positive posts are connected by practitioners to their long term goals. One practitioner contemplates that "the code works 'properly' [...] but [they] think automated tests would be good for the longevity of the program"(48113464 ) (F.7), and another reports that they are "starting a new project, that promises to be much bigger and more involved than anything [they] have done in the past."(6684337 ) (F.4), which motivates them to "keep a good workflow with [their] test and make sure [they are] not creating gaps in [their] testing as [they] go" (F.9). As indicated in Figure 8, it is experience and knowledge that give those practitioners an extra degree of trust and confidence, from which an aspirational attitude (C.4) towards testing seems to emerge. Their attitude enables them to reflect (C.3) on and explore (C.2) solutions for long term goals [4659714 ]. They build knowledge proactively without experiencing the setbacks that discouraged (C.1) practitioners report [57609818 ]. On the left side of Figure 8 we visualize that exploration (C.2) and reflection (C.3) contribute to building knowledge that will eventually allow practitioners to build trust and confidence. But, more crucially, as seen at the top of the figure, we indicate that it is the context in which the ambition to test arises that determines the sentiment towards testing when they engage in this process of building up knowledge. More concretely, when their environment and experience give them confidence and their ambition is grounded in an aspirational attitude, they remain positive [1006189 , 3340677 , 23062243 , 16938742 , 53657417 , 1072952 , 4659714 ].
But when their ambition to test emerges in situations when the complexity of their software projects begins to overwhelm them, the process of reflection (C.3) and exploration (C.2) is negative [37527179 , 67746901 , 58840818 , 4991264 , 7960832 , 25325133 , 6475042 , 18941509 ]. Testing is then perceived as an obstacle that might even push complexity further and not as something that is good for the future of a project.

Trust and Confidence - Degrees for Aspiration
Knowledge and experience in testing practices allow practitioners to aspire and enable them to consider and realize long term goals. They also enable them to reflect on their practice and explore new possibilities in a positive light. However, when exploration of and reflection on testing practices are motivated by pressure, for example an increase in the complexity of a project that has rendered manual testing impossible, their ambition might be abandoned. Testing then turns into yet another obstacle.

Discussion
The qualitative analysis of 200 Stack Overflow posts revealed many different facets of software testing to us. In this section, we revisit our research questions in the light of these observations, their implications, and the recommendations we draw from them. We then present threats to the validity of these findings and close the chapter by elaborating on future work that will open the next stage of our grounded theory research. Before revisiting our research questions and elaborating on future work, we want to turn the focus once more to the filtering process that yielded the dataset that was analyzed in this paper.

Semi-automated filtering of datasets for qualitative and quantitative research
To narrow down our qualitative analysis of the Stack Overflow dataset we have used a semi-automated two-step process. We first filtered the dataset using tags and then employed sentiment analysis tools to extract posts which contain sentimental expressions. We consider the first, tag-based filtering approach, which is inspired by Yang et al. [59], suitable for qualitative studies like ours. The low failure rate of the method in our case suggests that the approach is also suitable for quantitative studies of testing posts on Stack Overflow.
Regarding the second step, for which sentiment analysis tools were used, our evaluation is more differentiated. Our analysis supports previous observations by Lin et al. [34] and Sengupta and Haythornthwaite [53]: authors on Stack Overflow indeed tend to discuss technology in an objective, non-sentimental way. Our analysis of 25 randomly selected (only tag-filtered) posts indicates that authors who express sentiment when asking questions about testing topics on Stack Overflow express negative sentiment more often than positive sentiment. Out of those 25 posts, not a single one contained positive sentiment. In the light of those observations we argue that sentiment analysis indeed supported the goal to extract a subset of posts that contains both positive and negative sentiment. Deliberately extracting positive and negative sentimental posts provided an improvement in terms of balance in sentiment. In other words: a random selection would have provided only very few positive posts. However, we do not consider our approach applicable for quantitative studies where results and implications are directly discerned from the output of sentiment analysis tools. The sentiment predictions were simply not accurate enough to provide meaningful insights when only evaluating numbers. Posts predicted as positive or negative turned out to be correctly classified in only 50% of all cases (50 out of 100). In 5 cases the sentiment was even the opposite of what was predicted. We also learned that the sentiment analysis pipeline is most accurate in identifying neutral posts. Out of 25 samples that were predicted to be neutral only 2 contained sentiment. Depending on the research question, an approach to identify content with neutral sentiment could therefore yield good results. We identified that 28 posts of our dataset were too short for meaningful analysis. For studies similar to ours we recommend excluding short posts.
Posts are more likely to contain subjective opinions and valuable content, when they contain more than 2 paragraphs of text.
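The rates discussed above follow directly from the reported counts. The short Python sketch below only restates that arithmetic; the variable names are ours and purely illustrative, and no data beyond the counts given in the text is assumed.

```python
# Counts as reported in the text above.
predicted_sentimental = 100   # posts predicted as positive or negative
correct_sentimental = 50      # predictions confirmed by manual evaluation
opposite_polarity = 5         # predictions whose polarity was the opposite
predicted_neutral = 25        # posts predicted as neutral
neutral_with_sentiment = 2    # neutral predictions that contained sentiment


def share(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


sentimental_correct_rate = share(correct_sentimental, predicted_sentimental)
opposite_rate = share(opposite_polarity, predicted_sentimental)
neutral_correct_rate = share(
    predicted_neutral - neutral_with_sentiment, predicted_neutral
)

print(f"sentimental predictions confirmed: {sentimental_correct_rate:.0f}%")
print(f"opposite polarity: {opposite_rate:.0f}%")
print(f"neutral predictions confirmed: {neutral_correct_rate:.0f}%")
```

Under these counts, 23 of the 25 neutral predictions were confirmed, which is why the pipeline appears most reliable for neutral posts.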

How and why is sentiment expressed
We set out with our analysis of Stack Overflow posts to investigate how practitioners express sentiment in the context of software testing and which factors play a role when sentiment is expressed. We identified 22 codes which describe different expressions that are used by practitioners on Stack Overflow.

RQ1
How do software engineers express sentiment about testing on Stack Overflow?
In sentimental posts on Stack Overflow practitioners refer to external information like blogs or documentation, they reassure readers, share their ambition and subjective judgement of the value of testing practices and tools, compare different approaches, inquire about workarounds or new ways to solve a problem, admit their own lack of knowledge and their mistakes, reflect on experiences, contemplate failure and sometimes exclude solutions that could solve their issues. Sentiment is expressed when desperation, unexpected behavior, uncertainties, complex issues, missing capabilities, or a willingness to improve is described.
The categorization of posts has allowed us to take our analysis beyond the level of expressions. We developed the four major categories discouragement, exploration, reflection, and aspiration, which illuminate factors that can lead to sentimentality.

RQ2
Which factors affect sentiment of software engineers towards testing practices?
A lack of experience and knowledge, especially in complex environments, is often indicated in posts with negative sentiment on Stack Overflow, when practitioners describe discouraging experiences. Trust and confidence in practice and an understanding of long term goals, on the other hand, give practitioners space for aspiration, expressed with positive sentiment. Practitioners who explore testing express negative sentiment when they experience unexpected behavior and positive sentiment when they are inspired by resources like books and blog entries. When reflecting on their practice, an awareness of their own blind spots allows practitioners to be positive, while ambiguity, when practitioners are completely in the dark, is reflected negatively.
Going beyond this analysis which highlights factors that lead to sentiment, we presented a preliminary theory that suggests how those factors go hand in hand in manifesting sentiment around testing. The preliminary theory also describes situational elements that seem to lead to sentiment.

Preliminary Interpretive Theory
On Stack Overflow we see complexity and aspiration as important factors that make people ambitious about testing. Complexity of projects can make manual testing impossible and motivate (or force) practitioners to use testing. Trust and confidence in testing practices, on the other side, make people aspire to pursue long term goals using testing practices. In both cases experience and knowledge influence whether this ambition leads to a positive or negative experience.

Implications
The results of our analysis of Stack Overflow posts about software testing carry implications for the education of software developers and the management of software development teams. Based on the data we have seen, we hypothesize that the implementation of automated testing practices in simple projects, when manual testing is still possible, could allow an iterative development of testing skills while reducing the likelihood of discouraging experiences. Having obtained these skills, we argue, would then also influence the experience of testing complex systems in a positive way. Rejecting or confirming this hypothesis could help to clarify the role that teaching of software testing can have in the early stages of software engineering careers (e.g., in undergraduate courses of universities). Connected to this hypothesis, our preliminary theory suggests that on Stack Overflow, testing practices are perceived as especially valuable when the complexity of a software project grows. Refining and testing this theory in other contexts could generate new insights into how practitioners and students of software engineering can be motivated to learn software testing. Pham et al. [47] for example identify the same issue in a study with bachelor students. Their study confirms that the perception of the complexity of code affects students' motivation to practice testing. They also report that students see the cost of testing but fail to understand its benefit as projects are often not critical or complex enough. We hypothesize that (re-)introducing testing practices when complex software development methods are taught could teach students the value of software testing. Introducing testing practices like mocking in the context of distributed systems and socket programming is one example. Regarding managers of software engineering teams, our preliminary theory implies that giving employees time and space to develop simple test cases for simple projects is beneficial.
Being comfortable with simple test practices, practitioners seem to gain confidence and trust. As a recommendation that should be tested in future work, we suggest that the development process should allow a steady increase of complexity instead of tackling huge challenges directly. The words of one author reflecting on their work in a project where they introduced testing echo this last implication of our interpretation: "While I no longer work on this project [...], I think it gave me some enormous insight into how bad some projects can be written, and steps one developer can take to make things a lot cleaner, readable and just flat out better with small, incremental steps over time."(1064403 )

Threats to validity
Our systematic analysis of 200 Stack Overflow posts has led to insights that have enabled us to formulate preliminary hypotheses to answer our research questions and an interpretive theory. In this section we present the threats to the validity of our findings.

Internal Validity
To select samples from the Stack Overflow dataset we filtered using user-assigned tags and the sentiment analysis tools SentiCR and RoBERTa. The dataset from Lin et al. [34], which we used to train the tools, was evaluated by Zhang et al. [62], who report macro- and micro-averaged F1-scores of 0.59 and 0.82 for SentiCR and 0.80 and 0.90 for RoBERTa respectively. We combined both tools to reduce inaccuracy as suggested by Zhang et al. [62]. We only selected posts that were classified with the same sentiment polarity by both tools. We checked the accuracy of the filtering approach by including and evaluating two groups of test samples in our analysis (25 random and 25 neutral posts) and classifying the sentiment of each post. Even though the combined tools achieved a precision of only 50% for positive posts, we argue that the inaccuracy does not pose a threat to our results. The results presented in this paper were produced by deep and thorough qualitative analysis for which the sentiment analysis was only a tool to narrow down the focus. The accuracy has no direct influence on the results of our analysis. To avoid mistakes in the implementation of the sentiment analysis tools, we used the open-source implementation of SentiCR from the replication package of Zhang et al. [62] 10 , and the open-source library PyTorch 11 which provides an implementation of RoBERTa.
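The agreement rule described above (keep a post only if both tools assign it the same polarity) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the two classifiers below are hypothetical toy stand-ins and do not reflect the interfaces of SentiCR or RoBERTa, and the example posts are invented.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Label = str  # one of "positive", "negative", "neutral"
Classifier = Callable[[str], Label]


def consensus_label(text: str, first: Classifier, second: Classifier) -> Optional[Label]:
    """Return the shared label if both classifiers agree, otherwise None."""
    label_a, label_b = first(text), second(text)
    return label_a if label_a == label_b else None


def select_by_consensus(
    posts: Iterable[str],
    first: Classifier,
    second: Classifier,
    wanted: Tuple[Label, ...] = ("positive", "negative"),
) -> List[Tuple[str, Label]]:
    """Keep only posts on which both classifiers agree on a wanted polarity."""
    selected = []
    for post in posts:
        label = consensus_label(post, first, second)
        if label in wanted:
            selected.append((post, label))
    return selected


# Hypothetical toy classifiers, used only to make the sketch runnable.
def toy_classifier_a(text: str) -> Label:
    lowered = text.lower()
    if "love" in lowered or "great" in lowered:
        return "positive"
    if "stuck" in lowered or "desperate" in lowered:
        return "negative"
    return "neutral"


def toy_classifier_b(text: str) -> Label:
    lowered = text.lower()
    if "stuck" in lowered:
        return "negative"
    if "great" in lowered:
        return "positive"
    return "neutral"


example_posts = [
    "I am stuck with this mock setup",   # both say negative: kept
    "I love this framework",             # classifiers disagree: dropped
    "How do I run a single test?",       # both say neutral: not wanted
]
selected = select_by_consensus(example_posts, toy_classifier_a, toy_classifier_b)
print(selected)
```

Requiring agreement trades recall for precision: posts on which the tools disagree are discarded rather than guessed, which fits a setting where the classification only narrows down the material for manual analysis.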
To extract posts from the dataset that are relevant to software testing we extended an existing open-source tool 12 . With our extension of the tool we first filtered for all posts with a tag that includes the word testing. We then generated an include list of tags by manually removing all irrelevant tags that occurred in this subset of posts. Starting with a generic wild-card and then snowballing to generate a more accurate list of tags was found to be a valid method by Yang et al. [59]. Errors in the implementation of the filtering tool and mistakes during the manual selection of tags pose a possible threat to the validity of our results. To reduce the chance of implementation errors we only made minimal changes to the open-source software that was used for filtering. To minimize errors in the manual tag selection process, the final list was reviewed by two software engineering researchers who were otherwise not involved in this study.
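The wild-card and snowballing steps described above can be illustrated with a small Python sketch. It is not the open-source tool that was extended for the study; the function names, the post structure, and the example tags are assumptions chosen for illustration, and the manual review step is represented by a hand-written set of irrelevant tags.

```python
from typing import Any, Dict, Iterable, List, Set

Post = Dict[str, Any]  # assumed shape: {"id": int, "tags": [str, ...]}


def posts_with_keyword_tag(posts: Iterable[Post], keyword: str = "testing") -> List[Post]:
    """Wild-card step: keep posts with at least one tag containing the keyword."""
    return [p for p in posts if any(keyword in tag for tag in p["tags"])]


def candidate_tags(subset: Iterable[Post]) -> Set[str]:
    """Snowballing step: collect every tag that occurs in the seed subset."""
    tags: Set[str] = set()
    for post in subset:
        tags.update(post["tags"])
    return tags


def build_include_list(candidates: Set[str], irrelevant: Iterable[str]) -> List[str]:
    """Manual step: remove the tags a reviewer judged irrelevant."""
    return sorted(candidates - set(irrelevant))


def filter_by_include_list(posts: Iterable[Post], include_tags: Iterable[str]) -> List[Post]:
    """Final step: keep posts that carry at least one tag from the include list."""
    include = set(include_tags)
    return [p for p in posts if include.intersection(p["tags"])]


# Invented example data.
posts = [
    {"id": 1, "tags": ["unit-testing", "junit"]},
    {"id": 2, "tags": ["junit", "java"]},
    {"id": 3, "tags": ["java", "swing"]},
    {"id": 4, "tags": ["testing", "python"]},
]

seed = posts_with_keyword_tag(posts)
include_list = build_include_list(candidate_tags(seed), irrelevant={"python"})
relevant_ids = [p["id"] for p in filter_by_include_list(posts, include_list)]
print(include_list, relevant_ids)
```

In this toy data, post 2 carries no tag containing the word testing but is still selected through the co-occurring tag collected in the snowballing step, which is the effect the include list is meant to achieve.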

Experimenter Bias
We took measures to ensure that the influence of the authors' subjectiveness on the results of this paper stays within the boundaries of what is reasonable and expected in the context of a constructivist GT study. It is possible that the authors made mistakes in the interpretation of the dataset. To reduce the likelihood of a misinterpretation that would pose a threat to the validity of our results, the interpretation of the data recorded in memos and developed into codes, categories and theory was discussed between the first and second author. Disagreements were resolved in a cooperative manner. We do not provide a quantitative analysis of this process of reliability verification as such an analysis would suggest a level of objectivity that we do not want to claim [40]. Aligned with our epistemological stance and the interpretive nature of constructivist GT, we instead acknowledge our biased perspective. Rather than claiming a high level of absolute objectivity, we argue that taking the view from nowhere would not be appropriate to answer the research question that we propose. We therefore present a transparent account of the grounds on which our interpretation rests. We use pertinent quotes and provide references to original documents whenever we explain our interpretations. The reader is invited to go through all the references in the text and the analyzed material that we provide with our replication package. Inspection of the material should reveal to the reader that we only let the material speak for itself [56, coded-dataset.qdpx]. High involvement with the data, enabled by following the systematic strategies of constructivist GT, and not our preconceptions led to what we present in this paper.
We used sentiment analysis tools to filter the Stack Overflow dataset. This allowed us to narrow down the dataset to what is relevant for our study. To ensure that our own, manual evaluation of sentiments of posts and expressions is not biased by the outcome of this tool-based classification, documents were analyzed in random order and the results of the tools' classification were hidden during analysis.

External Validity
Qualitative research searches for a deep understanding of the particular. Knowledge generated from such research is context dependent. We therefore cannot claim that the preliminary result that our analysis produces has a high external validity that goes beyond the scope of the Stack Overflow community. Stack Overflow posts, which are non-interactive documents, cannot provide a full or thick description of sociological circumstances [20,24]. In other words: Stack Overflow posts only provided us a shallow view of the circumstances that practitioners experience; there are many things we are unable to see through an analysis of Stack Overflow posts. By sharing our preliminary interpretive theory we motivate more in-depth inquiries that either challenge the generalizability of what we have learned on Stack Overflow, or extend it to fit a broader context than the one we investigated. To broaden the context of the posts, we considered comments, edits, and links that are referred to in posts and evaluated posts' edit histories and the profiles of users that posted content. Further, the conclusions that allowed us to construct the results of this paper are based on the qualitative analysis of a small part of the full Stack Overflow dataset. As analyzing the full dataset is not feasible, we chose to focus our analysis on a fraction of sentimental posts. By not analyzing the whole dataset we risk missing details that could lead to different interpretations and hence different theories. We reduced this risk by consecutively adding posts to our analysis until we reached a point at which the analysis of further posts did not reveal any new answers to the research questions we pose. Our analysis concluded in this way after reviewing 200 posts.
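The stopping criterion described above, adding posts until further analysis reveals nothing new, can be pictured as a simple loop. The Python sketch below is only an illustration of that idea: in the study the coding was done manually, so the `assign_codes` callable is a hypothetical stand-in for a human coder, and the batch size and the `quiet_batches_needed` threshold are assumptions of ours.

```python
from typing import Callable, Iterable, List, Set, Tuple


def analyze_until_saturation(
    batches: Iterable[List[str]],
    assign_codes: Callable[[str], Set[str]],
    quiet_batches_needed: int = 1,
) -> Tuple[int, Set[str]]:
    """Analyze batches of posts until no batch contributes a new code.

    Returns the number of analyzed posts and the set of codes seen so far.
    """
    seen_codes: Set[str] = set()
    analyzed = 0
    quiet_batches = 0
    for batch in batches:
        found_new = False
        for post in batch:
            codes = set(assign_codes(post))
            if codes - seen_codes:
                found_new = True
            seen_codes |= codes
            analyzed += 1
        quiet_batches = 0 if found_new else quiet_batches + 1
        if quiet_batches >= quiet_batches_needed:
            break  # saturation reached: stop adding posts
    return analyzed, seen_codes


# Invented toy data; the lookup table stands in for manual coding.
toy_codes = {
    "p1": {"desperation"},
    "p2": {"ambition"},
    "p3": {"desperation"},
    "p4": {"ambition"},
    "p5": {"comparison"},
}
toy_batches = [["p1", "p2"], ["p3", "p4"], ["p5"]]

analyzed, seen = analyze_until_saturation(toy_batches, lambda post: toy_codes[post])
print(analyzed, sorted(seen))
```

In the toy data the loop stops after the second batch, so the code attached to the last post is never seen. This mirrors the risk acknowledged above: stopping at saturation cannot rule out that unanalyzed posts contain details that would lead to a different interpretation.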

Construct validity
We investigate the role of sentiment in software testing posts to learn about the experience of software developers when they practice software testing. We use sentimentality as a construct and proxy to analyze content that goes beyond technical discussions and touches on this experience. By analyzing sentimental Stack Overflow posts we infer interpretations about how sentiments come about and how they affect testing practices. The root causes for sentiment of practitioners are manifold and might be due to variables which we were not able to consider in our investigation. This poses a threat to the validity of our results. We reduced this threat by analyzing the data qualitatively, taking contextual information of posts like comments, edit history and the time it took for the question to be answered into account. We are therefore not only relying on sentimentality as a variable to understand what affects practitioners.

Future Work
The analysis described in this paper brought us closer to understanding what arouses sentiment in practitioners in the context of testing. However, as mentioned in the threats to external validity, the implications we present need to be taken with a grain of salt. Before suggesting which steps can be taken to raise our work to a higher level of maturity, we reflect on the limitations of the analysis presented in this paper.

Limitations
Stack Exchange, the parent website of Stack Overflow, provides insights about Stack Overflow by conducting an annual user survey. Their surveys' results and independent research about diversity on the platform reveal that the user base lacks diversity when it comes to ethnicity and gender [19]. In their own report it is stated that people of color are underrepresented among professional developers on Stack Overflow and that the company has considerable work to do to ensure the platform is inclusive 13 . According to Vadlamani and Baysal [57], and Zagalsky et al. [61] it is not only ethnicity and gender, but also professional factors that are strong reasons for (a lack of) engagement in the community. They lead to an expert-bias as novice contributors may even be confronted with subtle or overt bullying on Stack Overflow. Another bias is introduced through strict community guidelines 14 . During our investigation we were directly confronted with this limitation. Two posts that were rich in sentiment were closed because they violated the community guidelines. In one of those posts, the message posted by a moderator reads: "as it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion"(16938742 ). In the other post, an author who has "been banging [their] head against the wall trying to understand [...] concepts for a week"(2978843 ) simply suggested a "very understandable and simple" explanation so that others can also enjoy an 'aha' moment. Examples like these make it evident that practitioners cannot express themselves freely on Stack Overflow. When they post exclusively sentimental content or ask questions that provoke discussion, they are sanctioned. The aforementioned post also suggests another limitation: practitioners posting on Stack Overflow are biased towards negativity.
What is discussed on Stack Overflow are problems. If there is no problem to solve, the post is closed. Success stories or exclusively positive accounts of practitioners on Stack Overflow are therefore rare.

Theoretic sampling
Early stages in grounded theory are supposed to open up discussion and motivate focused inquiries to follow. Theories mature as they are refined and backed by collection and analysis of more data. In grounded theory, this crucial process is called theoretic sampling [16]. Apart from refining, verifying or rejecting our theory, such a focused collection of samples can answer questions that we derive directly from our analysis:
1. How does the social context of individuals affect sentiment of testers when they are exploring or reflecting experiences?
2. Which role does the experience of peers play in shaping the testing experience of individual practitioners?
3. How do practitioners express sentiment about testing in informal settings?
4. How do practitioners express sentiment online, when ambiguous and sentimental content which provokes discussion is not sanctioned but encouraged?
In order to investigate the above questions, we propose different approaches. Through a quantitative analysis of Stack Overflow, Alshangiti et al. [2] revealed that different challenges in the field of machine learning are present because implementing applications requires a wide set of skills. More concretely, they suggest that data preprocessing is especially challenging as it is often overlooked in the education of practitioners. A quantitative content analysis of testing posts on Stack Overflow, like the one of Alshangiti et al. [2], could identify aspects of testing that are difficult to handle for practitioners on a more technical level. Further qualitative studies of non-interactive documents from platforms like Reddit (/r/softwaretesting) or Twitter (#softwaretesting), which encourage sentimental and ambiguous content, can complement our analysis on a non-technical level. Conducting a meta analysis of publications on socio-technical aspects of software testing is another way of grounding our work in more theoretical and empirical data that others investigated in the past. But most crucially, we want to meet practitioners where they are confronted with testing practices. Field studies in which individuals or groups of practitioners are observed and interviewed during practice can provide insights that go beyond what non-interactive documents can reveal. Direct observations of practitioners will provide crucial insights into lived experience that allow the formulation of a mature theory.

Related Work
With our investigation of sentimental posts on Stack Overflow, the categorization of posts and the development of a preliminary theory we highlighted different aspects that influence motivation of practitioners, the effect of emotions on practice, and the role of software testing as a part of software development. In this section we relate our findings to what others have uncovered in relation to those topics.
A study by Graziotin et al. [22] emphasizes the detrimental effects that unhappiness can have on software engineering practitioners. Some of what they describe about what happens when developers are (un)happy is relevant to our paper. According to their report, developers distance themselves from tasks to which their unhappiness relates. Our analysis reveals that confrontation with testing can under some circumstances cause negative feelings of discouragement. Discouragement can thus lead to withdrawal from testing resulting in process deviation and reduced code quality. On the positive side, findings of Graziotin et al. [22] show that emotions related to happiness like aspiration increase process adherence and stimulate creativity, leading to a stronger commitment to writing tests.
A literature review by Beecham et al. [10] compares the findings of 92 papers about the topic of motivation of software engineers from the 1980s to 2006. The review highlights that software engineers display a very high need for growth and that they are concerned about learning new technology. Software engineers are motivated by the exploration of new techniques and want to work on identifiable pieces of quality work. According to the review, problem solving and the confrontation with challenges can be an enhancing factor for motivation. While those factors are present in many studies, the literature review concludes that the needs of software engineers are highly dependent on the context of individuals. Our study confirms this conclusion. Exploration can increase motivation or ambition in the case of software testing, but we indeed see that whether or not challenges or exploration lead to increased motivation highly depends on context. Contrary to the studies included in the review, we see that a confrontation with challenges can also lead to discouragement. Our results on this aspect are more aligned with the results of a qualitative study by Sharp et al. [54], which suggests that challenges, even when mentioned as a reason to stay in the job, are not so much a factor that gives practitioners satisfaction. Not challenges, but creativity and being able to make a difference are what make software engineering worthwhile [54]. Similarly, Meyer et al. [42] found that on good workdays, developers make progress and create value for projects they consider meaningful. On good days, they spend their time efficiently, with little administrative work and few infrastructure issues; what makes a workday typical and therefore good is primarily assessed by the match between developers' expectations and reality [42]. Two things here relate to our own findings.
First, we also find that practitioners who already identify testing as a good and meaningful practice, for example because they are motivated by books or blogs about testing, are indeed ambitious and aspirational about testing. Second, we also see that challenges created by infrastructure issues, for example in complicated development environments, lead to discouragement because of unexpected behavior. With a survey study conducted in multiple companies, Runeson [49] also found supporting evidence for the negative impact of unexpected challenges caused by complexity. A good integration of unit testing into the internal tool landscape that is provided by the company is key for the adoption of testing. However, this integration is especially hard when the modules under test interact with a complex system state or a complex system environment [49]. When the integration of testing into practice is too challenging, it is mostly perceived as de-motivating by software developers. In this context, Daka and Fraser [17] report that practitioners rank the isolation of testing code as one of the most challenging tasks. Crucially, it is perceived as a difficult challenge more often by novice software developers. We see the same in our investigation. Our analysis suggests that inexperienced practitioners are often discouraged from testing by complicated environments in which isolating the method under test becomes difficult. On the other hand, aligned with our results, Pham et al. [47] identified that novice developers adjust their testing effort according to the perceived complexity of code. A project has to be complex for testing to be considered beneficial [47]. Complexity can thus, as we saw on Stack Overflow as well, be a motivating factor. Pham et al. [47] and Daka and Fraser [17] also report that developers' feelings about unit testing are often negative.
Concretely, only half of the practitioners interviewed by Daka and Fraser [17] had positive feelings about testing, and students interviewed by Pham et al. [47] were not fond of testing because to them writing tests did not feel like an accomplishment. Some students even developed an anxious attitude towards testing. This aligns with our observation insofar as we saw an overwhelming number of negative posts in random samples. A general negative bias towards testing could therefore also be an explanation for the high number of negative posts that we saw in our dataset.
Sharp et al. [54] and Meyer et al. [42] find that meaningful contributions and being able to make a difference are important. However, from our own work it is not evident that testing in itself is always recognized as a meaningful contribution to projects by practitioners and their peers. Positive ambitions mentioned in posts on Stack Overflow mostly seem to be self-aroused, for example through engagement with inspiring resources like books or blogs. Daka and Fraser [17] indeed identified that peer pressure is only rarely mentioned as a motivating factor to write unit tests; the driving force for a developer to use unit testing is supposedly their own conviction.
Finally, Kasurinen et al. [31] investigated how new testing practices are adopted by companies and found that when confronted with new techniques that could improve testing processes, most companies are not interested in adoption if there is no first-hand knowledge in the team or company. Only rarely do they give new practices a try, and if they do, they only evaluate new techniques in small projects. However, Kasurinen et al. [31] also report that companies adopt new techniques when a clear need arises. According to the theory they propose in their study, development of processes only happens when the existing process obviously has a need to develop; required resources for the adoption of new practices like testing need to be justified. Our preliminary theory has this very point at its core. We observe on Stack Overflow that an increase in the complexity of a project leads to spontaneous adoption of testing practices. While it is not clear from the report of Kasurinen et al. [31] what motivates a company to evaluate testing practices in small projects, a suggestion could be taken from our own study. We suggest that evaluating techniques in small projects leads to an advantage when the need for those techniques can no longer be ignored. In other words, first-hand knowledge should develop in a company before it is really needed.

Conclusion
In this study we set out to understand the sentiments of software engineers regarding software testing in the context of the popular question and answer platform Stack Overflow. In order to do so, we have used a semi-automated approach to detect sentiment in Stack Overflow posts. In particular, we start out by using automatic sentiment analysis tools to classify posts, after which we perform an in-depth, qualitative analysis.
Through this in-depth study of 200 posts we find that developers are in fact sentimental about software testing on Stack Overflow; we find that they express their sentiment when unexpected behavior, uncertainties, complex issues, missing capabilities, or a willingness to improve is part of the post. Additionally, we have observed that a lack of experience and knowledge, especially in complex environments, can lead to negative sentiment. On the other hand, software engineers express positive sentiment when they have trust and confidence in their practice, especially if they have an understanding of the long-term goals of their projects.
Through the observations that we have made, we construct a preliminary interpretive theory that explains how a project's complexity and the tacit knowledge of individuals shape the experience and attitude of practitioners in the context of software testing. Practitioners, we argue, get motivated to practice software testing as the complexity of their project increases. Reaching that point without enough knowledge of testing practices leads to discouraging experiences. We argue that testing practices are also seen by practitioners as something to aspire to, especially when considered, for example, in the context of long-term goals. This has implications for both the education of software engineers and the management of software development teams that engineer complex software. Our findings suggest that taking both motivation and complexity into account in future studies of software testing practices can reveal more about practitioners' sentimental perspectives. Our preliminary results show that an investigation of the motivation and capabilities of software engineers to engage in effective testing practices needs to go beyond the analysis of technical tools and their usage.
We acknowledge that we need to extend and deepen our interpretive theory and our overall understanding of software engineers' sentiments towards testing. In particular, in our future work we envision studying the social context and its relation to sentiment, the connection to the experience levels of software engineers, their sentimental expressions in informal settings, and finally how project management culture influences the attitudes and motivation of individual software engineers in the area of testing.