1 Introduction

An important aspect of research on human aspects in software engineering (SE) is human cognition, which concerns human thought processes. When a thought process is based on illogical reasoning, it might negatively impact the task being performed (Calikli and Bener 2014). Formally, this phenomenon is known as cognitive bias and is defined as “...cognitions or mental behaviours that prejudice decision quality in a significant number of decisions for a significant number of people” (Arnott 2006, p. 59). Since the introduction of the concept by Tversky and Kahneman (1973), a considerable amount of work has been carried out to investigate cognitive biases in psychology (Gilovich et al. 2002).

Confirmation bias, a type of cognitive bias, is the tendency to seek confirmatory evidence rather than disconfirming information or evidence that refutes prior beliefs (Arnott 2006). The term confirmation bias was first used by P.C. Wason in his rule discovery experiment (Wason 1960). In the taxonomy proposed by Arnott (2006), confirmation bias belongs to the category of confidence biases. From Nickerson’s point of view, confirmation bias is among the top reasons for problematic human reasoning (Nickerson 1998). Multiple psychological studies have investigated and confirmed the presence of confirmation bias, attributed it to several antecedents, and documented the strength of their effects (Nickerson 1998; Ask and Granhag 2007; Hernandez and Preston 2013). For example, psychological studies have explored and documented time pressure as an antecedent to the manifestation of confirmation bias: time pressure caused criminal investigators to stick to their initial beliefs while evaluating witnesses’ testimonies (Ask and Granhag 2007) and jurors to do so while deciding on the verdict (Hernandez and Preston 2013), which affected their evaluations. In the software organisational context, time pressure is a substantial and unavoidable phenomenon that is usually perceived to have a negative influence, e.g. it causes developers to take shortcuts (Mäntylä et al. 2014). SE studies investigating the effects of time pressure report both positive and negative effects: time pressure was identified to demotivate software process improvement (Baddoo and Hall 2003), suspected to deteriorate quality in the development cycle and in global software testing (Wilson and Hall 1998; Shah et al. 2014), and found to improve the efficiency of requirements reviews and test case development (Mäntylä et al. 2014).

Software engineering is also subject to the effects of cognitive biases, especially confirmation bias (Stacy and MacMillan 1995). In this paper, we limit the scope of the investigation to functional software testing. Confirmation bias among software testers is considered to manifest as an inclination towards designing test cases that confirm the correct functioning of an application rather than scenarios that break the code (Calikli and Bener 2014; Teasley et al. 1994; Salman 2016). In this regard, limited empirical evidence in the SE literature reports that the manifestation of confirmation bias in software testing results in deteriorated software quality, e.g. a higher defect rate (Calikli and Bener 2010) and an increased number of production defects (Calikli et al. 2013).

Despite the recognition of the effects of time pressure and confirmation bias, a research gap exists in studying time pressure as an antecedent to confirmation bias in software testing. It is a combination that may further degrade software quality and deserves empirical attention, given the evidence available from the psychology discipline (Ask and Granhag 2007; Hernandez and Preston 2013). According to Mäntylä et al. (2014) and Kuutila et al. (2017), work on the effects of time pressure in software testing is still scarce and requires further investigation. In addition, the motivation for studying time pressure as a potential antecedent to cognitive biases is based on the results of our systematic mapping study (SMS) on cognitive biases in SE (Mohanani et al. 2018).

Therefore, our goal is to examine the manifestation of confirmatory behaviour by software testers and the role of time pressure in promoting it. Specifically, the purpose of this study is to investigate whether testers exhibit confirmatory behaviour when designing functional test cases and how this behaviour is impacted by time pressure. We achieve this goal by conducting a controlled experiment with graduate students in which we observe confirmation bias in the design of functional test cases in terms of their consistency and inconsistency with the requirements specification provided. The design of our experiment is one factor with two levels, i.e. time pressure vs no time pressure. Our results confirm previous research findings: participants designed considerably more confirmatory test cases than disconfirmatory ones. We also observed that time pressure did not increase the rate of confirmatory behaviour in the group working under it.

This study enhances the body of SE knowledge through the following contributions:

  • An experimental design for studying the effects of time pressure on confirmation bias, which is novel in SE within the context of software testing.

  • A new perspective when compared to earlier work in defining the (dis)confirmatory nature of tests, i.e. consistent (c) and inconsistent (ic) test cases. The change in terminology and the need for it are elaborated in Section 3.2.

  • Empirical evidence in support of confirmatory behaviour in functional test case design.

  • Empirical evidence suggesting that time pressure may not increase the rate of confirmatory behaviour in functional test case design.

Section 2 presents the related work on time pressure and confirmation bias. In Section 3 the research method and experiment planning are explained in detail, which is followed by details regarding the experiment execution in Section 4. The results are presented in Section 5 and discussed in Section 6. In Section 7, we describe the threats to validity. Finally, Section 8 concludes our work and discusses possible future extensions.

2 Related Work

In this section, we present a brief overview of the SE and psychology literature on confirmation bias, its detection/measurement, and time pressure.

2.1 Confirmation Bias in Software Testing

The studies reviewed in this section were found as a result of our SMS (Mohanani et al. 2018). Most work on cognitive biases in the context of software quality and testing focuses on confirmation bias (Mohanani et al. 2018; Salman 2016). Related terms are positive test bias and positive test strategy, found in the works of Teasley et al. (1994) and Leventhal et al. (1994). Leventhal et al. (1994) refer to the work of Klayman and Ha (1989) and mention that the phenomenon of positive test bias is also called confirmation bias. By definition, the more this bias is present among testers, the more it negatively affects testing (Calikli and Bener 2013, 2014; Leventhal et al. 1994).

In 1994, Teasley et al. conducted an experimental study to investigate the effects of positive test strategy on functional software testing. They also investigated the impacts of expertise level and the level of detail of the specifications on positive test strategy. They found that the use of a positive test strategy decreased with a higher expertise level, but the influence of the completeness of the specifications remained inconclusive because only one experiment showed support for it. In another study by the same group (Leventhal et al. 1994), an additional factor (error feedback) was studied along with the previous factors and with the same objective. The results of the study revealed that complete specifications, as well as higher expertise, may aid in mitigating positive test bias. The effect of error feedback, however, remained inconclusive, which the authors linked to the types of software used in the studies (Leventhal et al. 1994). In this family of experiments, Teasley et al. (1994) and Leventhal et al. (1994) recruited senior-level and graduate students in computer science to represent advanced testers.

With the aim of finding evidence for positive test bias, Causevic et al. (2013) conducted a test-driven development (TDD) experiment in industry. The authors found a significant difference between the numbers of positive and negative test cases developed by the participants. In addition, negative test cases were found to be more likely to detect defects than positive test cases. The authors measured the quality of the test cases through the quality of the code of each test case and observed differences in quality between the negative and positive test cases (Causevic et al. 2013).

In recent years, multiple studies have been conducted by Calikli and Bener to explore several factors that might affect confirmation bias (Calikli and Bener 2010, 2014; Calikli et al. 2010a, 2010b). For instance, in a preliminary analysis, the authors studied company culture, education and experience as factors and found that only company culture affected confirmation bias levels (Calikli et al. 2010b). Later, Calikli and Bener (2010) studied the effects of experience and activeness (in development and testing), along with the logical reasoning skills gained through education, on confirmation bias. In 2014, they extended their set of factors to include job title (developer, tester, analyst, and researcher), education, development methods and company size (small and medium enterprises [SME] vs large-scale companies) (Calikli and Bener 2014). Both studies showed no effect of either development methods or experience in development and testing on confirmation bias, whereas strong logical reasoning skills were associated with low bias levels (Calikli and Bener 2010, 2014). Low confirmation bias levels were also observed in participants who were experienced but inactive in development or testing (Calikli and Bener 2010). No effect of company size, educational background (undergraduate) or educational level (bachelor’s, master’s) was observed (Calikli and Bener 2014). However, the job title of researcher was associated with significantly lower levels of confirmation bias compared to the other job titles, possibly due to researchers’ acquired critical and analytical skills (Calikli and Bener 2014).

2.2 Measurement of Confirmation Bias

The first approach identified in the software testing literature uses test artefacts to measure the manifestation of confirmation bias (Teasley et al. 1994; Leventhal et al. 1994). These studies measured positive test bias by mapping it onto functional software testing as follows: if testing is done with the data provided in the specifications, it is done in a hypothesis-consistent way, whereas if testing is done with data outside the specifications, it is done in a hypothesis-inconsistent way. Leventhal et al. (1994) used equivalence classes and boundary conditions to classify test cases as positive or negative. According to Leventhal et al. (1994), positive test cases deal with valid equivalence classes and negative test cases with invalid equivalence classes. Causevic et al. (2013), on the other hand, declared a test case to be negative if it exercised a program in a way that was not explicitly declared in the requirements specifications; test cases testing the implicitly forbidden behaviour of the program were also considered negative.

The second approach is explained in detail by Calikli et al. (2010a), who derived their measurements from psychological instruments based on the work of Wason (1960). Rather than measuring any testing artefact produced by the participants, the authors used Wason’s Rule Discovery and Selection Tasks, after some modifications, to assess how people think. For Wason’s Selection Task, the questions related to the software domain required the analysis of a software problem independent of programming tools and environments (Calikli and Bener 2010; Calikli et al. 2010b).

2.3 Time Pressure

This section focuses on the factor of time pressure from two perspectives. First, we present SE studies that have considered time pressure in different SE contexts. Then, we shift the focus to the psychology studies that motivate our work.

2.3.1 Software Engineering

Table 1 presents the studies that examined or observed time pressure (TP) in different contexts of SE. All the studies listed in the table are selected from two recent studies by Mäntylä et al. (2014) and Kuutila et al. (2017) based on the criterion that a study belongs to the SE domain.

Table 1 Time pressure (TP) studies in SE in chronological order

The first column in Table 1 names the study and lists its type, i.e. either qualitative or quantitative. The second column describes the SE context and the objective of the study, whereas the third column presents the results of the respective study from the perspective of the time pressure factor. In the table, five out of twelve studies are related to software quality and software testing and four are related to software development. Three studies in Table 1 reported that time pressure did not have a negative effect on, e.g., task performance (Topi et al. 2005). The studies in Table 1 also report positive effects of time pressure, e.g. improved effectiveness of the participants when not working as individual testers (Mäntylä and Itkonen 2013). Finally, time pressure was found to be a negative factor in the contexts investigated by Baddoo and Hall (2003) and Wilson and Hall (1998).

2.3.2 Psychology

This section elaborates on confirmation bias as a psychological phenomenon and on how psychology literature has operationalised time pressure to study its influence on confirmatory and motivational behaviours. The studies in this section report time pressure as a factor that negatively impacted the process/quality by influencing the manifestation of confirmation bias in the studied contexts. Of particular importance to our work, Ask and Granhag (2007) reported that the tendency of people to rely on prior beliefs rather than on external information increased under time pressure. Hence, we aim to take an interdisciplinary perspective.

Hulland and Kleinmuntz (1994) experimentally examined whether greater time pressure causes decision-makers to rely on summary evaluations retrieved from memory, as opposed to information from the external environment, as a way to avoid effort expenditure. The authors observed changes in the search and evaluation process, but they did not observe greater reliance on summary evaluations. Hernandez and Preston (2013) carried out a study in the context of juror decision making that examined the effect of difficulty in processing available information (disfluency) on confirmation bias. They paired disfluency with other cognitive loads, including time pressure, and found that in the presence of both disfluency and time pressure, confirmation bias could not be overcome. The authors applied time pressure by restricting the submission of the verdict to a prespecified time frame. Another study in the same field (investigative psychology) examined the behaviour of criminal investigators, who evaluated witnesses’ testimonies as either confirming or disconfirming the central hypothesis (Ask and Granhag 2007). The results of the study revealed that participants under high time pressure were more likely to stick with their initial beliefs and were less affected by subsequent evidence. The authors created high time pressure by orally informing the participants that they had a limited amount of time to complete the task. In addition to this oral instruction, the participants were encouraged to complete the task faster and were informed when five minutes remained.

The manifestation of confirmation bias under time pressure in investigative work can be related to its manifestation in functional software testing. For example, a juror reading a case description in which the defendant’s case is objectively described is more likely to reach a guilty verdict when the same juror previously read a psychologist’s testimony highlighting the negative aspects of the defendant’s personality profile (Hernandez and Preston 2013). In this situation, the juror manifests confirmation bias in reaching the guilty verdict, which was tentatively formed (i.e. the belief the juror is biased towards) after reading the psychologist’s testimony. Similarly, a software tester, preconditioned on what is provided in the specifications, may manifest a specification-confirming behaviour in testing. Under time pressure, the tester is more likely to follow this confirming attitude and may not necessarily consider, or give priority to, testing other scenarios that may occur (e.g. corner cases) but are not specified in the requirements.

2.4 Research Gap

Previous work in the SE domain is limited to investigations of the effects of time pressure from productivity and quality perspectives. In other words, none of the past studies explored the impact of time pressure on the artefacts produced by software engineers as a consequence of a certain cognitive bias; thus, they have neglected the human factors. Additionally, the studies that do consider human aspects of software testing, particularly cognitive biases, have not investigated time pressure as an antecedent. In comparison to the psychological studies presented above, investigating the effect of time pressure on confirmation bias in an engineering discipline differentiates our study.

Consequently, to bridge this gap, we examine the manifestation of confirmatory behaviour and investigate the effect of time pressure on the manifestation of confirmation bias in the artefacts prepared by software testers. Instead of measuring confirmation bias as in the existing literature, our approach is to detect it from the produced artefacts (explained in Section 3.2). Furthermore, we use experimentation, which is method-wise similar to the research approaches taken by, e.g., Hernandez and Preston (2013), Mäntylä et al. (2014) and Teasley et al. (1994). To summarise, our study is novel in that it investigates the effect of time pressure on confirmation bias in software testing.

3 Research Method

In this section, we present the details of our experiment from the perspective of definition and planning - the first two stages of the experiment process (Wohlin et al. 2000). To enable further replications, we share the experimental protocol and scripts at: https://doi.org/10.5281/zenodo.1193955.

3.1 Goal

Our objective is to examine whether testers manifest confirmation bias (leading to a confirmatory attitude) during testing and whether time pressure promotes the manifestation of confirmation bias. The aim of our research, according to Goal-Question-Metric (Basili et al. 1994), is as follows:

Analyse the functional test cases
for the purpose of examining the effects of time pressure
with respect to confirmation bias
from the point of view of researchers
in the context of an experiment run with graduate students (as proxies for novice professionals) in an academic setting.

Consequently, the research questions of our study are:

RQ1: Do testers exhibit confirmatory behaviour when designing functional test cases?

RQ2: How does time pressure impact the confirmatory behaviour of testers when designing functional test cases?

3.1.1 Context

We study the manifestation of confirmation bias in functional software testing and examine how time pressure promotes confirmation bias manifestation in the same context. The phenomenon is studied in a controlled experiment in an academic setting with first-year master’s degree students, enrolled in the Software Quality and Testing course at the University of Oulu, as proxies for novice professionals. We limit our investigation to functional (black box) testing, which was part of the curriculum of the aforementioned course. For the purposes of the experiment, we focus only on the design of functional test cases, not their execution. We aim for an implementation-independent investigation of the phenomenon, since we are interested in studying the mental approach of the study participants in designing test cases, which precedes their execution. Moreover, in system, integration and acceptance testing, software testers (by job role/title) design test cases using the requirements specifications before the code actually exists. Therefore, our scope is limited to determining the type (i.e. consistent or inconsistent with the specifications) of functional tests designed by the participants, rather than their execution or fault-detection performance. We use a realistic object for the task of designing functional test cases under the time pressure and no time pressure conditions.

3.2 Variables

This section elaborates on the independent and dependent variables of our study.

3.2.1 Independent Variable

The independent variable of our study is time pressure with two levels: time pressure (TP) and no time pressure (NTP). To decide on the duration for the two levels, we executed a pilot run (explained in detail in Section 3.8) with five postgraduate students. It took 45 min, on average, for the pilot participants to complete the task. Accordingly, we decided to allocate 30 min for the TP group and 60 min for the NTP group to operationalise time pressure.

The timing of the task was announced differently to the experimental groups. The experimenter reminded the participants in the TP group thrice of the remaining time; the first reminder was after fifteen minutes had elapsed and the rest of the reminders were given every five minutes thereafter. This was done to psychologically build time pressure. In contrast, after the initial announcement of the given duration to the NTP group, no further time reminders were made. This is in line with how Hernandez and Preston (2013) and Ask and Granhag (2007) operationalised time pressure in their studies.

3.2.2 Dependent Variables

Our study includes three dependent variables: c (the number of consistent test cases), ic (the number of inconsistent test cases) and temporal demand. We define these dependent variables as follows:

Consistent test case: A consistent test case tests strictly according to what has been specified in the requirements, i.e. it is consistent with the specified behaviour. In the context of testing, this refers to: 1) the defined program behaviour on a certain input; and 2) the defined behaviour for a specified invalid input. Example: If the specifications state, “… the phone number field does not accept alphabetic characters...”, a test case designed to validate that the phone number field does not accept alphabetic characters is considered a consistent test case.

Inconsistent test case: An inconsistent test case tests a scenario or data input that is not explicitly specified in the requirements. We also consider test cases that reflect outside-of-the-box thinking on the tester’s part as inconsistent. Example: If the specifications only state, “… the phone number field accepts digits...”, and the application’s behaviour for other types of input for that field is not specified, then the following test case is considered inconsistent: the phone number field accepts only the + sign from the set of special characters (e.g. to set an international call prefix).

In contrast to Leventhal et al. (1994), we do not consider a test case validating an input from an invalid equivalence class as inconsistent, as long as it is specified in the requirements. On the contrary, we consider it consistent, because the tester has exhibited a confirmatory behaviour by conforming to what s/he has been informed to validate. If test cases are classified using our consistent and inconsistent definitions and Causevic et al.’s (2013) positive and negative definitions, the results might be the same. However, the outside-of-the-box thinking is an additional aspect of our definition of inconsistent, which considers the completeness of the requirements specification in the light of the context/domain. Unlike Calikli et al. (2010a), we do not utilise any tests from psychology to measure confirmation bias. Instead, we detect its manifestation by analysing the test artefacts of the participants and do not directly observe how people think or what their thinking inclinations, in general, are.

We therefore introduce the terms consistent and inconsistent in order to distinguish our concept from previous forms of measuring confirmation bias, which are based on contradictory understandings of what constitutes a positive and a negative test case. For example, Leventhal et al. (1994) and Causevic et al. (2013) use the same terminology but measure confirmation bias differently. We believe that our proposed terminology is more straightforward to comprehend than the potentially ambiguous positive/negative terminology.

Temporal demand: We use the NASA Task Load Index (NASA-TLX) as a measure of task difficulty as perceived by the participants. We apply the same definition of temporal demand as in NASA-TLX, i.e. the degree of time pressure felt due to the pace or tempo at which events take place (Human Performance Research Group, NASA Ames Research Center 1987). Therefore, it captures the time pressure perceived by the participants in the experimental groups.

3.3 Data Extraction and Metrics

This section elaborates on the data extraction and the metrics defined for capturing confirmation bias and temporal demand.

3.3.1 Proxy Measure of Confirmation Bias

We mark the functional test cases designed by the participants as either consistent (c) or inconsistent (ic). To detect the bias of participants through a proxy measure, we derive a scalar parameter based on (c) and (ic) test cases designed by the participants and the total count of (all possible) consistent (C) and inconsistent (IC) test cases for the given specification:

$$z = c/C - ic/IC$$

  • if z > 0: the participant has designed relatively more consistent test cases

  • if z < 0: the participant has designed relatively more inconsistent test cases

  • if z = 0: the participant has designed a relatively equal number of consistent and inconsistent test cases.

The value of z is a rate of change in relative terms: it is the difference between consistent test case coverage and inconsistent test case coverage. It indicates one of the above three conditions within the range [−1, +1]. In terms of confirmation bias detection, z = 0 means the absence of confirmation bias. A value of +1 indicates the maximum manifestation of confirmation bias, because only consistent test cases were designed, with complete coverage, whereas −1 indicates that only inconsistent test cases were designed, with complete coverage. We should note that although −1 is an unusual case, indicating that no consistent test cases were designed at all, it depicts a situation in which no bias has manifested. This case is unlikely to occur in practice because it would mean that no test case validating the specified behaviour of the application was designed. For the purposes of measurement, we predesigned a complete set of test cases (consistent and inconsistent) based on our expertise, from a researcher’s perspective. Our designed set comprised 18 consistent (C) and 37 inconsistent (IC) test cases. The total numbers of (in)consistent test cases are defined in absolute terms in order to be able to compare and perform an analysis in relation to a (heuristic) baseline. In order to enhance the validity of the measures, we extended our set of predesigned test cases after the experiment with the valid test cases (consistent and inconsistent) designed by the participants that were missing from our set. This improvement step is in line with Mäntylä et al. (2014), who mention that it helps to improve the validity of the results. As a result, our final test set includes 18 (C) and 50 (IC) test cases in total.
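For illustration, the following minimal Python sketch computes z for a single participant. The participant counts and the function name rate_of_change are hypothetical; C and IC correspond to the final baseline described above.

```python
# Illustrative computation of the proxy measure z (hypothetical data).
C, IC = 18, 50   # all possible consistent / inconsistent test cases (final baseline)

def rate_of_change(c: int, ic: int) -> float:
    """z = c/C - ic/IC: the difference between consistent and inconsistent
    test case coverage, bounded by [-1, +1]."""
    return c / C - ic / IC

# Hypothetical participant with 9 consistent and 2 inconsistent test cases:
z = rate_of_change(9, 2)   # 0.50 - 0.04 = 0.46 -> relatively more consistent coverage
print(z)
```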

3.3.2 Temporal Demand

In order to capture temporal demand (TD), we used the values of the rating scale marked by the participants on the NASA-TLX sheets. The scale ranges from 0 to 100, i.e. from low to high temporal demand perceived for the task (Human Performance Research Group, NASA Ames Research Center 1987).

3.4 Hypothesis Formulation

According to the goals of our study, we formulate the following hypotheses:

H1: Testers design more consistent test cases than inconsistent test cases.

$$H1_{A}: \mu (c) > \mu (ic) $$

and the corresponding null hypothesis is:

$$H1_{0}: \mu (c)\leq \mu (ic) $$

H1’s directional nature is attributed to the findings of Teasley et al. (1994) and Causevic et al. (2013). Their experiments revealed the presence of positive test bias in software testing.

As an effect of time pressure on consistent and inconsistent test cases, our second hypothesis postulates the following:

H2: (Dis)confirmatory behaviour in software testing differs between testers under time pressure and under no time pressure.

$$H2_{A}: \mu ([c, ic]_{TP})\neq \mu ([c, ic]_{NTP}) $$
$$H2_{0}: \mu ([c, ic]_{TP}) = \mu ([c, ic]_{NTP}) $$

While H2 makes a comparison in absolute terms, the third hypothesis considers the effect of time pressure on confirmation bias in relative terms, i.e. via the coverage-based measure z. Accordingly, H3 states:

H3: Testers under time pressure manifest relatively more confirmation bias than testers under no time pressure.

$$H3_{A}: \mu (z_{TP}) > \mu (z_{NTP}) $$
$$H3_{0}: \mu (z_{TP})\leq \mu (z_{NTP}) $$

The directional nature of H3 is based on the evidence from the psychology literature in which time pressure was observed to increase confirmation bias in the studied contexts (Ask and Granhag 2007; Hernandez and Preston 2013).

To validate the manipulation of the levels of the independent variable (time pressure vs no time pressure), we formulate a post-hoc sanity check hypothesis that postulates:

H4: Testers under time pressure experience more temporal demand than testers under no time pressure.

$$H4_{A}: \mu (TD_{TP}) > \mu (TD_{NTP}) $$
$$H4_{0}: \mu (TD_{TP})\leq \mu (TD_{NTP}) $$

3.5 Design

We chose a one-factor, two-level, between-subjects experimental design for its simplicity and its adequacy for investigating the phenomenon of interest, as opposed to alternative designs. Table 2 shows the design of the experiment, where ES stands for experimental session and TP and NTP stand for the time pressure and no time pressure groups, respectively. In addition, this design is preferable as it does not introduce a confounding task-treatment interaction, and it enables the investigation of the effects of the treatment and control on the same object in parallel sessions.

Table 2 Experimental design

3.6 Participants

We employed convenience sampling in order to draw from the target population. The participants were first-year graduate-level (master’s) students registered in the Software Quality and Testing course offered as part of an international graduate degree programme at the University of Oulu, Finland, in 2015. The students provided written consent for the inclusion of their data in the experiment. All students were offered this exercise as a non-graded class activity, regardless of their consent. However, we encouraged the students to participate in the experiment by offering them bonus marks as an incentive. This incentive was announced in the introductory lecture of the course at the beginning of the term. In the data reduction step, we dropped the data of those who did not consent to participate in the experiment. This resulted in a total of 43 experimental participants.

Figure 1 presents a clustered bar chart showing the academic and industrial experience of the 43 participants in software development and testing. Along the y-axis are the percentages depicting the participants’ range of experience in the categories presented along the x-axis. The experience categories are: Academic Development Experience (ADE), Academic Testing Experience (ATE), Industrial Development Experience (IDE) and Industrial Testing Experience (ITE). The four experience range categories are less than 6 months, between 6 months and one year, between 1 and 3 years, and more than 3 years.

Fig. 1 Experience of participants

More than 80% of the participants have less than 6 months of testing experience in both academia and industry, which is equivalent to almost no testing experience. This indicates that our participants have much less experience in testing than in development. The second-highest percentages fall in the 1 to 3 years range, except for ITE. The pre-questionnaire data also show that 40% of the participants have industrial experience, i.e. more than 6 months. Thirty-two percent reported their industrial experience in development and testing based on developer or tester roles, rather than considering testing as part of a development activity.

Considering our participants’ experience and the degree in which they are enrolled, we can categorise them as proxies for novice professionals according to Salman et al. (2015).

3.6.1 Training

Before the actual experimental session began, the students enrolled in the course were trained on functional software testing, as part of their curriculum, in multiple sessions. They were taught functional (black box) testing over two lectures. In addition, one lecture was reserved for an in-class exercise in which the students were trained for the experimental session using the same material but a different object (requirements specification) to gain familiarity with the setup. Specifically, the in-class exercise consisted of designing functional test cases from a requirements specification document using the test case design template that was later used in the actual experiment. However, unlike in the actual experimental session, the students were not provided with a supporting screenshot of the requirements, which was available during the experimental sessions. One of the authors discussed the students’ test cases in the same training session to give feedback. In every lecture, we specifically taught and encouraged the students to think up inconsistent test cases. Table 3 shows the sequence of these lectures with the major content related to the experiment. We sequenced and scheduled the lectures and the in-class exercise to facilitate the experiment.

Table 3 Training sequence

3.7 Experimental Materials

The instrumentation that we developed to conduct the experiment consisted of a pre-questionnaire, the requirements specification document with a supporting screenshot, the test case design template and a post-questionnaire. Filling in the test case design template and the post-questionnaire were both pen-and-paper activities, whereas the pre-questionnaire was administered online. The experimental package consisting of all the above-mentioned materials is available from the URL in Footnote 1.

3.7.1 Pre/Post-questionnaires

Having background information on the participants aids in their characterisation. The pre-questionnaire was designed using an online utility and collected information regarding the participants’ gender, educational background, and academic and industrial development and testing experience. The academic experience questions collected information on the development and testing performed by the participants as part of their degree programme courses. The industrial experience questions collected information on the testing and development performed by the participants in different roles (as a developer, tester, designer, manager or other).

We used hardcopy NASA-TLX scales to collect post-questionnaire data for two reasons. First, it is a well-known instrument for measuring task difficulty as perceived by the task performer (Mäntylä et al. 2014). Second, one of the load attributes with which task difficulty is measured, i.e. temporal demand, captures the time pressure experienced by the person performing the task. In this respect, adopting NASA-TLX was useful because it also aided us in assessing (via H4) how well we administered and manipulated our independent variable, i.e. time pressure, in terms of the temporal demand felt by the participants.

3.7.2 Experimental Object

We used a single object appropriate for the experimental design. The task of the participants was to design test cases for the requirements specification document of the experimental object, MusicFone, which has been used in several reported experiments as a means of simulating a realistic programming task, e.g. by Fucci et al. (2015). MusicFone is a GPS-based music playing application. It generates a list of recommended artists based on the artist currently being played. It also displays the artists’ upcoming concerts and allows the user to plan an itinerary based on their selection of artists and their GPS location. Hence, choosing MusicFone as the object is an attempt to mitigate one of the non-realism factors (a non-realistic task) of the experiment, as suggested by Sjoberg et al. (2002).

MusicFone’s requirements specification document was originally intended for a lengthy programming-oriented experiment (Fucci et al. 2015). In order to address our experimental goals and to fit the available experiment execution time, we modified the requirements specification document so that it could be used for designing test cases within the available time frame. Leventhal et al. (1994) defined three levels of specifications: minimal (features are sketched only), positive only (program actions on valid inputs) and positive/negative (program actions on valid and any invalid input). If we relate the completeness of our object’s (MusicFone) specifications to Leventhal et al.’s (1994) levels, it is closest to a positive-only specification document. We cannot classify MusicFone into the third category because the specifications stated the required behaviours but contained little information on handling the non-required behaviour of the application, thus qualifying our object as having a realistic level of specifications (Davis et al. 1993; Albayrak et al. 2009).

In addition to the requirements specification document, we also provided a screenshot of the UI of a working version of the MusicFone application to serve as a conceptual prototype and enable a better understanding of the application. We ensured that the screenshot of the developed UI was consistent with the provided requirements specification, because the presence of errors, or feedback from errors, could affect the testing behaviour in terms of positive test bias (Teasley et al. 1994). In other words, if a tester finds an error, s/he might start looking for more similar errors, which would impact the testing behaviour and lead to negative testing (Teasley et al. 1994). However, the participants did not interact with the UI in our experimental setting, as we considered interaction to be part of the execution of tests rather than their design.

3.7.3 Test Case Design Template

To ensure consistency in the data collection, we prepared and provided a test case design template to the participants, as shown in Table 4. This template consists of three main columns: test case description, input/pre-condition and expected output/post-condition, along with an example test case. The example test case is provided so that the participants know the level of detail required when designing test cases. Furthermore, these three columns were chosen to aid us in understanding the designed test cases better during the data extraction phase, e.g. in marking them as consistent or inconsistent.

Table 4 Test case design template

3.8 Pilot Run

We executed a pilot run with five postgraduate students to achieve two objectives: first, to decide on the duration of the two levels (TP, NTP), and second, to improve the instrumentation of our experiment. In addition to meeting the first objective, we improved the wording of the requirements specification document to increase comprehension, based on the feedback from the participants. Two authors of this study independently marked the pilot run test cases as (in)consistent and then resolved discrepancies via discussion. This ensured they shared a common understanding, and this knowledge was then applied by the first author in the data extraction.

3.9 Analysis Methods

We analysed the data by preparing descriptive statistics, followed by statistical significance tests from the t-test family and an F-test (Hotelling’s T2) to test our hypotheses. To properly perform the statistical tests, we checked whether the data met the chosen test’s assumptions; whenever the data failed to meet them, a non-parametric counterpart of the respective test was performed. Hotelling’s T2 assumes the following:

  1. There are no distinct subpopulations, i.e. each population has a unique mean.

  2. The subjects from both populations are independently sampled.

  3. The data from both populations have a common variance-covariance matrix.

  4. Both populations follow a multivariate normal distribution.

For the univariate and multivariate normality assumptions, the Shapiro-Wilk test was used. We report multiple types of effect sizes depending on the statistical test run and the assumptions considered by the respective effect size measure (Fritz et al. 2012). Cohen’s d (0.2 = small, 0.5 = medium, 0.8 = large) and the correlation coefficient r (0.10 = small, 0.30 = medium, 0.50 = large) are used for the univariate tests, and the Mahalanobis distance is applied for the multivariate test. It is important to note that the thresholds for r are not equivalent to those for d because “Cohen requires larger effects when measuring the strength of association” (Ellis 2009). In order to validate the directional hypotheses, we performed one-tailed tests, except for H2, and α was set to 0.05 for all significance tests. The environment used for the statistical tests and for preparing the relevant plots was RStudio ver. 0.99.892, with external packages including Hotelling (Curran 2015), rrcov, mvnormtest and profileR (Desjardins 2005). Effect sizes r are computed using the formulae provided by Fritz et al. (2012).
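To make the assumption-check-then-test flow above concrete, the following is a minimal Python sketch. The paper’s actual analysis was performed in R (RStudio), so the function name compare, the ALPHA constant and the effect-size details here are illustrative assumptions rather than the authors’ scripts.

```python
# A minimal sketch of the univariate test-selection flow described above;
# data, names and effect-size conventions are illustrative assumptions.
import numpy as np
from scipy import stats

ALPHA = 0.05

def compare(a, b, paired=False, alternative="greater"):
    """Shapiro-Wilk normality check, then a one-tailed t-test or its
    non-parametric counterpart, plus a rough effect size."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    _, p_a = stats.shapiro(a)
    _, p_b = stats.shapiro(b)
    if min(p_a, p_b) > ALPHA:                      # treat both samples as normal
        test = stats.ttest_rel if paired else stats.ttest_ind
        res = test(a, b, alternative=alternative)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (a.mean() - b.mean()) / pooled_sd      # Cohen's d
        return test.__name__, res.pvalue, d
    test = stats.wilcoxon if paired else stats.mannwhitneyu
    res = test(a, b, alternative=alternative)
    z = stats.norm.isf(res.pvalue)                 # Z recovered from a one-tailed p
    r = z / np.sqrt(len(a) + len(b))               # r = Z / sqrt(N), Fritz et al. (2012)
    return test.__name__, res.pvalue, r
```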

4 Execution of the Experiment

In this section, we first describe how we executed the experiment and then elaborate on the data collection steps. Figure 2 presents the flow of the events related to our experimental execution. It visualises the pre-experimental activities, experimental sessions and post-experimental activities. Figure 2 also shows which activities were executed in parallel, e.g. training was conducted in parallel to the pre-experimental activities.

Fig. 2 Experimental process flow

4.1 Execution

The execution of the experiment involved activities taking place in two steps. In the first step, the participants filled out the pre-questionnaire and signed the consent forms well in advance of the experimental sessions. Their completion was, however, verified on the day of the experiment, before assigning the participants to the control and treatment groups.

The second step involved running the experimental sessions. The experimental sessions for TP and NTP were run in parallel in two different rooms, to which the participants were randomly assigned. On arrival, every second participant was sent to a different room to achieve a balanced design. The experimenters kept the participants unaware of the reason for their random allocation, but if a participant showed concern, s/he was informed that the activity/content would be the same in both rooms.

After randomisation, the TP session had 22 participants and the NTP session had 21 participants, totalling 43 participants. After distributing the task and the template sheets, the experimenters projected the screenshot of the UI and briefly introduced the task. The task’s introductory note had the same predetermined content for both groups. We then allotted 30 min to the TP group and 60 min to the NTP group to complete the task and regulated the time pressure as elaborated in Section 3.2.

We developed two detailed scripts for the TP and NTP sessions to guide the experimenters through the whole experimental session. The scripts contain the sequence of activities along with time stamps (when to say/do what), the description related to the task and instructions on how to conduct NASA-TLX. When the participants had finished with the task, post-experimental data was collected using NASA-TLX, which is a pen-and-paper activity. The actual duration of the experimental sessions did not deviate from plan.

4.2 Data Collection

We marked all valid test cases designed by the participants as consistent (c) or inconsistent (ic), whereas test cases were marked as dropped according to the criteria listed below:

  • Wrong test cases: test cases where the input, expected output and test case description were not in sync with each other, as well as test cases that resulted from a lack of understanding of the requirements. These test cases were the result of misunderstanding the specifications rather than actual inconsistent test cases.

  • Specification-conflicting test cases: test cases that were in conflict with the specified requirements. In other words, a test case validating a scenario/case that cannot happen when considering the rest of the specified functionality.

  • Repeated test cases: test cases that were duplicates.

Figures 3, 4, 5, 6, 7, 8 and 9 provide examples of test cases in each category. The transcribed versions, for easier readability, are available in the Appendix. For MusicFone’s specifications, please refer to the experimental package provided in Section 3.7.

Fig. 3 Example of a designed consistent (c) test case

Fig. 4 Example-1 of a designed inconsistent (ic) test case

Fig. 5 Example-2 of a designed inconsistent (ic) test case

Fig. 6 Example-1 of a designed wrong test case

Fig. 7 Example-2 of a designed wrong test case

Fig. 8 Example of a designed specification-conflicting test case

Fig. 9 Example of a designed confusing test case

In order to reduce subjectivity in the marking, we conducted a round of pilot marking to calculate the inter-rater agreement between the author collecting all the data and another author of the study. The pilot marking involved classifying 108 (28%) test cases from randomly chosen participants as consistent (c), inconsistent (ic) or dropped. We computed Randolph’s free-marginal kappa for inter-rater reliability, since it was not compulsory for each category to contain a certain number of cases. Randolph’s free-marginal kappa suggested substantial agreement at 66% (Randolph 2005, 2008; Landis and Koch 1977). The discrepancies were resolved via discussion to establish a common understanding, and all the test cases were then marked by one of the authors. A few test cases designed by the participants could not be classified because of their confusing nature. We resolved these cases by reaching an agreement after discussion, which further alleviated the subjectivity in the marking.
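As a minimal illustration of how such an agreement statistic is obtained, the Python sketch below applies the two-rater form of Randolph’s free-marginal kappa, kappa = (P_o − 1/k)/(1 − 1/k), to hypothetical ratings; the actual marking covered 108 test cases over the three categories described above, and the function name and label arrays here are assumptions for illustration only.

```python
def free_marginal_kappa(labels_a, labels_b, k=3):
    """Randolph's free-marginal kappa for two raters and k categories:
    (P_o - 1/k) / (1 - 1/k), where P_o is the observed agreement proportion."""
    assert len(labels_a) == len(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    return (p_o - 1 / k) / (1 - 1 / k)

# Hypothetical ratings over six test cases (categories: c, ic, dropped).
rater1 = ["c", "c", "ic", "dropped", "c", "ic"]
rater2 = ["c", "ic", "ic", "dropped", "c", "c"]
print(free_marginal_kappa(rater1, rater2))  # 4/6 agreement -> kappa = 0.5
```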

Figures 3, 4 and 5 present examples of c and ic test cases designed by the participants. The test case in Fig. 4 is ic because it validates a functionality that is not declared in the requirements. The requirements only specify the kind of data to be retrieved from the Last.Fm website and the behaviour of the MusicFone application, but the test case (Fig. 4) validates the connection between the Last.Fm server and the MusicFone application, which is certainly necessary for the application to proceed according to its specified behaviour. Figure 5 is an example of a test case that depicts a tester’s outside-of-the-box attitude, as the participant validates the required functionality of the itinerary compilation while also considering the location changes of the application’s user. Figures 6 and 7 are examples of dropped test cases. The test case in Fig. 6 is wrong because the data from Last.Fm is retrieved based on the artist currently playing, not on the basis of the user’s current GPS location. This shows that the participant did not develop a correct understanding of the requirements. The test case in Fig. 7 is wrong because the contents of the columns are not synchronised, making the purpose of the test case ambiguous. Figure 8 shows an example of a specification-conflicting test case because it tests whether the application displays the concerts of an artist at a distance of 500 km or more, but if that artist is added to the selected artists list, his/her concert (at >= 500 km) is not added (displayed) to the trip itinerary. It is conflicting (erroneous) because the specifications do not impose limitations based on distance but add concerts to the itinerary based on other stated conditions; hence, the test case was dropped.

The confusing test cases were those which were valid but difficult to classify as c or ic due to the content of the three columns in the design template. Therefore, they were resolved as either c or ic or were dropped after discussion among the authors of the study. An example of a confusing test case is shown in Fig. 9. After discussing and examining the pre-condition and expected output columns, this test case was classified as ic because the participant is validating that concert information fetched from the Last.Fm website truly belongs to the clicked artist.

In total, 14 out of 385 test cases were dropped from the TP and NTP groups: seven from each (3.93% of the TP group’s and 3.43% of the NTP group’s test cases). The marking also resulted in dropping one participant from the TP group because all of his/her test cases (3 test cases) were dropped. Consequently, the number of participants in both groups became equal, i.e. 21 participants in each group, leading to a balanced design. In accordance with the measurement improvement step mentioned in Section 3.3, we added 13 inconsistent test cases, out of the 371 valid test cases designed by the participants, to the predesigned set of test cases. As a result, the total number of inconsistent test cases increased to 50 (IC). The total number of consistent test cases in our final set remained the same, i.e. 18 (C), because no new consistent test cases were detected. Based on these data, we then calculated the rate of change values (z) of the extracted data for the TP and NTP groups.

5 Results

We first present the descriptive statistics of the collected data and then the results of hypothesis testing.

5.1 Descriptive Statistics

Table 5 shows the number of c and ic test cases, the number of participants, and the number of early finishing participants in each experimental group. In total, 174 test cases were designed by the TP group, whereas participants in the NTP group designed 197 test cases. The number of ic test cases in both groups is almost the same, but the NTP group designed 21 more c test cases than the TP group. Six out of 21 participants finished earlier than the designated time in the NTP group but none finished earlier in the TP group.

Table 5 Raw data

Table 6 shows the descriptive statistics for the c, ic and z values of the TP and NTP groups in terms of their minimum (min), maximum (max), mean, median (mdn) and standard deviation (sd). The descriptive statistics for the number of c and ic test cases provide insight into the design of consistent and inconsistent test cases by the participants in both experimental groups. We can see from the table that the mean of the c test cases in the TP group is lower than that in the NTP group, i.e. slightly more consistent test cases were designed in the NTP group. The relation for the ic test cases is similar, that is, on average, more ic test cases were designed in the NTP group than in the TP group. The maximum number of c test cases in the NTP group (15) is greater than in the TP group (11), which is expected since the NTP group had more time to work on the task. The standard deviations of c and ic are similar in both groups. Considering the z values, the mean of the NTP group is greater than the mean of the TP group with similar variation, which surprisingly suggests relatively more confirmatory behaviour in the absence of time pressure.

Table 6 Descriptive statistics

The box plots in Fig. 10 present the spread of the c and ic data of the TP and NTP groups. We can see that the median of c in the NTP group is greater than that in the TP group, and the same holds for the minimum and maximum values of c in both groups. The median of ic in the NTP and TP groups follows the same trend; however, the minimum value is 0 in both groups. The maximum value of ic in the TP group is greater than the maximum value of ic in the NTP group.

Fig. 10 c and ic data of the TP and NTP groups

Due to the scale difference between the absolute counts and the z values, we present the box plots of z for the TP and NTP groups separately in Fig. 11. We can see in Fig. 11 that there are no outliers in the TP group, whereas there are three in the NTP group, which affect the descriptive statistics, e.g. the mean value. However, there is only a slight difference between the median values of the two groups. The minimum value of the NTP group is clearly greater than the minimum value of the TP group. Finally, the spread of the values in the first and fourth quartiles of both groups is almost equal.

Fig. 11 Rate of change (z)

The z values for both groups reveal that all the participants designed relatively more consistent test cases, because all the values of z are greater than 0. Moreover, none of the participants designed a relatively equal number of c and ic test cases in either group, as the minimum value of z is 0.062 (TP). However, when we examine the absolute counts, there are a few participants who designed an almost equal number of c (consistent) and ic (inconsistent) test cases. In general, the number of inconsistent test cases in both groups is remarkably low compared to the consistent ones. Additionally, the absolute counts of c and ic show that there are more participants in the TP group who did not design any inconsistent test cases at all. The descriptive statistics show that, on average, participants in the NTP group designed comparatively more test cases. This indicates that the participants in the NTP group achieved more coverage than the TP group, which draws attention to the fact that the NTP group had relatively more time than the TP group. However, it is interesting that the coverage in the NTP group is not substantially higher than in the TP group. It is also evident from the descriptive statistics that, despite the time pressure, participants in both groups designed more consistent test cases than inconsistent test cases.

Figure 12 presents the box plots of the temporal demand attribute of NASA-TLX (the time pressure perceived by the participants) for the TP and NTP groups. There is a major difference between the medians of the TP and NTP groups, i.e. 83 points and 55 points, respectively. The minimum value for both groups is the same (15 points), but the maximum value of the TP group (100 points) is much higher than that of the NTP group (75 points), which is even lower than the median of the TP group (83 points). These box plots suggest that the temporal demand perceived by the TP group is much greater than that perceived by the NTP group when performing the same task.

Fig. 12 NASA-TLX temporal demand

5.2 Hypothesis Testing

In this section we present the results of hypothesis testing for H1, H2, H3 and H4.

5.2.1 Hypothesis 1

In order to test H1, we first tested it on the pooled data of the TP and NTP groups. For further analysis, we also performed statistical testing on the TP and NTP data separately. The normality assumption for the pooled data was tested for c and ic; the ic data failed to satisfy it, with a p-value of 5.987e-08. Similarly, the TP and NTP groups also failed to satisfy the assumption of normality for their ic data, with p-values of 4.727e-06 and 2.06e-04, respectively. Therefore, significance testing was performed with the Wilcoxon signed-rank test (a non-parametric counterpart of the paired t-test) in all these cases. We applied a Bonferroni-adjusted α = 0.016 to the pooled, TP and NTP tests. The null hypothesis is rejected for the pooled data with a p-value of 2.161e-08, an effect size r = 0.598 (large) and df = 41. The null hypothesis for NTP was rejected with a p-value of 5.48e-05 and an effect size r of 0.600, which is large. The test on the TP data also rejected the null hypothesis with a p-value of 5.8e-05 and an effect size r = 0.597, which is again large; df is 20 for both the NTP and TP tests.

As a result, we reject H10 and find evidence that testers design more consistent than inconsistent test cases.
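
For clarity, the following minimal Python sketch (using scipy) outlines the H1 procedure described above. It is not the authors' original analysis script; the array names and the example values are hypothetical, and the effect size r is recovered from the two-sided p-value via the normal approximation, r = |Z|/sqrt(N).

# Minimal sketch of the H1 analysis (illustrative; c and ic are hypothetical arrays holding
# each participant's counts of consistent and inconsistent test cases).
import numpy as np
from scipy import stats

def test_h1(c, ic, alpha=0.05 / 3):  # Bonferroni-adjusted alpha for three tests
    # Normality check of both measures; a violation motivates the non-parametric test.
    print("Shapiro p-values:", stats.shapiro(c).pvalue, stats.shapiro(ic).pvalue)
    # Wilcoxon signed-rank test on the paired counts (non-parametric alternative to the
    # paired t-test), testing whether c and ic differ systematically within participants.
    w_stat, p_value = stats.wilcoxon(c, ic)
    # Effect size r = |Z| / sqrt(N), with Z recovered from the two-sided p-value via the
    # normal approximation.
    z = stats.norm.isf(p_value / 2)
    r = z / np.sqrt(len(c))
    return p_value < alpha, p_value, r

# Example usage with hypothetical counts for one group of participants:
c = np.array([8, 7, 9, 6, 10, 7, 8, 9, 6, 7])
ic = np.array([1, 0, 2, 1, 0, 1, 2, 0, 1, 1])
print(test_h1(c, ic))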

5.2.2 Hypothesis 2

We performed a multivariate test of mean differences, Hotelling's T2, for the TP and NTP groups on the two dependent variables, c and ic, after checking the test's assumptions. Note that z was not included in this test because of its high correlation with c and negative correlation with ic, which could violate the multivariate normality assumption of this test. Hence, we statistically tested z separately. The first two assumptions of Hotelling's T2 hold. The results of Bartlett's test revealed that both dependent variables, c and ic, satisfy the assumption of a common variance-covariance matrix with pvalues = 2.3e − 1 (df = 1) and 3.6e − 1 (df = 1), respectively. Additionally, the last assumption of multivariate normality was satisfied after performing a natural log transformation on the ic data of the TP and NTP groups, with pvalues = 3.5e − 1 and 3.26e − 1, respectively.

Hotelling's T2 test was unable to detect a statistically significant difference between the two groups, i.e. we fail to reject H20 (T2 = 5.012, F(2,39) = 2.44, pvalue = 1.001e − 1, Mahalanobis D = 0.691, variance η = 0.111 with a null confidence interval). The result of this test suggests that time pressure may not have an effect on the (dis)confirmatory behaviour of testers.
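
To illustrate the multivariate comparison, the sketch below reconstructs a two-sample Hotelling's T2 test from first principles (pooled covariance followed by the exact F transformation). It is an illustrative reconstruction rather than the authors' script, and the input arrays are hypothetical placeholders for the per-participant (c, log-transformed ic) values of each group.

# Illustrative reconstruction of a two-sample Hotelling's T2 test (not the authors' script);
# X_tp and X_ntp are hypothetical (n x 2) arrays holding each participant's c and
# log-transformed ic values for the TP and NTP groups, respectively.
import numpy as np
from scipy import stats

def hotelling_t2(X1, X2):
    n1, p = X1.shape
    n2, _ = X2.shape
    d = X1.mean(axis=0) - X2.mean(axis=0)                 # mean difference vector
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    D2 = d @ np.linalg.solve(S_pooled, d)                 # squared Mahalanobis distance
    T2 = (n1 * n2) / (n1 + n2) * D2
    F = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2      # exact F transformation
    p_value = stats.f.sf(F, p, n1 + n2 - p - 1)           # F(p, n1 + n2 - p - 1), here F(2, 39)
    return T2, F, p_value, np.sqrt(D2)

# Example usage with hypothetical data for 21 participants per group:
rng = np.random.default_rng(0)
X_tp, X_ntp = rng.normal(size=(21, 2)), rng.normal(size=(21, 2))
T2, F, p, D = hotelling_t2(X_tp, X_ntp)
print(f"T2 = {T2:.3f}, F = {F:.3f}, p = {p:.3g}, Mahalanobis D = {D:.3f}")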

Figure 13 presents the profiles of c and ic, which visually suggests that they may not be parallel. However, the null hypothesis that the profiles are parallel is not rejected (pvalue = 4.146e − 01), i.e. the test provides no statistically significant evidence of an interaction between the variables. Visually, the profiles might suggest that participants in the NTP group designed more consistent test cases, but this can be due to the floor effect of the low number of ic test cases in both groups. H3 provides an additional analysis of this aspect.

Fig. 13 Profile plot of c and ic

5.2.3 Hypothesis 3

To determine the effect of time pressure on confirmation bias in relative terms, we first performed normality tests on the z data of both groups; the results revealed that the data are normally distributed, with a pvalue = 2.54e − 1 for the TP group and a pvalue = 3.52e − 1 for the NTP group. Therefore, we executed a two-sample t-test, which failed to reject the null hypothesis with a pvalue = 8.72e − 1 and a 95% confidence interval of (− 0.131, Inf). The degrees of freedom are 40 and the effect size d is − 0.357, indicating a small effect, although the confidence interval contains 0 and the negative sign indicates that the mean decreases in the direction of the TP group.
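
The H3 comparison can be sketched as follows. The sketch is illustrative and not the authors' script: z_tp and z_ntp are hypothetical arrays of the rate-of-change values for the two groups, and the default two-sided alternative is used for simplicity, whereas the confidence interval reported above is one-sided.

# Minimal sketch of the H3 analysis (illustrative; z_tp and z_ntp are hypothetical arrays of
# the rate-of-change values for the TP and NTP groups).
import numpy as np
from scipy import stats

def test_h3(z_tp, z_ntp):
    # Normality per group (in the study, both groups satisfied it, motivating a parametric test).
    print("Shapiro p-values:", stats.shapiro(z_tp).pvalue, stats.shapiro(z_ntp).pvalue)
    # Two-sample Student's t-test (equal variances assumed), df = n1 + n2 - 2.
    t_stat, p_value = stats.ttest_ind(z_tp, z_ntp)
    # Cohen's d using the pooled standard deviation.
    n1, n2 = len(z_tp), len(z_ntp)
    s_pooled = np.sqrt(((n1 - 1) * np.var(z_tp, ddof=1) +
                        (n2 - 1) * np.var(z_ntp, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(z_tp) - np.mean(z_ntp)) / s_pooled
    return t_stat, p_value, d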

5.2.4 Hypothesis 4

To test the sanity check hypothesis, which was developed to validate the operationalisation of time pressure, we first inspected the normality of the data for the TP and NTP groups. For this, we considered the data of the temporal demand attribute from the NASA-TLX scaled measurements. The normality tests revealed that the data for the TP and NTP groups do not follow a normal distribution. Hence, we performed a two-sample Mann-Whitney test, which yielded a pvalue = 1.231e − 3; thus, we reject the null hypothesis, revealing that the time pressure (temporal demand) perceived by the TP group is significantly greater than that perceived by the NTP group. The effect size r is − 0.469, which is medium, and df = 40.
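
Similarly, the sanity check (H4) can be sketched as follows; the sketch is illustrative only, td_tp and td_ntp are hypothetical arrays of the NASA-TLX temporal demand ratings, and the effect size r is again recovered from the p-value via the normal approximation.

# Minimal sketch of the H4 analysis (illustrative; not the authors' script).
import numpy as np
from scipy import stats

def test_h4(td_tp, td_ntp):
    # Mann-Whitney U test for two independent samples (normality did not hold).
    u_stat, p_value = stats.mannwhitneyu(td_tp, td_ntp, alternative="two-sided")
    # Effect size r = |Z| / sqrt(N) via the normal approximation.
    n = len(td_tp) + len(td_ntp)
    r = stats.norm.isf(p_value / 2) / np.sqrt(n)
    return u_stat, p_value, r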

5.3 Summary of Results

Table 7 presents a summary of the hypothesis testing results. The effect size column lists the effect size along with its type, e.g. r. We can see that H10 is rejected, signifying that consistent test cases are designed more often than inconsistent test cases. We fail to reject H20, which shows that time pressure may not significantly affect the (dis)confirmatory behaviour. The case is similar for H3: we did not observe the manifestation of confirmation bias at different rates between the TP and NTP groups. These results need further investigation with a larger sample size. However, we are confident in the successful operationalisation of time pressure, which is evident from the rejection of H40.

Table 7 Hypothesis results summary

The H1 results provide evidence that testers exhibit confirmatory behaviour while designing test cases, i.e. they designed considerably more consistent test cases than inconsistent ones. This suggests that participants were highly biased in their approach to designing test cases. On the other hand, the comparison of the experimental groups for observing the effect of time pressure in terms of the designed test cases (H2) and the rate of change (H3) did not reveal any evidence for an effect of time pressure. Specifically, in our experiment, time pressure did not cause the participants to manifest more confirmation bias in either absolute or relative terms. Nevertheless, confirmation bias manifested in the design of test cases regardless of time pressure. The hypothesis that we developed to validate our levels of time pressure (H4) revealed that the time pressure perceived by the TP group was significantly greater than that perceived by the NTP group. Thus, the results of this hypothesis suggest that we successfully manipulated time pressure for the TP and NTP groups. Although NASA-TLX was used to measure the task difficulty as perceived by the task performer, to our knowledge there is no standard way to interpret its scores.

6 Discussion

The reasons for these results are manifold. The first possible reason is the object of our study, MusicFone, which is a realistic and complex object. Accordingly, the participants may have faced problems in designing test cases for it despite being provided with the conceptual prototype (screenshot) to help them understand it. This is partly supported by the observation that the coverage of the specifications is low not only for inconsistent test cases but also for consistent ones. Additionally, we observed no large differences in coverage between the TP and NTP groups. It is conceivable that the use of a toy object in the experiment might have led to different results, though such an approach would have compromised realism.

Besides the complexity of the task, another reason for the results could be the participants' unfamiliarity with the domain of the provided object. If the participants had been trained in or were experienced in a domain similar to that of MusicFone, the results might have been different. But in such a case, imposing similar time limits to create time pressure would be unjustifiable, as their familiarity with the domain might not have allowed them to experience the time pressure to its full extent. Therefore, to observe the effect of domain familiarity, care has to be taken in deciding the duration used to create time pressure.

Another reason for the results could be the participants' (in)experience in testing, as investigated by Calikli and Bener (2010, 2014) and Teasley et al. (1994), because there was room for the participants, especially in the NTP group, to design more inconsistent test cases. Yet, our participants can be considered novice professionals, because most had less than three years of industrial experience. It is unlikely that our results were affected by the training given on the equivalence partitioning strategy. In equivalence partitioning, test cases are designed based on the identification of valid and invalid classes. We considered a test case consistent if the functionality was explicit in the specifications and inconsistent otherwise, irrespective of the class from which the test case was derived. The same concept of (dis)confirmatory behaviour applies to software testers in industry as well. Regardless of the strategy they follow in designing test cases (e.g. equivalence partitioning, boundary testing), if the specifications are explicit about a functionality, then a test case validating the respective specification is consistent (confirmatory), and inconsistent (disconfirmatory) otherwise.
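
To make the classification rule concrete, the snippet below gives a purely illustrative sketch; the specification items and test case descriptions are invented for illustration and are not taken from the MusicFone specifications.

# Illustrative sketch of the (in)consistency classification rule described above.
# The specification items and test cases are hypothetical, not the actual MusicFone ones.
EXPLICIT_SPEC_ITEMS = {
    "list upcoming concerts for artists in the music library",
    "sort the concert list by distance from the current location",
}

def classify(tested_behaviour: str) -> str:
    # Consistent (confirmatory) if the tested behaviour is explicit in the specification;
    # inconsistent (disconfirmatory) otherwise, regardless of whether the test case comes
    # from a valid or an invalid equivalence class.
    return "consistent" if tested_behaviour in EXPLICIT_SPEC_ITEMS else "inconsistent"

print(classify("sort the concert list by distance from the current location"))  # consistent
print(classify("behaviour when no GPS signal is available"))                     # inconsistent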

Unfamiliarity with the domain, together with limited experience, may have further increased the effect of task complexity. This may have made the allotted time inadequate for the control group as well. As a result, no significant differences existed in the (dis)confirmatory behaviour between the groups due to time pressure. The participants, challenged with completing the task within the available time, might have kept exploring and trying to understand the complex object in parallel with designing test cases. This may have led to the design of more consistent test cases than inconsistent ones in both groups.

The total number of test cases designed by each of the six participants who finished early in the NTP group ranges from a minimum of 6 to a maximum of 15. The maximum number achieved by an early finisher indicates that s/he designed all the possible test cases; in other words, s/he could not think of any more test cases and thus finished before the end of the time allotted to the NTP group. The minimum number of test cases achieved by this group of early finishers possibly indicates an inclination to quit. Both situations depict less experienced participants handling a complex task. However, the data of the six participants who finished early are not enough to support the presumption of added task complexity.

One reason for the relatively low number of inconsistent test cases in both groups could be the lack of a feeling of accomplishment, which Pham et al. (2014) found to be an inhibiting factor with respect to testing. The feeling of accomplishment comes from the detection of actual defects, but since students are novice testers, they do not have this experience (Pham et al. 2014). Pham et al. (2014) investigated the enablers, inhibitors and perceptions of testing by examining student projects that involved development. In relation to our study, the participants might also have had an inadequate feeling of accomplishment because of their lack of considerable or industrial experience with testing. This inadequate feeling could also be due to the fact that our participants did not execute the test cases, since they were limited to designing them. If they had been able to execute the test cases, especially the inconsistent ones, defect detection (if any) could have motivated them to design and execute more inconsistent test cases, thereby providing them with a feeling of accomplishment. However, their interaction with the application (MusicFone) for the execution of the test cases and the presence of defects could have created construct and internal validity threats. As mentioned in Section 3.7.2, the presence of or feedback from errors can possibly affect testing behaviour in terms of a positive test bias (Teasley et al. 1994).

The completeness of the specifications is also an important dimension with respect to the object used in our study. Complete requirements suggest that every required and non-required behaviour is stated in explicit terms. In such a scenario, test cases designed to confirm the requirements are all consistent test cases, which is a depiction of confirmatory behaviour or a manifestation of confirmation bias. The manifestation of confirmation bias with complete specifications is then not adverse in software testing, because every required and non-required behaviour is confirmed via testing. Our object, having a realistic level of completeness, provided the participants with an opportunity to design inconsistent test cases, including outside-of-the-box cases. Nevertheless, the participants possibly struggled due to their inexperience, domain unfamiliarity and lack of a feeling of accomplishment, which limited them in designing a substantial number of inconsistent test cases.

Our results support the findings of Causevic et al. (2013), Leventhal et al. (1994) and Teasley et al. (1994), in that the participants exhibited significant confirmatory behaviour in designing the test cases. The results are also similar to the findings of Mäntylä et al. (2014), Nan and Harter (2009) and Topi et al. (2005), as time pressure did not affect the dependent variable. The results of our study also challenge those of Leventhal et al. (1993, 1994), as they interpreted their results with respect to valid (positive) and invalid (negative) equivalence classes. The results of these studies might differ if the test cases were reassessed according to our (in)consistency terminology. In the case of Causevic et al. (2013), it is unclear how they distinguished between positive and negative test cases, despite defining the terms: "the experiment performed in this study has a built-in mechanism of differentiating whether a particular test case is of a positive or negative type" (Causevic et al. 2013, p. 3).

The testers' confirmatory behaviour in designing test cases may compromise the quality of testing, which may consequently deteriorate software quality. It is harmful because testers tend to confirm the functionality of the program rather than test it to reveal its weaknesses and defects. In other words, they do not test all aspects of the functionality, e.g. corner cases. This confirmatory approach may contribute to low external quality, thus elevating development and testing costs. As mentioned earlier, the phenomenon of confirmation bias with complete specifications may not have a deleterious impact. Yet, it may become problematic due to the prevalence of incomplete requirements specifications in the SE industry that fail to elicit every functionality, especially for large and complex systems (Albayrak et al. 2009; Paetsch et al. 2003). Software testers design test cases based on their understanding of (often incomplete) requirements specifications, which may lead to incomplete coverage of exceptional cases, i.e. inconsistent test cases. Therefore, it is essential that testers develop a disconfirmatory, outside-of-the-box attitude towards software testing. The development of a disconfirmatory attitude should also be taught to software engineering students alongside standard testing techniques, e.g. equivalence partitioning. Furthermore, the degree to which time pressure promotes a confirmatory attitude among testers remains inconclusive, as we did not detect statistically significant evidence of such an effect.

7 Threats to Validity

In this section, we analyse the threats to the validity of our study and discuss the relevant ones, following the list provided by Wohlin et al. (2000).

7.1 Construct Validity

The interaction of task and treatment, due to a one-factor, two-treatment experimental design with one object, is a shortcoming of our study in terms of the mono-operation bias threat. It limits the results to the single object (MusicFone) alone. Another possible threat is related to the total count of inconsistent (IC) test cases. In the validity improvement step, we increased the IC count through the addition of 13 test cases, and it is possible that this count would increase further if this study were replicated. One possible effect of this could be results that are even more in favour of confirmatory behaviour, or confirmation bias manifestation, since the higher the IC value, the lower the coverage of inconsistent test cases. The mono-method bias threat was addressed by executing a pilot run and by resolving the confusing test cases. Our study is not prone to the threat of the interaction of testing and treatment because the participants were asked to perform the experimental task as if it were a normal class activity; additionally, they were unaware of the treatment, which could otherwise have impacted the results. The example test case provided in the test case design template was a low-hanging fruit. Although it was a consistent test case example, we do not think that it biased the participants towards designing either consistent or inconsistent test cases.

7.2 Internal Validity

Here, we discuss the threats to internal validity that are related to multiple groups, because our study involved two groups. In our study, we taught and trained all the participants together and randomly assigned them to the experimental and control groups. In this way, we avoided the interaction-with-selection threat. In our opinion, the provision of a conceptual prototype in the experimental run, which was different from the one used in training, did not influence the results because it was provided to both the experimental and control groups. Moreover, we applied the treatment to the experimental group in parallel with the control group, but in a different room; thus, we prevented the imitation-of-treatments threat. We addressed the threat of compensatory equalisation of treatments by not compensating either the treatment or the control group in a special way. Additionally, we improved the instrumentation as a result of the pilot run, and we also enhanced the baseline test case set from the participants' test case data. Moreover, since all the participants underwent the same training, compensatory rivalry and resentful demoralisation were precluded. Regarding the experimental design, we did not choose other designs because of the limitations of the course's time frame. The announcement of bonus marks as an incentive to participate in the experiment was not a threat to internal validity because the incentive was neither coercive nor an undue influence (Commission 2012).

7.3 External Validity

We can categorise our participants as a proxy for novice professionals based on their experience, as discussed in Section 3.6. However, the use of student participants rather than professionals poses an interaction of selection and treatment threat to external validity, since they were first-year graduate students, many with less than six months of industrial experience. Moreover, we conducted the study in an academic environment and manipulated time pressure in a controlled and psychological manner, i.e. other factors that could affect the application of the treatment were not prominently present. However, we used a realistic object for our study and thereby tried to achieve realism for the task. Additionally, we do not consider the use of pen and paper for performing the task to be unrealistic, because our study focused on the design of test cases rather than their execution. In conclusion, we have addressed the interaction of setting and treatment threat to the best possible extent, considering the context of the experiment.

7.4 Conclusion Validity

It is possible that we are committing a Type II error by failing to reject the null hypotheses (H20, H30) related to time pressure's effect on confirmation bias manifestation, because the effect size indicates there might be an effect, albeit a small one. This needs to be addressed via replications with larger samples. We addressed the threat of violated assumptions by testing our data against the assumptions of each statistical test before testing our hypotheses. We addressed the reliability of measures threat by executing a pilot run and by computing inter-rater agreement for the evaluation of test cases as (in)consistent; this established a common understanding among the authors of the study and avoided relying on the subjective view of a single evaluator. The reliability of treatment threat was addressed by preparing a scripted guideline for the experimental and control groups, which helped the experimenters implement the treatment and control uniformly, and through the sanity check hypothesis.
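
To make the replication recommendation concrete, the following sketch shows how an a priori power analysis could estimate the per-group sample size needed to detect an effect of the magnitude observed for H3 (|d| ≈ 0.357) with 80% power. The use of statsmodels and the chosen inputs are illustrative assumptions, not part of the original study design.

# Illustrative a priori power analysis for sizing a replication (assumes statsmodels).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.357, alpha=0.05, power=0.8,
                                    ratio=1.0, alternative="two-sided")
print(f"Participants needed per group: {n_per_group:.1f}")  # on the order of 120+ per group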

8 Conclusion

In this study, we examined the manifestation of confirmation bias in a functional software testing context and considered whether, if such a manifestation exists, it is affected by time pressure. In order to address our goals, we executed an experiment with 42 graduate (master's degree) students, manipulating time pressure across the treatment and control groups. We detected the manifestation of confirmation bias in terms of specification-consistent and specification-inconsistent test cases. Therefore, our study contributes to the existing body of literature by providing empirical evidence of confirmatory behaviour in functional software testing and by investigating the effect of time pressure on confirmation bias in SE.

Pertaining to the observed results, the answer to RQ1 is that our study documents the manifestation of confirmation bias, in other words confirmatory behaviour, in designing functional test cases, similar to previous research findings (Causevic et al. 2013; Leventhal et al. 1994; Teasley et al. 1994). Test designers generated significantly more specification-consistent test cases, with a large effect size. For RQ2, however, we did not find evidence of an effect of time pressure on confirmatory behaviour.

In terms of our recommendations to practitioners, the terms consistent and inconsistent test cases should be revisited in the context of functional software testing. Our definition of a consistent test case in relation to confirmation bias suggests that confirmation bias may not always be a negative phenomenon, provided that the requirements specifications are complete and well elicited in terms of the required and non-required functionality or behaviour. However, this is not usually the case in the fast-paced software industry. Therefore, the reality is that, in practice, confirmation bias is more likely a phenomenon that adversely affects software quality by limiting the verification of unspecified, yet essential, software behaviour. While it would be unrealistic to recommend changes to specification elicitation practices, we can make recommendations targeted at individuals working in the testing profession. We recommend that testers develop self-awareness of confirmation bias and counter it with a disconfirmatory attitude.

For researchers, the results of our study suggest that there is a need to replicate or extend this experiment with larger samples in order to detect the effect of time pressure, if any, and to investigate other factors that might have an impact on confirmation bias or might mitigate the effect of time pressure on it. For example, studying task complexity, domain familiarity and experience as confounding factors is among the possible extensions for future work. A qualitative study is also planned to help identify whether software testers prefer confirmatory test cases when beginning testing and what factors (e.g. time pressure) influence their preferences and behaviours.