Test environments for large‐scale software systems—An industrial study of intrinsic and extrinsic success factors

The characteristics of the test environment are of vital importance to its ability to support the organization's testing objectives. This paper seeks to address the need for a structured and reliable approach, which can be used by companies and other organizations to optimize their test environments in each individual case. The reported study included a series of interviews with 30 individuals, a series of focus groups with 31 individuals in total and a cross-company workshop with 30 participants from five large-scale companies operating in different industry segments. The study resulted in a list of success factors, including not only characteristics and capabilities existing within a test environment (intrinsic success factors) but also properties not inherent to the test environment yet vital for a successfully implemented test environment (extrinsic success factors). This distinction is important, as the root causes differ and as addressing them requires distinct approaches, not only of technology but also of organization, communication and collaboration. We find that successful implementations of test environments for large-scale software systems depend primarily on how they support the company's business strategy, test organization and product testability (extrinsic success factors). Based on this, test environments can then be optimized to improve test environment capabilities, usability and stability (intrinsic success factors). The list of intrinsic and extrinsic success factors was well received by all five companies included in the study, supporting that the intrinsic and extrinsic success factors for test environments can be applied to a large segment of the software industry.

In our previous work, we have repeatedly touched upon problems and challenges related to test environments. In our studies of continuous integration impediments, 'reliability of test environments' was identified as one of twelve factors that could enable more frequent integration of software [1]. Furthermore, 'simulated test environment or real hardware' is one of the main aspects to consider when designing a continuous integration and delivery pipeline [2], and 'test environments that support debugging and recording' are one of the key factors that enable efficient and effective exploratory testing [3].
Based on this, we find that the characteristics of the test environment are of vital importance to its ability to support the organization's testing objectives. This paper seeks to address the need for a structured and reliable approach, which can be used by companies and other organizations to optimize their test environments for large-scale and complex software systems.

| Terminology
As always, some confusion exists around terminology: is the right wording 'test environment', 'test infrastructure', 'testware' or something else? The IEEE Standard Glossary of Software Engineering Terminology (1990) uses the term 'test bed' and defines it as 'an environment containing the hardware, instrumentation, simulators, software tools, and other support elements needed to conduct a test' [4].
In this publication, we rely on the IEEE definition but use the term 'test environment', as this seems to be the established terminology in more recent publications, as seen, for example, in the study by Afzal et al., which compares 18 software test process approaches [5].

| Research question
Continuous practices [6], including continuous integration, delivery and deployment, are now firmly established in the mainstream of the software engineering industry. However, the original definitions of continuous integration [7] and continuous delivery [8] speak of 'members of a team' when describing the size of the organization and of 'network topology and firewall configuration' when describing the system's infrastructure.
In the companies we work with as researchers, software development for a product has grown to involve hundreds or even thousands of engineers instead of a single team. Even though software is an increasingly important part of the product, many products (such as cars or mobile phones) also include complex electronic and mechanical systems. We have seen that for these types of systems, developing test environments is no longer a trivial task; the test environment is instead a large-scale and complex system in itself, including integration of bespoke hardware and simulator models representing physical systems. We have partially addressed challenges related to test environments in our previous work, but we have not previously approached this field of research holistically. In response to this, the purpose of this study is to investigate the field in a more structured way. This paper answers the following research question: What are the success factors for test environments that can enable efficient and effective testing of large-scale software systems?
The importance of certain success factors is discussed in many research papers related to software engineering (some examples are listed by Dybå [9]), and the term can be used with different meanings. By success factors, we mean all types of aspects that contribute to, or influence, the optimization of test environments in an organization; that is, we are not limited to factors related to software process improvement (SPI), such as those listed by Rainer and Hall [10].

| Contribution
The contribution of this paper is two-fold: First, it presents detailed results from interviews and workshops with participants from five companies, giving researchers and practitioners an improved understanding of problems and challenges related to test environments, as perceived in large-scale industry companies. Second, it presents a list of success factors for test environments enabling efficient and effective testing of large-scale software systems, which can be used by companies and other organizations to optimize their test environments in each individual case.
The remainder of this paper is organized as follows: In the next section, we present the research method. This is followed in Section 3 by a study of related literature. In Section 4, we present the results from the primary interviews, followed by a presentation of the results from the focus groups in Section 5. Section 6 presents the analysis of the interviews and the focus groups, followed by a summary of the cross-company workshop with all five companies included in the study. Threats to validity are discussed in Section 7. The paper is then concluded in Section 8, including a description of further work.
In the literature review conducted in the study, we identified a need for secondary studies related to test environments (also described by Garousi and Mäntylä [11]). In response to this, we have included detailed descriptions of the results from the interviews and the focus groups (presented in Sections 4 and 5), intended primarily for researchers conducting secondary studies or replicating this study. The analysis of the results from the interviews and the workshops is presented in Section 6, intended for practitioners and researchers.

| Overview of the research method
The research study reported in this paper consists of four major steps:
• Step 1: Reviewing literature: a literature review to investigate how pain points or success factors for test environments for software systems are described in literature (presented in Section 3).
• Step 2: Primary interviews: a series of interviews to provide insights from a large number of interviewees (presented in Section 4).
• Step 3: Focus groups: workshops with focus groups to provide insights from companies in different industry segments (presented in Section 5).
• Step 4: Analysis and cross-company workshop: analysis of the results from the primary interviews and the focus groups, presented at a cross-company workshop with representatives from all five companies in the study (presented in Section 6).
An overview of the research method is presented in Figure 1: As a first step, a literature review was conducted in order to look for solutions in published work. As the next step, a series of interviews were conducted with representatives from one of the studied companies. This was followed by focus groups with the representatives from the other four companies. As the final step, the analysis of the interviews and focus groups was presented at a cross-company workshop, confirming and complementing the findings from the previous steps in the study. Figure 1 also shows how the five companies were included in the different steps of the study. The research method for each part of the study is further described in Sections 2.2-2.5.
The study was purposely designed with interviews (Step 2) and focus groups (Step 3) to allow method triangulation, following the guidelines from Runeson and Höst [12]. The focus group format was chosen to allow the participants to explore and clarify their views in ways that would be less easily accessible in a one-to-one interview (as described by Kitzinger [13]). The input from the primary interviews was used by the moderator to mitigate the problem that focus groups in most cases tend to be influenced by one or two dominant people (described by Greenbaum [14]), as further described in Section 2.4.
The study includes five large-scale industry companies, referred to as Company A, Company B, Company C, Company D and Company E. The companies operate in five separate industry segments:
• Automotive products and services
• Communications systems and services
• Services and solutions for military defense and civil security
• Transport solutions for commercial use
• Video surveillance cameras and systems
Due to non-disclosure agreements, we do not refer to the companies in the study by name or specify the industry segment of each company. The companies included in the study are all multi-national organizations with more than 2000 employees. All companies develop large-scale and complex software systems for products that also include a significant amount of mechanical and electronic systems (including standard components as well as bespoke hardware). The companies were purposely selected as suitable for the study, as they have similar characteristics but at the same time operate in different industry segments.
To investigate whether solutions related to the research question have been presented in published literature, a literature review was conducted (with the question driving the review phrased as 'How are pain points or success factors for test environments for software systems described in literature?'). As the contribution of the study is not the strict academic procedure but the industrial relevance, the literature review was not conducted as a systematic literature review but was instead optimized to provide a better understanding of previously published literature related to the research question (informing the setup for the interviews and the focus groups).

| Reviewing literature
The literature review involved two of the researchers in order to secure quality and correctness. Due to the large number of papers that are in some way related to the topic, the objective of the literature review was not to conduct a complete and exhaustive review (iterating until no new papers are found) but to provide an overview of published literature in order to inform the setup and design of the interviews and focus groups. With this strict limitation on the number of iterations and additional searches, the literature review was conducted following the guidelines for literature studies from Wohlin [15]:
• Identify keywords and formulate search strings to define a start set with publications selected from the initial search, based on their relevance for the question driving the review.
• Conduct iterations including 'backward and forward snowballing', that is, examining the reference lists of the publications in the start set, and publications citing the publications in the start set. The identified publications are included in the review based on their relevance for the question driving the review.
• Perform additional searches based on the authors of the included papers, and the journals and conferences where included papers are published.
• All papers selected for inclusion in the literature review go into data extraction, which should be conducted in accordance with the research questions posed in the study.
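As an illustration only, the iterative part of this procedure can be summarized as set operations over papers, as in the following minimal sketch. The helper callables `references_of`, `citations_of` and `is_relevant` are hypothetical stand-ins for steps that were performed manually in the study; this is not tooling used by the authors.

```python
# Minimal sketch of backward/forward snowballing (after Wohlin [15]).
# references_of, citations_of and is_relevant are hypothetical callables,
# standing in for steps that were performed manually in the study.

def snowball(start_set, references_of, citations_of, is_relevant, max_iterations=1):
    included = set(start_set)
    frontier = set(start_set)
    for _ in range(max_iterations):  # the study deliberately limited iterations
        candidates = set()
        for paper in frontier:
            candidates |= set(references_of(paper))  # backward snowballing
            candidates |= set(citations_of(paper))   # forward snowballing
        new_papers = {p for p in candidates - included if is_relevant(p)}
        if not new_papers:  # stop when an iteration yields no new papers
            break
        included |= new_papers
        frontier = new_papers
    return included
```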
As a complement to the review of academic papers, a number of often cited books were also reviewed in order to provide input from practitioners. The reviewed books were selected based on experiences from our previous studies [5][6][7] on integration and testing of large-scale and complex software systems, as described in Section 1.
The result of the literature review was a better understanding of previously published literature related to the research question, which was later used in the study to discuss the novelty of the results.

| Primary interviews
As a first step towards answering the research question presented in Section 1, a series of interviews was conducted, including 30 individuals from one of the studied companies (Company A). The purpose of this first series of interviews was to provide insights from a large number of interviewees in the same organization on how they described problems and experiences from their test environments. All interviewees were working part-time or full-time with testing at Company A.
The interviews lasted from half an hour up to (in most cases) 1 h. They were conducted as semi-structured interviews [16], using an interview guide with pre-defined specific questions (presented in Section 4.1). One of the two interviewers summarized and transcribed the interviewee's responses during the interview, and the responses were read back to the interviewee to ensure accuracy. The interview results were analyzed by two of the researchers based on thematic coding analysis as described by Robson and McCartan [17], outlined in the following:
• Familiarizing with the data: Reading and re-reading the transcripts, noting down initial ideas.
• Generating initial codes: Extracts from the transcripts are marked and coded in a systematic fashion across the entire data set.
• Identifying themes: Collating codes into potential themes, gathering all data relevant to each potential theme. Checking if the themes work in relation to the coded extracts and the entire data set. Revising the initial codes and/or themes if necessary.
• Constructing thematic networks: Developing a thematic 'map' of the analysis.
• Integration and interpretation: Making comparisons between different aspects of the data displayed in networks (clustering and counting statements and comments, attempting to discover the factors underlying the process under investigation and exploring for contrasts and comparisons). Revising the thematic map if necessary. Assessing the quality of the analysis.
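As an illustration of the coding and collation steps, the minimal sketch below shows one way coded transcript extracts could be gathered into candidate themes and counted. The data structures, the example extracts and the code-to-theme mapping are illustrative assumptions, not the instrument used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical coded extracts: (interviewee id, code, transcript extract).
# These examples are illustrative; they are not the actual coded data set.
coded_extracts = [
    (1, "usability", "better presentation of simulator configuration"),
    (2, "stability", "something wrong when changing hardware configuration"),
    (3, "usability", "not everyone could see the same information"),
]

# Illustrative mapping from initial codes to candidate themes.
code_to_theme = {
    "usability": "Test environment usability",
    "stability": "Test environment stability",
}

# Collate coded extracts into themes (the 'identifying themes' step).
themes = defaultdict(list)
for interviewee, code, extract in coded_extracts:
    themes[code_to_theme[code]].append((interviewee, extract))

# Count statements per theme (the 'clustering and counting' activity).
theme_counts = Counter({theme: len(extracts) for theme, extracts in themes.items()})
print(theme_counts.most_common())
```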
The result of the analysis of this first series of interviews was a thematic map of problems and improvement areas for test environments, as described by the 30 interviewees from Company A.

| Focus groups
The next step of the study was a series of focus groups, one with each of the remaining studied companies (Company B, Company C, Company D and Company E). The series of focus groups involved 31 participants in total. The purpose of the focus groups was to gain insights from companies in different industry segments and to compare the results from the primary interviews with Company A with responses to the same questions from focus group participants from Company B to Company E. The focus groups included engineers with senior roles with regard to testing and test environments in each company, such as senior test leader. The focus group participants were selected to cover different aspects of testing and test environments, forming focus groups of between six and ten participants, as recommended by Morgan [18].
The focus group workshops lasted between 3 and 4 h. Two of the focus groups were held with all participants at the same physical location, and the other two as virtual Teams meetings, depending on the Covid-19 situation at the time of each focus group. The participants in each focus group were asked the same questions as in the primary interviews (presented in Section 4.1). After the first responses to each question, the results for the same question from the primary interviews (with Company A) were presented to the focus group, one question at a time. The responses from Company A were in this way used as a catalyst for further discussions on areas previously not touched upon and enabled discussions on differences and similarities between the companies.
Two of the researchers were present at the workshops with each of the focus groups, summarizing and transcribing the discussions during the workshops. In some cases, responses were read back to the participants to ensure accuracy, but without interrupting the interaction between the focus group participants (following the guidelines from Kitzinger [13]). After each focus group, the two researchers compared their notes from the workshop, summarizing and analyzing the results from each focus group.
The result of the analysis of each focus group was a better understanding of problems and improvement areas for test environments, as described by each of the companies (Company B to Company E).

| Analysis and cross-company workshop
As a final step, the results from all interviews and focus groups were analyzed with thematic coding analysis as described by Robson and McCartan [17], following the same methodology as described in Section 2.3. The process was conducted iteratively to increase the quality of the analysis, reaching consensus within the group of researchers through discussions and visualization in diagrams and text. Comments and statements from the interviewees were first sorted into categories, which were reorganized into new structures during the process.
The analysis of the interviews and focus groups was presented at a cross-company workshop with 30 participants from the five companies in the study. The purpose of the cross-company workshop was to confirm and complement the findings from the previous steps in the study. The participants at the cross-company workshop were selected from the companies to include both participants who had been part of the focus groups and individuals not previously involved in the study in order to add new perspectives. All participants had senior roles with regards to testing and test environments in each company.
The cross-company workshop lasted for 4 h (with an additional lunch break) with all participants at the same physical location. At the workshop, the researchers presented a summary of the analysis from the previous steps in the study. This was followed by breakout sessions in smaller groups, allowing the participants to discuss the presented results, and then by a summarizing session with all 30 participants, discussing the findings from the breakout sessions. The purpose of the breakout sessions was to allow interaction where participants could ask questions of each other, as well as re-evaluate and reconsider their own understandings of their specific experiences (as described by Gibbs [19]). The purpose of the session with all participants was to enable discussions in new constellations and to further strengthen the generalizability of the findings.
The result of the analysis and the cross-company workshop was a list of success factors for test environments enabling efficient and effective testing of large-scale software systems, based on input from all five companies in the study.

| Identifying relevant literature for the review
The first step in the study was to conduct a literature review, using the reference lists of research papers and the citations to those papers to identify additional papers [15]. The steps in the literature review are described in detail in Section 2.2. The question driving the review was 'How are pain points or success factors for test environments for software systems described in literature?' Following the guidelines from Wohlin [15], Google Scholar was selected as the search engine to identify a start set for the review, as Google Scholar (according to Wohlin) should be considered 'a good alternative to avoid bias in favor of any specific publisher'. An initial search on Google Scholar with the search string ['test environment'] yielded 165,000 results. In order to narrow the search scope, existing systematic literature reviews were used as a starting point for the literature review. A search string that could provide a start set for the review was identified with trial searches, following the guidelines from Wohlin for identifying a good start set (a start set with papers from different communities, a start set not too small, a start set not too big, a start set with papers from different publishers, years and authors, and a start set avoiding papers using a specific terminology). The search string ['test environment' AND 'systematic literature review'] on 13 December 2021 yielded 1280 results. The top 30 publications from this search, as ranked by relevance by the search engine, were selected as the start set for the review (following the guidelines from Wohlin to select the most relevant papers if too many papers are found). As a first step, the title and abstract of the publications in the start set were reviewed. As a second step, the full-text documents were reviewed, and additional relevant publications were identified from the reference list of each paper. Additional searches were conducted, based on, for example, the authors of the identified papers. Finally, the results from the review of the 38 papers selected for inclusion were collated and summarized.
As a complement to the review of academic papers, 12 often cited books were also reviewed in order to provide input from practitioners. The reviewed books were selected based on experiences from our previous studies [5][6][7] on integration and testing of large-scale and complex software systems, as described in Section 1.

| Literature review of academic publications
In the review of academic publications, we identified no systematic literature review focusing primarily on test environments. This observation corresponds well with how Garousi and Mäntylä [11] describe test environments as one of several overlooked areas in academic publications related to testing of software systems: 'The existing secondary studies have mostly focused on test-case design, test generation and execution. This leaves the need for secondary studies in the areas of test-environment development and setup, test results evaluation and reporting'. The same type of criticism also exists for software testing models such as TMMi (Test Maturity Model integration); for example, Hrabovská et al. [20] state that a problem with TMMi is the 'lack of important key areas such as Test Environment'.
We found that the publications included in the literature review address different types of challenges or problem areas but often leave out areas that other authors consider to be the core issues. For example, Wiklund et al. [21] describe three impediments for software test automation: no test environment available, inadequate environment configuration management and an untested test environment. A similar (but only partly overlapping) summary from Ramler and Gmeiner [22] includes experiences from four challenges related to test environments for automated testing: having the right test environments available on time and on budget, test environments for products including hardware and software systems, software and hardware variants, and support for automated testing. In the literature review, we also found that many publications only provide generic and shallow recommendations, such as 'test environment has to be organized and available for testing when needed' [20].
Some studies focus on challenges related to hardware: Breivold and Sandström [23] describe hardware-related challenges in software testing as availability of test infrastructure, fault tolerant facilities, flexibility of test environment setup and fault reproduction for debugging. Breivold and Sandström identify relevant research studies that address these issues using 'virtualization technology' and analyze their applicability in the industrial domain. Stanik et al. [24] address the same type of challenges with 'hardware as a service', which allows suppliers to save money on the development of test environments, as test cases 'can be shared in the federated cloud'. Parveen and Tilley [25] discuss when to migrate software testing to the cloud and address two perspectives: the characteristics of an application under test and the types of testing performed on the application. Other publications are even more domain-specific: Sebastian et al. [26] outline a 'seamless test environment for embedded wireless networks' and describe how an efficient design flow requires 'a testing environment, which is continuous, simple and automated at various levels in the development life cycle'. Gomez and Bajaj [27] discuss test environments for Internet of Things (IoT) systems and identify two challenges: no clearly defined architecture for IoT systems and high complexity for IoT systems, as they require 'a lot of interoperability between different layers and devices'. Laukkanen et al. [28] investigate problems when adopting continuous delivery and find 'test flakiness' and 'resource problems' related to test environments.

| Literature review of books related to testing
The review of often cited books revealed a similar pattern to the review of academic publications: books about testing in many cases do not discuss questions related to test environments at all, or describe only certain aspects of the topic. For example, Hendrickson [29] and Whittaker [30] both describe manual exploratory testing but do not discuss the characteristics of test environments suitable for that type of testing. Kaner et al. [31] recommend to 'give special handling to bugs related to the tools or environment' but relate to problems with the environments only as, for example, 'O/S bugs' (not problems in a dedicated test environment). Testing is often described as an important practice for continuous integration and delivery, for example, as described by Duvall [32] in '100% of tests must pass for every build'. In spite of that, Duvall only includes testing tools and testing frameworks (primarily for unit testing) in a list of 'testing resources'. In a similar way, Humble and Farley [8] recommend to 'manage your environment' but refer specifically to operating systems and software packages and not to other aspects of a test environment. Other books provide a more complete approach; for example, Black [33] describes the four components of 'the test system' as 'test team', 'testware', 'test processes' and 'test environment'. The test environment is described as 'hardware, software, network infrastructure, office and lab space, chairs, tables, phones, and all the other items that make up the testing workplace'. Test tools, however, are in this definition labelled as 'testware' together with test cases and test data.
The responsibility to set up and maintain the test environment can be described both as belonging to the local team and as a common infrastructure. Gregory and Crispin [34] refer to test environments as something handled locally to secure test environment capabilities: 'Your team must make the investment so that you can effectively conduct automated and exploratory tests quickly and efficiently'. Whittaker et al. [35] take a somewhat opposite approach, describing how 'a common infrastructure that performs the compilation, execution, analysis, storage, and reporting of tests has evolved'. Due to this, Google engineers can focus on 'writing test programs and submitting them to this common infrastructure to handle the execution details and ensuring that the test code gets the same treatment as functional code'.
Challenges related to test environments are primarily described as relating to the construction of test facilities. For example, Gregory and Crispin [36] state that 'many agile teams struggle with creating and maintaining useful test environments' and suggest 'to have several test environments for different purposes', one of them 'a copy or at least a good representation of production'. However, Larman and Vodde [37] argue that a common assumption is that the creation of the test environment is the difficult part, while other important aspects are under-appreciated, for example, 'maintenance and evolution is more effort than initial creation'. In a similar way, Whittaker et al. [35] elaborate on the importance of testability: 'A Software Engineer in Test's first job is testability'.

| Summary and discussion
In summary, the literature review showed that test environments appear as a term in thousands of academic publications, but problems or success factors are often shallowly described or just mentioned in passing. Academic publications and books address different types of challenges or problem areas but often leave out areas that other authors consider to be the core issues. Some publications describe problem areas or success factors for test environments, but in many cases in wordings such as 'test environment has to be organized and available'. The responsibility to set up and maintain the test environment is described both as belonging to the local team and as a common infrastructure. Challenges related to test environments are primarily described as relating to the construction of test facilities, but some publications touch upon related areas such as maintenance and product testability.

| Interview series with one of the studied companies
The literature review (presented in Section 3) did not result in a comprehensive list of impediments or challenges related to test environments. In response to this, a set of primary interviews was conducted with one of the studied companies.
The purpose of this first series of interviews was to provide insights from a large number of interviewees in the same organization.
The primary interviews included 30 individuals from Company A, with on average 17.6 years of experience from the software industry (spanning from 2 to 32 years). Twenty of the interviewees worked in product development in the company, three worked with pre-studies for new products, and seven worked with development and maintenance of test environments (as defined in Section 1.2). All interviewees were working with testing of the product (part-time or full-time), that is, they had experience with the company's test environments. Ten of the interviewees identified themselves primarily as developers, six as testers, six as technical managers and six as other roles (e.g., technical coordinator). The interview guide for the primary interviews included the following questions, designed to provide answers to the research question described in Section 1:
• IQ1: Do you think your test environments currently enable efficient and effective system testing?
• IQ2: What are your major pain points related to testing and test environments, and what would you like to change to improve how your test environments support efficient and effective testing?
• IQ3: What major trends (global trends or trends in your industry) will affect testing in your organization, and what do you need to change in your test environments to enable efficient and effective testing for future development projects?
• IQ4: What challenges do you think must be taken into account when designing test environments for large-scale software systems, which also include interfaces to physical systems (e.g., mechanical and electronic systems)?

| Interviews with Company A
The interviewees from Company A generally had a quite positive attitude to their test environments, with 28 of the 30 individuals responding 'Agree' or 'Somewhat agree' to the question 'Do you think your test environments currently enable efficient and effective system testing?' Two of the interviewees pointed out the difference between effective and efficient, both of them arguing that the test environments in their company were more effective than they were efficient.

The interviewee responses on pain points and improvement areas covered many different areas. Test environment usability was described as an important improvement area by many interviewees. The interviewees asked for better presentation of information such as simulator configuration, presentation of variables or parameters during the test, or a list of known problems. Another interviewee described that testing in the main test facilities was often conducted with many participants, representing different engineering disciplines in the product. This affected usability, as not everyone could see the same information. The interviewees also talked about how the system under test must be designed for testability. One of the interviewees who had been involved in design and implementation of the product's platform stated that 'the platform has no transparency, so debugging the system is very difficult'. This was mirrored by comments from engineers working with applications executing on the platform, for example, 'The product we are testing should be designed in a way to let us manipulate data and test fault injection'.

Test environment stability was described as a problem, with regard to hardware configurations as well as stable software. One interviewee described, 'there is always something wrong when you change the hardware configuration', referring to problems with switching cables or setting parameters. Another interviewee referred to stability in the simulation software, explaining 'if one of the simulator models crashes, that should not stop the complete simulation'. Eighteen interviewees requested more simulator test functions, including various requests to test not only specific functions in the product but also fault injection for robustness testing. The interviewees pointed out problems related to executing the right test activity in the right test environment ('we test too much in test environments with real hardware'). Some of the interviewees asked for better support for automated testing, for example, as in 'scripted tests that can test all types of functions'. Removing startup time before the actual test was described as an important improvement area. This was primarily described as 'faster software loading', but some interviewees also asked for 'automated startup of the test environment'. Some of the interviewees also described problems related to fast and simple data transfer to and from the test environments.
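One of the stability requests above ('if one of the simulator models crashes, that should not stop the complete simulation') can be made concrete with a minimal sketch of a simulation loop that quarantines failing models. The `Model` protocol and the skip-on-failure policy are our own illustrative assumptions, not a description of Company A's simulator software.

```python
from typing import Protocol

class Model(Protocol):
    """Illustrative interface for a simulator model; not Company A's API."""
    name: str
    def step(self, t: float) -> None: ...

def run_simulation(models: list, steps: int, dt: float = 0.01) -> set:
    """Advance all models step by step; a model that raises is quarantined
    so that one crashing model does not stop the complete simulation."""
    failed = set()
    for i in range(steps):
        t = i * dt
        for model in models:
            if model.name in failed:
                continue  # skip crashed models, keep the rest of the run alive
            try:
                model.step(t)
            except Exception as error:
                failed.add(model.name)
                print(f"model {model.name} failed at t={t:.2f}: {error}")
    return failed  # report which models were quarantined during the run
```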
The interviewees also described many challenges for future development projects, coming from global trends or trends in their industry. Test environments for even more complex products were described as an upcoming challenge. One of the interviewees explained: 'With a more complex system, you must better visualize what really happened'. An accelerating pace of change was also described as a coming challenge for test environments. One interviewee explained the problem in the following way: 'The pace of change in the product and the test environment will be higher, so the test environment must be more user friendly'. Some of the interviewees believed the test environment must be able to continuously support new types of simulations, for example, 'AI systems will require much more CPU power'. Better models of physical systems were described as increasingly important. The interviewees described this as something related to saving money, for example, 'We must test more with simulator models - real hardware is too expensive'. The interviewees described an increased need for the capability to test a large-scale system based on AI and/or machine learning. The interviewees described not only how larger software systems need more automated tests but also how AI could change some types of testing: 'If the system is AI-based, we can automate also the end-user tests'. This, however, implies new problems to solve (e.g., 'How do we evaluate the end-to-end test results if we do not have a user for the system?'). Nine interviewees also talked about how the test environments must enable testing of system of systems. This related not only to integration of larger test environments ('connecting test environments') but also to problems related to multi-site environments.
Finally, the interviewees were asked about challenges related to large-scale and bespoke hardware. The interviewees described additional challenges when it comes to designing the right set of test environments, described by one of the interviewees as 'the right mix of real hardware and simulator models'. Some of the interviewees described construction of good models of physical systems as another additional challenge ('simulation of, e.g., how a liquid flows through the pipes is quite difficult'). Others talked about difficulties integrating some types of hardware in a test environment. 'You cannot just test with models', one of the interviewees explained. Another participant pointed out 'timing with real hardware' as particularly difficult. Testing a large-scale system was described as a problem in itself, as many things must work together ('more interfaces means more things can go wrong'). Many interviewees described challenges related to presenting an overview of the status of the complete system and the test environment for a large-scale system. Finally, locating the root cause of a problem in a large-scale system was described as an additional challenge. One of the interviewees described this as follows: 'symptom and root cause are often not in the same subsystem in a large product'.

| Summary and discussion
The responses from the interviewees included a large number of statements and comments, which were coded and collated into themes.
• Eight main themes were identified from the interviewee responses about pain points and improvement areas (IQ2), each theme based on interview responses and comments from between 10 and 20 interviewees.
• Six main themes were identified from the interviewee responses about challenges for future development projects (IQ3), each theme based on interview responses and comments from between five and nine interviewees.
• Six main themes were identified from the interviewee responses about challenges related to large-scale and bespoke hardware (IQ4), each theme based on interview responses and comments from between five and eight interviewees.
A thematic network was constructed [17], resulting in a thematic map of problems and improvement areas for test environments. A simplified version of the thematic map is presented in Figure 2. As a final step, one or several representative quotes were selected from the transcripts, included in the descriptions of each theme, and used in the visualization of the thematic map.
FIGURE 2: A simplified version of the thematic map of problems and improvement areas for test environments, based on the interview series with Company A

| FOCUS GROUPS

| Focus groups at each of the studied companies
The next step of the study was a series of focus groups, one with each of the remaining studied companies (Company B, Company C, Company D and Company E). The purpose of the focus groups was to provide insights from companies in different industry segments and to compare the results from the primary interviews with Company A (presented in Section 4) with responses to the same questions from focus group participants from Company B to Company E. The focus groups included engineers with senior roles with regard to testing and test environments in each company, such as senior test leader. Two representatives from Company A were also invited to each focus group, purposely included to discuss differences and similarities between the companies (involved in the discussions after the initial responses from the other participants). The number of participants and the type of meeting used for each focus group (physical location or virtual Teams meeting) are presented in Table 1.
The participants in each focus group were asked the same questions as in the primary interviews (presented in Section 4.1), that is, rating their test environments; pain points and what they wanted to change; major trends and how to enable future development projects; and additional challenges coming from interfaces to physical systems. After the first responses to each question, the results for the same question from the primary interviews (with Company A) were presented, one question at a time, to enable further discussions on areas previously not touched upon.
In response to the need for secondary studies related to test environments (described by, e.g., Garousi and Mäntylä [11]), we have included detailed descriptions of the results from the focus groups, intended primarily for researchers conducting secondary studies or replicating this study.

| Focus group with Company B
The participants from Company B expressed somewhat mixed feelings about their test environments. The participants described that the test environments supported automated unit and component testing well but did not support system testing of the complete product in the same way. Especially automated testing of the complete product was described as a problem area. One of the focus group participants described the root cause as 'no one is responsible'.
The discussion around pain points and improvement areas tended to evolve primarily around the system architecture for the system under test and ways of working in the development organization. One of the participants stated that 'the software architecture for the product defines the test environments' and explained further that 'it is easier to test an API - testing a graphical interface is expensive'. Another participant questioned the current way of working: 'The development teams might be too independent - this might be an obstacle for good testing of the complete system'. The discussion also touched upon overall questions such as 'what is a feature' and 'what is the system'. The participants from Company B did not relate as much to problems with simulator models of physical systems and explained this by their hardware (in comparison) not being very expensive. However, a need for models was recognized for scalability tests including a lot of devices. Simulator stability related to hardware was considered to be a problem, as well as test environment usability ('sometimes it is difficult to do things right'). One participant described: 'we do some test only just before release to customer, but the reason is that it is difficult - not that we do not need to test more often'. Some capabilities were missing, for example, the capability to provide correct test data to the sensors. The participants also described visualization of test results as related to test environment usability. The focus group participants recognized the problem of running the right test in the right test environment and described portable tests (not depending on a particular test environment) as a solution. Long startup times and data transfer were considered to no longer be a problem at Company B, but might be again in the future due to the increased amount of data with machine learning.
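The 'portable tests' idea mentioned by the Company B participants can be sketched as tests written against an abstract environment interface, with one adapter per concrete test environment, so that the same test runs unchanged in, for example, a simulated and a hardware-backed environment. The interface and adapter below are hypothetical illustrations, not Company B's actual design.

```python
from abc import ABC, abstractmethod

class TestEnvironment(ABC):
    """Hypothetical interface that portable tests are written against."""
    @abstractmethod
    def set_sensor_input(self, sensor: str, value: float) -> None: ...
    @abstractmethod
    def read_output(self, signal: str) -> float: ...

class SimulatedEnvironment(TestEnvironment):
    """One adapter; a hardware-backed adapter would implement the same API."""
    def __init__(self) -> None:
        self.signals: dict = {}
    def set_sensor_input(self, sensor: str, value: float) -> None:
        self.signals[sensor] = value  # feed the simulated sensor
    def read_output(self, signal: str) -> float:
        return self.signals.get(signal, 0.0)

def test_sensor_passthrough(env: TestEnvironment) -> None:
    """A portable test: it runs unchanged on any TestEnvironment adapter."""
    env.set_sensor_input("temperature", 21.5)
    assert env.read_output("temperature") == 21.5

test_sensor_passthrough(SimulatedEnvironment())
```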
The most important challenges for future development projects were described as system of systems and increased complexity. The participants discussed testing with other systems, and especially testing in a cloud environment. The major source of increased complexity was described as an increased level of sensor fusion. AI and machine learning were described as trends that already affected the company's ways of working. Using AI to test the system was described as not only an interesting field but also one raising many new questions ('how do we know that it works'). Cyber security was also described as a major trend that affected the company, driving a need for continuous deployment to handle cyber threats. An increased pace of change was described as 'no longer a problem', as the organization had 'worked a lot with this'. We also observed that Company B did not describe new technologies as disruptive game changers, but instead as something natural in their ways of working ('we have new things all the time').

The Company B focus group also discussed challenges related to large-scale and bespoke hardware. The participants described difficulties testing the complete product. One participant even stated 'we do not test the complete product in a structured way'. The workshop participants agreed that symptoms and root cause for a problem might be in different parts of the product, described, for example, as 'problems ripple through the whole system'. The workshop participants described switching between hardware configurations as a challenge for a system with bespoke hardware, but also practical problems such as cables and cooling air for the hardware could be considerable.

| Focus group with Company C
The participants from Company C discussed the question 'Do you think your test environments currently enable efficient and effective system testing?' and seemed to agree that the company's test environments were probably quite effective, but not efficient to the same degree. One of the focus group participants argued that they had good test environments for testing a software module, but not so much for testing the complete system. Another individual stated: 'We use the test environments in an inefficient way due to integration problems'.

Testability for the system under test and removing long startup times were described as the most important improvement areas. The participants described testability for the system under test as a big problem, more important than many problems in the actual test environment. The participants described an ongoing transfer to a new product platform and described testability for the new platform as 'better', compared to the old platform (described as 'not so good'). Availability of test environments was described as another major pain point, especially as the need for test environments had increased due to more product variants. To quote one of the focus group participants: 'We increased the amount of software, but we did not increase the test capacity'. Another individual described his view of the root cause: 'We do not have enough funding for the test environments. Some managers question if this is important, since we have the real [vehicle]'. Other participants agreed, with comments such as 'we are under-funded'. One of the participants stated that 'testing takes long time', including the time to install software in the test environment. The focus group participants also described that they used the test environment 'in an inefficient way' and further explained that 'testers do not use test environments they are not familiar with'. Someone else described how 'some people wait for the real [vehicle]' as they did not trust testing with models of the physical systems. This, however, also caused problems, as availability of hardware components for the rigs was a problem. The participants described a need for test environments for non-functional requirements, for example, robustness and capacity, and explained that this remained unsolved since 'the people building test environments do not know how the testers want to use the test environments'. The participants also described that the company used different test frameworks for different subsystems 'due to legacy reasons', going back to when the product was primarily based on integration of hardware components (now implemented in software). Different, highly independent parts of the organization were responsible for the different subsystems ('we have kingdoms'), including responsibility for the test environments. This was described as a major problem, affecting 'the quality perceived by the customer', especially for functions spanning several 'kingdoms' in the organization.
The focus group also discussed challenges for future development projects. Autonomous systems were described as a driver for new types of simulation of scenarios. Other trends discussed were new types of partnerships and larger amounts of data. More complex products, new types of simulation and AI/ML were described as the most important challenges. More complex products were based on a need to integrate 'new types of functions' in the product. This would also imply new types of simulation due to digitalization of previously analogue systems and functions. System of systems was also discussed as an emerging field, raising new types of questions, such as 'Who is responsible for the system?' and 'How do we handle cyber-security?'

The Company C focus group also discussed challenges related to large-scale and bespoke hardware. The participants described scale in itself as a problem, for example, 'it is tricky to construct the system from an autonomy, to see the dependencies'. This also affected test environments, as it is difficult for the tester to 'see dependencies and to see the root cause of a problem'. As a solution, one workshop participant asked for a test environment 'which shows the status of the complete product'. Bespoke hardware could, according to the workshop participants, be a problem due to a lack of testability for the hardware systems. Interfaces to hardware also affect the test environment. To quote one of the participants: 'We need to make decisions on test environments early due to hardware dependencies'.

| Focus group with Company D
At first, the participants from Company D seemed to have almost opposite opinions on whether their test environments enabled efficient and effective system testing. However, after a longer discussion involving all the participants from Company D, the focus group participants agreed that the answer depended on many things; for example, effectiveness and efficiency were better in some test environments than in others and better for testing the subsystems than for testing the complete product.
When the participants in the focus group were asked about pain points and improvement areas, they started with problems related to ways of working and organization: One individual described their ways of working as 'effective but maybe not efficient', questioning the efficiency of the test scope ('what should we test?'). The discussion evolved to how the company used 'different systems for testing', causing disarray, inefficiency and confusion. The participants described that another problematic cause of disarray was high staff turnover and inexperienced engineers. Due to this, test environment usability was not described as a problem; the problem was instead to 'have more people understand what is available'. One individual described their 'three most important problems' as 'competence problems, competence problems and competence problems'. The participants agreed that this was primarily a problem for the subsystems, as the organization testing the complete product had more experienced engineers. As a result, problems that should have been found at the subsystem level slipped through to test activities for the complete product.

The focus group participants also described problems in the actual test environments; for example, Human Machine Interface (HMI) testing was described as 'a weakness', as well as 'efficient testing of product variants'. Stability was described as a problem, more in hardware-in-the-loop simulators than in software-in-the-loop simulators. Long execution time for the tests was described as another problem (however, not the startup time). The solution was described as going from hardware-in-the-loop simulators to software-in-the-loop, which required good models of the physical systems. Lack of testability was also described as a major problem by the focus group participants: One individual even stated 'we have not done one single thing to make the vehicle testable'. Another participant agreed 'one hundred percent', explaining how 'testability is not prioritized' and 'you must design the product in a way so you can test it'. Better support for automated testing was not described as a problem. Neither was data transfer; instead, the participants described an emerging problem of structuring their data.
The focus group also discussed challenges for future development projects, identifying automated vehicles, electrification and the transformation from products to services as the most important trends. Automated vehicles were described as a source of increased complexity in the product. This caused an increased test scope, which could only be handled with more test activities in the customer's environment. However, this also meant an increased in-house test scope and more testing on simulators and models as, to quote one of the participants, 'you cannot just throw some [untested] functions to the customer when it is about autonomous driving'. Automation and AI also raised new questions, such as 'how do we know what is right or wrong in an AI-based system'. The Company D participants also described how they were in a transformation from product to service, which also implied testing of system of systems: 'the product is a part of the customer's solution', which must be included in the test environment. Due to this changed business model, the company needed to increase the pace of change in the product to 'be able to add new value fast'. This had also changed how management looked at test environments, going from more or less seeing them as an unwanted cost to seeing them as a valuable asset. To quote one of the participants: 'Now we have lots of money'. The transformation also included moving to a more agile way of working, including distributed responsibility for most of the test environments. This meant clearer responsibility in many ways, but missing coordination and overall strategy for test environments. Someone else expanded on this with 'continuous integration and delivery increases the need for test environments, and we are not in pace with that'.
Finally, the Company D focus group also discussed challenges related to large-scale and bespoke hardware. The participants described the problem of large scale simply as the problem to 'make many things work together'. A challenge related to bespoke hardware was to have 'testable hardware', that is, to be able to observe what is happening in the mechanical or electronic systems. Integrating different types of models (possibly from different subcontractors) was also described as a problem, as was connecting those models to the working software in the product. Simulation of the environment for the product was also described as an additional challenge. The Company D representatives responded that they recognized all the problem areas derived from the interviews at Company A. Especially to 'design the right set of test environments' was described as important.

| Focus group with Company E
The participants from Company E expressed generally positive feelings about their test environments. The participants seemed to agree that test environments used by the developers should be flexible (e.g., allowing manipulation of data) and that test environments for end-to-end testing should be similar to the customer's environment. The focus group participants also discussed how 'efficient' and 'effective' could be each other's opposites; for example, effective testing could mean a need to run the same test on all available hardware configurations, which arguably is not very efficient.
When the participants from Company E discussed pain points and improvement areas, they focused very much on speed and feedback loops, that is, shortening the time from when a problem is identified in a test to when a developer can correct the software. To quote one of the participants: 'To come to the test takes a long time, and getting the feedback back to the developers also takes too long. It's about collecting and visualizing all data, which is a technical problem. But it is also about sending the problem to the right design team. But who is responsible if you find the problem after a week?' Another participant asked for better test management as a solution, including visualization and traceability. Other topics related to speed and pace of change were that deploying software to the hardware takes too long, testing on many hardware configurations, and long lead times to make changes in the test environment ('very time-consuming'). The participants also discussed other issues related to ways of working; for example, one individual stated 'we focus on the tests going passed - but not so much on what we should do if we find a new problem'. Another participant asked for 'engineers who understand the product, so they can write test cases'.

The focus group participants described an ongoing major change as the company was moving into cloud-based services, with 'the promise to be independent from hardware'. Despite this, testing was (according to the participants) still very dependent on the right hardware resources. One individual explained: 'we have very hard coupling between the layers in the product, from hardware to application'. This increased the test scope, due to a need to run the same test on different hardware configurations and with different settings. Another topic was testability, which was discussed with great engagement. One participant stated that 'the test cases are depending on the architecture'. Another commented 'the test levels come directly from the architecture', asking for better testability. One individual, however, had the somewhat opposite opinion, claiming that 'we do not want a product the tester likes, we want a product the customer likes'. Long update times for test tools were also described as causing problems in testing all types of functions. The distributed responsibility for test environments was also described as problematic ('you optimize [the test environment] for your organization, not for the product'). This was also described as positive due to clearer responsibility, but as negative as problems slip to test activities later in the test chain. Support for test automation was not described as a problem. Test flakiness had previously been a problem, but stability problems were no longer seen as a big issue.
Company E challenges for future development projects were described as primarily coming from the ongoing transformation to cloud-based services. 'To trouble-shoot with cloud is more difficult' according to one of the workshop participants. Another individual explained: 'you are never completely free from hardware'. An additional challenge coming from cloud was, according to the participants, a need to test on a larger number of hardware configurations (as more alternatives were allowed) but still within the same budget. Other trends that affected the company were increased demands for robustness and performance. However, one individual added 'there are lots of new words and I hope system management keeps up with what they mean for the product', indicating that new trends and buzzwords were arriving all the time. AI was not seen as a major upcoming change ('we have not seen anything so far'). One participant, however, commented: 'this will be a bigger thing for us in the future'. In the same way, testing systems of systems was not described as an upcoming change but as an ongoing problem, often related to organization.
When the Company E focus group discussed challenges related to large-scale and bespoke hardware, they returned to cloud, and how 'cloud promised to be independent from hardware'. The result was, according to one participant: 'we are still depending on specific hardware, but we have less control and more to test with multiple hardware configurations'. Other individuals added how a newly allowed 'hardware track' meant increased test scope, and how especially performance testing depended on testing on the right hardware. One participant explained that new hardware was often expensive, and sometimes arrived without the middle-layer software (e.g., from Intel). To quote another participant: 'Test environments are becoming more and more expensive'. The result was a lower number of test facilities and therefore lower test capability, which was hard to combine with an increased test scope across multiple hardware configurations. Problems with large scale were described largely as problems related to the scale of the organization, for example, 'if you integrate large systems there will be a lot of software, maybe from 50 design teams, so who should you notify if you find a new problem?' The focus group also returned to testability and seemed to agree that 'the subsystem with the least analyzability defines testability for the complete system'.
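To make the scaling effect concrete, the sketch below shows how a single logical test multiplies across hardware configurations in a parametrized test suite. This is a minimal illustration only; the configuration axes and the run_on() helper are hypothetical and not taken from any of the studied companies.

```python
import itertools

import pytest

# Hypothetical configuration axes; a real product would enumerate the
# hardware tracks and settings it actually supports.
HARDWARE_TRACKS = ["track_a", "track_b", "track_c"]
OS_VARIANTS = ["linux", "rtos"]

CONFIGURATIONS = list(itertools.product(HARDWARE_TRACKS, OS_VARIANTS))


def run_on(hardware: str, operating_system: str) -> bool:
    """Stand-in for dispatching the test to a real or simulated rig."""
    return True  # placeholder result


# One logical test case becomes len(CONFIGURATIONS) executions: allowing
# a fourth hardware track grows the scope from 6 to 8 runs, with no
# corresponding growth in test facilities or budget.
@pytest.mark.parametrize("hardware,operating_system", CONFIGURATIONS)
def test_core_function(hardware: str, operating_system: str) -> None:
    assert run_on(hardware, operating_system)
```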

| Summary and discussion
During the workshops with the focus groups, we found that the companies to a large extent shared a common view of problems and challenges. We also observed how participants in all of the focus groups could relate to the pain points and improvement areas derived from the primary interviews with Company A. However, we also identified that in many cases the different companies used slightly different terminology or described the problem in a somewhat different way. In particular, the boundary between current problems and problems for future development projects was in some cases not the same in the studied companies.
We interpret this as follows: a new trend does not strike a company like lightning, but instead gradually introduces new technologies and new ways of working, and companies may perceive this differently based on industry segment and the company's business strategy. Due to this, we find that the list of improvement areas from the primary interviews with Company A (presented in Section 4.3) should not be seen as generalizable to all types of large-scale and complex software systems.
Instead, from the results of the focus groups with all the studied companies, combined with the results of the primary interviews, another pattern emerges with intrinsic and extrinsic success factors for test environments (described in Section 6.1). As the next step of the study, the results from the analysis of the interviews and focus groups (described in Sections 6.1-6.3) were presented at a cross-company workshop (described in Section 6.4) to further strengthen the reliability and generalizability of the findings.

| Intrinsic and extrinsic success factors
Statements and comments from the focus group participants were transcribed in the same way as during the primary interviews. The material was combined with the results from the primary interviews and analyzed with thematic coding analysis [17], resulting in a list of intrinsic and extrinsic success factors for test environments. Intrinsic success factors include characteristics and capabilities existing within a test environment, which enable efficient and effective testing of large-scale software systems (e.g., the capability to drive around if the product is a vehicle). Extrinsic success factors are properties not inherent to the test environment, but still vital for a successfully implemented test environment (e.g., limited debug information exposed to the tester due to architectural restrictions in the system under test). These external dependencies might not be evident to all affected stakeholders; that is, an organization designing and implementing a test environment must actively ensure both its intrinsic and its extrinsic success factors in order to provide test environments that enable efficient and effective testing of large-scale software systems. The intrinsic and extrinsic success factors for test environments are visualized in a thematic map [17] with themes and subthemes, presented in Figure 3. The labels (A, B, C, D and E) in the figure represent the companies in the study (e.g., Company A) behind each subtheme, based on the statements and comments from the company representatives, as described in Sections 4 and 5. The intrinsic and extrinsic success factors are further described in Sections 6.2-6.3.
FIGURE 3 Intrinsic and extrinsic success factors for test environments that enable efficient and effective testing. The labels (A, B, C, etc.) represent the companies in the study (Company A-Company E).
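As a reading aid, the theme and subtheme structure of Figure 3 can be captured as a simple nested mapping. This is a minimal sketch using only the factor names and definitions given in Sections 6.1-6.3; the per-company labels from the figure are omitted.

```python
# Thematic map of Figure 3 as data: two themes, three subthemes each.
SUCCESS_FACTORS: dict[str, dict[str, str]] = {
    "intrinsic": {
        "Test environment capabilities":
            "ability to test all functions or other aspects in the system under test",
        "Test environment usability":
            "ease-of-use as perceived by the tester, including time to accomplish a task",
        "Test environment stability":
            "reliability and quality over time for the expected capabilities",
    },
    "extrinsic": {
        "Product testability":
            "degree to which the system under test supports testing in all relevant test contexts",
        "Test organization":
            "how the organization supports testing in the test environments",
        "Business strategy":
            "actions and decisions a company plans to take that also affect the test environments",
    },
}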

| Intrinsic success factors for test environments
Test environments enabling efficient and effective testing of large-scale software systems depend on the following intrinsic success factors:
• Test environment capabilities: The test environment's ability to test all functions or other aspects in the system under test.
• Test environment usability: The test environment's ease-of-use as perceived by the tester, including the time needed to accomplish a task.
• Test environment stability: The test environment's reliability and quality over time for the expected capabilities.
Test environment capabilities is the test environment's ability to test all functions or other aspects in the system under test. A quote from Company A demonstrates an example of a missing capability: 'I cannot test with the emergency battery'. Test environment capabilities also includes the capability to test non-functional requirements such as robustness and capacity (as described by Company C). The companies also pointed out other weak spots, such as the capability to test cyber security (Company C), the capability to test HMI (Company D), the capability to provide test data to the sensors (Company B) and access to the real bespoke hardware (Company A, Company C and Company E). All companies described AI and autonomy as technologies requiring new capabilities in test environments (with some of the participants in the Company E focus group a bit more doubtful). Four of the five companies (Company B-Company E) described that they did not have sufficient capability to test the complete product, compared with how they tested their subsystems. Company B particularly described automated testing of the complete product as a problem area.
Test environment usability is the test environment's ease-of-use as perceived by the tester, including the time needed to accomplish a task. For example, the quotes 'sometimes it is difficult to do things right' (from Company B) and 'so many settings in the test environment can be wrong' (from Company A) demonstrate requests to make the test environment more user friendly. Company A interviewees ask for better presentation of the simulator configuration, presentation of variables or parameters during the test and a list of known problems. Company A also includes usability for a group of testers testing together in a test environment, that is, providing multiple displays presenting all relevant information. Company B, Company D and Company E describe collecting and visualizing data and test results as an expanding problem area. Three of the five companies (Company A-Company C) describe a need to show the status of the whole system in order to be able to trouble-shoot complex problems. Company A, Company C and Company E describe a long start-up time for the actual test as a problem, which previously was also a problem for Company B. Company D did not describe test environment usability as a major problem, but explained that this was because the company had a bigger problem: to 'have more people understand what is available'.
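The usability problems around misconfiguration ('so many settings in the test environment can be wrong') and the request for a list of known problems suggest up-front validation of the test setup. The sketch below is illustrative only; the configuration fields, threshold and known-problem entries are invented for the example, not taken from Company A.

```python
from dataclasses import dataclass


@dataclass
class SimulatorConfig:
    model_version: str
    sample_rate_hz: int
    hardware_in_loop: bool


# Hypothetical known-problem register, shown to the tester before a run.
KNOWN_PROBLEMS = {
    "2.1": "emergency battery model not supported in this version",
}


def validate(config: SimulatorConfig) -> list[str]:
    """Collect human-readable warnings instead of failing mid-test."""
    warnings = []
    if config.sample_rate_hz < 100:  # invented threshold
        warnings.append(
            f"sample rate {config.sample_rate_hz} Hz is below the recommended 100 Hz"
        )
    if config.model_version in KNOWN_PROBLEMS:
        warnings.append(f"known problem: {KNOWN_PROBLEMS[config.model_version]}")
    return warnings


# Example: surface problems before the long test run starts.
for warning in validate(SimulatorConfig("2.1", 50, hardware_in_loop=True)):
    print("WARNING:", warning)
```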
Test environment stability is the test environment's reliability and quality over time for the expected capabilities. For example, Company D describe 'stability in the hardware-in-the-loop simulators' as a problem, demonstrating how the reliability of the hardware installed in the test environment affects test efficiency. Stability problems are reported by Company A, B and D as primarily related to hardware, for example, 'there is always something wrong when you change the hardware configuration' (Company A). Plausible explanations are a lack of stability in early prototypes of new hardware installed in the test environment or a lack of automated switching between hardware configurations (eliminating human errors). Company A and Company D also describe stability problems in the software, for example, 'if one of the simulator models crash, that should not stop the complete simulation' (Company A). Company E describe how test flakiness was previously a problem (with problems in the test environment as one of several root causes); however, stability problems were no longer seen as a big problem in Company E.
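The Company A quote about crashing simulator models describes a fault-isolation requirement. A minimal sketch of one way to meet it is shown below, assuming a simulation loop that steps a set of models; a failing model is quarantined and logged while the rest of the simulation continues. The Model protocol and class names are illustrative, not from the studied simulators.

```python
import logging
from typing import Protocol


class Model(Protocol):
    name: str

    def step(self, dt: float) -> None: ...


class SimulationLoop:
    """Steps all models, isolating any model that crashes."""

    def __init__(self, models: list[Model]) -> None:
        self.models = {model.name: model for model in models}
        self.failed: set[str] = set()

    def step(self, dt: float) -> None:
        for name, model in self.models.items():
            if name in self.failed:
                continue  # quarantined: keep the rest of the run alive
            try:
                model.step(dt)
            except Exception:
                logging.exception("model %s crashed; isolating it", name)
                self.failed.add(name)
```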

| Extrinsic success factors for test environments
Test environments enabling efficient and effective testing of large-scale software systems depend on the following extrinsic success factors:
• Product testability: The degree to which the system under test supports testing in all relevant test contexts.
• Test organization: How the organization supports testing in the test environments.
• Business strategy: The actions and decisions a company plans to take to reach its goals and objectives that also affect the test environments.
Product testability is the degree to which the system under test supports testing in all relevant test contexts. The quote 'the platform has no transparency, so debugging the system is very difficult' (Company A) is an example of perceived low testability, specifically low observability in the system. Company C and D describe similar problems with testability for hardware systems. Three companies (Company B, Company C and Company E) describe how the product's architecture affects testability. The architecture affects observability (the degree to which it is possible to observe test results) and controllability (the degree to which it is possible to control states and input to the system under test). In an architecture with a high degree of separation of concerns, each component in the system has a single, well-defined responsibility. Conversely, troubleshooting complex problems is more difficult in a system with a low degree of separation of concerns, showing, for example, in the comment 'problems ripple through the whole system' (Company B). Company A and Company C provide descriptions of similar problems. Company A and Company B describe a need for better support for automated testing, in particular automated testing of the complete product. This does not only relate to the test environment itself but also includes product automatability, that is, the degree to which it is possible to automate testing of the system under test. One comment from Company D stands out as an urgent call for improved product testability: 'We have not done one single thing to make the vehicle testable'.
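One way to read the observability/controllability distinction in code terms: a component whose inputs and outputs are injected can have its state controlled and its results observed directly by a test, whereas hard coupling hides both. The sketch below is a generic illustration; the BatteryMonitor component, threshold and event names are invented for the example.

```python
from typing import Callable


class BatteryMonitor:
    """Component with injected input (controllability) and output (observability)."""

    def __init__(
        self,
        read_voltage: Callable[[], float],  # a test can inject any value
        on_event: Callable[[str], None],    # a test can record every event
    ) -> None:
        self._read_voltage = read_voltage
        self._on_event = on_event

    def check(self) -> None:
        if self._read_voltage() < 11.5:  # invented threshold
            self._on_event("low-voltage warning")


# In a test, both input and output are fully under the tester's control:
events: list[str] = []
monitor = BatteryMonitor(read_voltage=lambda: 10.9, on_event=events.append)
monitor.check()
assert events == ["low-voltage warning"]
```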
Test organization is about how the organization supports testing in the test environments. The quotes 'you optimize [the test environment] for your organization, not for the product' and 'this leads to slippage to test activities later in the test chain' (Company E) exemplify how the organizational structure in the company is clearly related to the efficiency of the test environments. Four companies (Company B-Company E) describe a distributed responsibility as a problem, which might even affect 'the quality perceived by the customer' (Company C). However, two of these companies (Company D and Company E) also describe a distributed responsibility for the test environments as something positive, due to a clearer responsibility. Company A, having a more centralized organization for development and maintenance of the test environments, did not describe problems directly related to organizational structure in the same way as the other companies. Three companies (Company C-Company E) describe how problems with processes and ways of working are causing inefficiency, for example, 'we use the test environments in an inefficient way due to integration problems' (Company C). Company E describe how they optimize not just the time to execute the test in the test environment but also the time to get the feedback back to the developers. Company D describe competence as their primary problem, and how they need to 'have more people understand what is available'. Company C describe a similar competence problem, as 'testers do not use test environments they are not familiar with'.
Business strategy is an important extrinsic success factor, as the actions and decisions a company plans to take to reach its goals and objectives also affect the test environments. Company D describe how their new business model changed how management looked at test environments, going from an unwanted cost to a valuable asset. The observations from Company C paint a different picture, with comments such as 'we are under-funded' and 'some managers question if this is important'. We also find other examples of how the strategy for the test environments does not seem to be aligned with the company's business strategy: three of the companies (Company C-Company E) describe how an increased focus on continuous practices and feedback loops has caused an increased need for test environments, but the companies are not always 'in pace' with that. Company C-Company E also describe how business decisions to expand the number of product variants or available hardware configurations lead to an increased test scope, not always accompanied by extended funding of the test environments. All five companies describe how future partnerships and collaboration with the customer will increase the need for testing of systems of systems in the test environments. The company's selected industry segment also affects the test environments, for example, complex data transfer due to security aspects (Company A) or cyber security driving a need for continuous deployment (Company B).

| Cross-company workshop with five companies
As a complement to the interviews and the focus groups with the five studied companies (presented in Sections 4 and 5), we facilitated a cross-company workshop including 30 participants from all five companies in the study. The participants had roles in their companies as senior tester, test specialist or line manager (for a group of testers).
At the cross-company workshop, the researchers presented the results from the study, summarizing the setup for the primary interviews and the focus groups (presented in Sections 4 and 5) and presenting the intrinsic and extrinsic success factors (presented in Sections 6.1-6.3). The intrinsic and extrinsic success factors were very well received by the representatives from the five companies, with comments such as 'this makes perfect sense' and 'you have summarized it beautifully'. One participant asked 'Why has this not been done before?', which started a discussion among the workshop participants. The participants seemed to agree that a plausible explanation was that the focus in previous research has been on systems fully implemented in software, that is, not including any electronic or mechanical systems. Another workshop participant noted that 'test environment availability' was missing from the list of success factors, that is, enough test environments available to meet the demand from the testers. Other participants at the workshop then argued that availability should not be seen as general availability of a test environment, but instead as availability of the capabilities in the test environments. One individual suggested credibility of models of physical systems as another important success factor, possibly merged with the intrinsic success factor Test environment stability. This comment was supported by participants representing two other companies, but representatives from the remaining two companies seemed to disagree or not fully agree.
The workshop participants also discussed why the focus group participants from Company C had not discussed the success factor Test environment stability. However, one of the participants from Company C stated that 'We have this problem, especially there are problems for automated testing'. In the same way, it was discussed why Company A is missing under Test organization. One participant from Company A stated 'Competence is definitely a problem'. Another participant from Company A explained this partly as 'We have a separate organization responsible for test environments, with a clear responsibility'. We interpret this as an indication that all six intrinsic and extrinsic success factors are valid for all five companies in the study.
After the presentation of the results from the study (the intrinsic and extrinsic success factors), the workshop participants were divided into four groups, allowing the participants to discuss the presented results. The representatives from the five companies were distributed across the four groups, forming two groups with seven individuals and two groups with eight individuals. Each group was asked to discuss whether any of the presented success factors were interdependent, that is, where and how you should start working with the factors. The groups were also asked to discuss how responsibility for handling the intrinsic and extrinsic success factors should be assigned. The breakout sessions in smaller groups were followed by a summarizing session with all 30 participants, discussing the findings from each group. At the summarizing session, all four groups described that the best way to work with the factors is to start with the extrinsic success factors:
• Group 1 phrased this as 'start with extrinsic: the product and the organization provide the boundaries for the test environments'. Group 1 also presented an order for the extrinsic success factors: 'first business strategy, then product architecture, then organization'.
• Group 2 described this in a similar way as 'start with the business case, it should be weird to start with the test environment'.
• Group 3 simply stated 'start on the right-hand side' (in the visualization in Figure 3) and exemplified this with how observability and controllability were prerequisites for functional capabilities in the test environment.
• Group 4 had a more detailed description: 'start with the business case, then product and organization, then intrinsic factors', but added that there was 'no perfect recipe', that is, you must in many cases go back and forth between the factors.
The groups did not reach consensus in the same way for the question on responsibility for handling the intrinsic and extrinsic success factors. Instead, the groups presented general statements such as 'someone must be responsible'. The groups seemed to agree that responsibility for the test environments could be central or distributed, with pros and cons depending on the context. Group 3 argued that responsibility should lie within a separate part of the organization, responsible for the optimization of all test environments. Group 4 argued that it would be difficult to have a top-down approach, with one part of the organization that 'needs to know everything'. Group 1 described how 'testers must be able to provide input', and that this must be solved both with central and with distributed responsibility. However, all four groups agreed that test environments must evolve continuously to keep up with the pace of change in the company. Several of the participants talked about how 'a test environment is a product of its own' and must be treated as such. All five companies described that this was also how they 'tried' to work, but did not always succeed. The companies described not only examples of parts of their companies where this worked well but also examples where it did not work very well. To quote one of the participants: 'This is a complete mess in some parts of the company'.

| Discussion and comparison with the literature review
The publications identified in the literature review (presented in Section 3) describe problems or provide recommendations mirroring some aspects of the intrinsic success factors identified in this study: Ramler and Gmeiner [22] describe hardware availability as important to provide functional capabilities. Breivold and Sandström [23] touch upon test environment usability as they describe the need for flexibility of test environment setup and fault reproduction for debugging. Laukkanen et al. [28] describe test flakiness, corresponding to test environment stability. In a similar way, some of the identified publications mirror aspects of the extrinsic success factors: Whittaker et al. [35] discuss the importance of product testability. Gregory and Crispin [34] and Whittaker et al. [35] describe somewhat opposite approaches to the responsibility in the organization to set up and maintain the test environment, and Larman and Vodde [37] touch upon maintenance and evolution of test environments. However, in the literature review, we found no publication providing a holistic approach. Instead, published work tends to focus on one aspect, leaving out areas other authors consider to be the core issues. We argue that this also supports the novelty of the findings in this study.
The intrinsic success factors describe characteristics and capabilities existing within a test environment and are therefore a representation similar to what is described by quality models such as ISO25010, which have similar attributes. This provides an interesting direction for further work: to contrast the results from this study with quality models like ISO25010 (further described in Section 8.2). Such a study could also compare the results from this study with other publications (not included in the literature review); for example, Wang et al. [38] touch upon desired characteristics for a test environment and present approaches to set up and improve test environments.
This study takes a holistic approach, including both effectiveness and efficiency in the research question (presented in Section 1). This broad perspective did not seem to be a problem for the interviewees and focus group participants. Instead, participants from four of the five companies pointed out the difference between effective and efficient. Participants from Company A, Company C and Company D all argued that the test environments in their company were more effective than they were efficient. The participants from Company E raised a similar question, discussing how efficiency and effectiveness could be each other's opposites, for example, as effective testing could mean a need to run the same test on all available hardware configurations, which arguably is not very efficient.
The research question (presented in Section 1) was purposely phrased as efficient and effective testing of large-scale software systems, that is, not efficient and effective test environments (as it is the testing activities that add value to the company, not the test environment itself). However, this raises the question of whether the efficiency and effectiveness of the test activities can be mapped to the test environment, or whether they are also related to other challenges in testing. As this question is outside the scope of the study reported here, we refer to other related studies on general testing challenges: challenges in testing of automotive systems and telecommunication systems are reported by, for example, Garousi et al. [39]. Ali et al. [40] report challenges related to testing of highly complex systems of systems based on a study at a development site of a large telecom vendor. Garousi et al. [41] analyze embedded software testing challenges and provide recommendations to choose the right test techniques and approaches.

| Summary and discussion
Based on the primary interviews and the focus groups (reported in Sections 4 and 5), we identified a set of intrinsic and extrinsic success factors for test environments. Intrinsic success factors include characteristics and capabilities existing within a test environment (e.g., the capability to drive around if the product is a vehicle). Extrinsic success factors are properties not inherent to the test environment, but still vital for a successfully implemented test environment (e.g., limited debug information exposed to the tester due to architectural restrictions in the system under test). The analysis of the interviews and focus groups showed that even though all five companies included in the study describe themselves as software companies, challenges in the test environment are to a great extent related to hardware systems. Even after migrating to cloud solutions, test environments are still dependent on specific hardware, in particular with regard to performance testing.
All four sub-groups at the cross-company workshop (reported in Section 6.4) came to the conclusion that successful implementations of test environments for large-scale software systems depend primarily on how they support the company's business strategy, test organization and product testability (the extrinsic success factors). Based on this, test environments can then be optimized to improve test environment capabilities, usability and stability (the intrinsic success factors). For example, test environment usability (an intrinsic success factor) cannot be optimized without understanding product testability (an extrinsic success factor) for the system under test. That is, an integrated system with a low degree of separation of concerns requires better support from the test environment to present status and visualize data during troubleshooting, support that would likely be excessive for a more federated system (i.e., money better spent otherwise).
Based on the discussions at the cross-company workshop (reported in Section 6.4), we find that test environments are a vital part of the continuous integration and delivery pipeline and must evolve in the same way as new test cases are continuously added to the pipeline. Aligning the test environments with the company's business strategy, test organization and product testability is not an isolated activity, for example, something handled only in the initial phase of a new project. Instead, designing and maintaining an effective test environment must be seen as a continuously ongoing task to keep up with the pace of change in the company, carefully monitoring that, for example, new functions or subsystems in the product are designed to enable product testability, and compensating for new limitations in testability with improvements in test environment capabilities, usability and stability. The responsibility for this continuously ongoing task could be placed on the development teams introducing new functions or subsystems or on an organization developing the test environment. A clear responsibility for this task seems to be crucial for successful implementations of efficient and effective test environments for large-scale software systems.

7 | THREATS TO VALIDITY

| Threats to construct validity
The fact that the starting point for the literature review was the search string ['test environment' AND 'systematic literature review'] instead of ['test environment'] could be seen as a threat to construct validity. However, the literature review was primarily used to motivate a continued research study and not used as the source for the intrinsic and extrinsic success factors. As the decision to not conduct a complete and exhaustive review is clearly described in Section 2.2, we consider this threat to be mitigated.
Other threats to construct validity are related to the interviews and focus groups described in Sections 4 and 5: it is plausible that a different set of questions and a different context for the interviews or the focus groups could lead to a different focus in the participants' responses. In order to handle these threats to construct validity, the interview guides were designed with open-ended questions, and the interviewees and the focus group participants were selected as good informants with appropriate roles in the companies (following the guidelines from Robson and McCartan [17]). The questions in the interview guide were reviewed by all three researchers to improve the reliability of the interview questions. IQ2 was purposely phrased to include both positive and negative aspects. A potential threat to validity is that responses to IQ4 by engineers not directly involved in constructing or maintaining test environments could be regarded as speculative. Such engineers had, however, substantial experience of using those test environments. Therefore, we consider their perspectives complementary, adding an important perspective to the study.
In this paper, we also present background material for both the interviewees and the studied companies in order to provide as much information as possible about the context and to enable reproducibility of the study, as well as secondary studies based on the results from this study. The literature review was not conducted as a systematic literature review (according to guidelines from, e.g., Kitchenham [42]) but was instead optimized to provide a better understanding of previously published literature related to the research question, informing the setup for the interviews and the focus groups. Due to this, and in particular the selection of Google Scholar as search engine over, for example, a Scopus search, the literature review is difficult to reproduce. As the success factors presented in Section 6 are based on the interviews and the focus groups (presented in Sections 4 and 5) and not on the results from the literature review, we do not consider this a threat to the reliability of the study.

| Threats to internal validity
Of the 12 threats to internal validity listed by Cook et al. [43], we consider Selection, Ambiguity about causal direction and Compensatory rivalry relevant to this work:
• Selection: All interviewees, focus group participants and participants in the cross-company workshop were purposively sampled in line with the guidelines for qualitative data appropriateness given by Robson and McCartan [17]. The selection of interviewees and focus group participants was informed by senior experts on testing and test environments in each company. Based on the rationale for these samplings, and supported by Robson and McCartan, who consider this type of sampling superior for this type of study in order to secure appropriateness, we consider this threat to be mitigated. As the primary interviews only included interviewees from one of the companies, this also implies a threat to validity related to selection. This threat was mitigated by the workshops with the other four companies, adding new perspectives from other industry segments.
• Ambiguity about causal direction: While we discuss relationships in some cases in this study, we are very careful about making statements regarding causation. Statements that include cause and effect are collected from the interviews and the focus groups and not introduced in the interpretation of the data. The conclusions on interdependent success factors (the best way to work with the factors) are based on the responses from the four independent groups at the cross-company workshop, as described in Section 6.4.
• Compensatory rivalry: When performing interviews and comparing scores or performance, the threat of compensatory rivalry must always be considered. The questions for the interviews and the focus groups (described in Sections 4 and 5) were deliberately designed to be value neutral for the participants, not judging the performance or skills of the interviewee or the interviewee's organization. Generally, the questions were also designed to be open-ended to avoid any type of bias and to ensure answers that were open and accurate. Moreover, our experience from previous work is that interviewees are more prone to self-criticism than to self-praise.

| Threats to external validity
The list of intrinsic and extrinsic success factors presented in Sections 6.1-6.3 was confirmed at a cross-company workshop with participants from the same companies as in the interviews and focus groups (as described in Sections 4 and 5). Due to this, it is conceivable that the findings from this study are only valid for these companies, or for companies that operate in the same industry segments and have similar characteristics (presented in Section 2). However, because of their diverse nature, the five companies included in the study represent a good cross-section of the industry (as described in Section 2). Based on analytical generalization [12,44], it is reasonable to expect that the identified success factors are also relevant to a large segment of the software industry. Nevertheless, we consider external validation in other companies (preferably in different industry segments) to be a natural suggestion for further work.
8 | CONCLUSION AND FURTHER WORK

| Conclusion
In this paper, we have presented insights from practitioners in five large-scale industrial companies, describing impediments and challenges related to test environments (presented in Sections 4 and 5). The study includes a series of interviews with 30 individuals, a series of focus groups with in total 31 individuals and a cross-company workshop with 30 participants representing the five studied companies, which develop large-scale and complex software systems. The analysis of the interviews and focus groups showed that even though all five companies included in the study describe themselves as software companies, challenges in the test environment are to a great extent related to hardware systems (further described in Section 6). Even after migrating to cloud solutions, test environments are still dependent on specific hardware, in particular with regard to performance testing. In other words, there is no such thing as pure software; there is always a hardware dependency, even if the levels of abstraction may be more or less deep. These findings suggest that a high level of abstraction from hardware can be a mixed blessing, as it affords less control and increases configuration complexity in testing.
Based on the interviews and the workshops with the five companies, we identified a set of intrinsic and extrinsic success factors for test environments (presented in Sections 6.1-6.3), enabling efficient and effective testing of large-scale software systems:
• Intrinsic success factors include characteristics and capabilities existing within a test environment:
- Test environment capabilities: The test environment's ability to test all functions or other aspects in the system under test.
- Test environment usability: The test environment's ease-of-use as perceived by the tester, including the time needed to accomplish a task.
- Test environment stability: The test environment's reliability and quality over time for the expected capabilities.
• Extrinsic success factors are properties not inherent to the test environment, but still vital for a successfully implemented test environment:
- Product testability: The degree to which the system under test supports testing in all relevant test contexts.
- Test organization: How the organization supports testing in the test environments.
- Business strategy: The actions and decisions a company plans to take to reach its goals and objectives that also affect the test environments.
The cross-company workshop (presented in Section 6.4) strengthened the reliability and generalizability of the findings in the study. We find that successful implementations of test environments for large-scale software systems depend primarily on how they support the company's business strategy, test organization and product testability (the extrinsic success factors). Based on this, test environments can then be optimized to improve test environment capabilities, usability and stability (the intrinsic success factors).
Test environments are a vital part of the continuous integration and delivery pipeline and must evolve in the same way as new test cases are continuously added to the pipeline (as described in Section 6.6). A clear responsibility for this continuously ongoing task (central or distributed) seems to be crucial for successful implementations of efficient and effective test environments for large-scale software systems.
The literature review (presented in Section 3) showed that 'test environment' appears as a term in thousands of academic publications, but problems or success factors are often shallowly described or just mentioned in passing. Academic publications and books address different types of challenges or problem areas but often leave out areas that other authors consider to be the core issues. In contrast to this, the intrinsic and extrinsic success factors are based on a holistic approach, providing a structure that can help companies improve their test environments to enable efficient and effective testing of large-scale software systems. As the five companies included in this study operate in different industry segments, it is reasonable to expect that the intrinsic and extrinsic success factors for test environments can be applied to a large segment of the software industry to optimize test environments in each individual case.

| Further work
In addition to the results presented in the analysis and the conclusions, we believe that this study also opens up several interesting areas of future work. As the study reported in this paper does not include any external validation, this comes as a natural suggestion for further work (as described in Section 7.3). The external validation should preferably be conducted in companies that operate in other industry segments than the companies in the primary study and could also include quantitative data to allow method and data triangulation.
The interviewees and focus group participants expressed many things they wanted to change in their company's test environments (as described in Sections 4 and 5). Further work could build on the results from this study and construct a method or a model that can help companies prioritize the most important improvement suggestions, based on cost and benefit. The method (or model) could then be validated in companies separate from the primary study. One suggestion is to base this method or model on established quality models, to rank and prioritize the most important qualities for evaluating test environments.
Further studies could then follow the improvement initiatives in the participating companies to analyze and compare different solutions and how they improve the ability of test environments to enable efficient and effective testing of large-scale and complex software systems.
The literature review (presented in Section 3) identified no systematic literature review focusing primarily on test environments. This exposes a need for a systematic literature review covering one or several aspects of test environments. In a similar way, Garousi and Mäntylä [11] describe a 'need for secondary studies in the areas of test-environment development and setup' (as described in Section 3.2). Such a study could also include a discussion of the identified papers and their applicability to different types of software development projects in industry.