What Support do Systematic Reviews Provide for Evidence-informed Teaching about Software Engineering Practice?

Background: The adoption of the evidence-based research paradigm by software engineering researchers has created a growing knowledge base provided by the outcomes from systematic reviews. 
 
Aim: We set out to identify and catalogue a sample of the knowledge provided by systematic reviews, to determine what support they can provide for an evidence-informed approach to teaching about software engineering practice. 
 
Method: We undertook a tertiary study (a mapping study of systematic reviews) covering the period to the end of 2015. We identified and catalogued those reviews that had findings or made recommendations that were considered relevant to teaching about industry practice. 
 
Results: We examined a sample of 276 systematic reviews, selecting 49 for which we could clearly identify practice-oriented findings and recommendations that were supported by the data analysis provided in the review. We have classified these against established software engineering education knowledge categories and discuss the extent and forms of knowledge provided for each category. 
 
Conclusion: While systematic reviews can provide knowledge that can inform teaching about practice, relatively few systematic reviews present the outcomes in a form suitable for this purpose. Using a suitable format for presenting a summary of outcomes could improve this. Additionally, the increasing number of published systematic reviews suggests that there is a need for greater coordination regarding the cataloguing of their findings and recommendations.


Introduction
Over the half-century since software engineering became recognised as a distinct sub-discipline of computing [64], a degree of consensus has emerged about what it encompasses [14], as well as about the skills and knowledge that are needed by software engineers. For the latter, the ACM and IEEE produced a set of curriculum guidelines in 2004 aimed at consolidating ideas about what a software engineer should acquire from an undergraduate education, and this was updated in 2015 after wide consultation across academia and industry [5].
However, although there is fairly general agreement about what a software engineer should know, much less attention has been given to how that knowledge might be obtained. Indeed, much of our knowledge is still based upon 'expert opinion', and although this is largely derived from experience, it lacks rigour as the foundation for what aspires to be an engineering discipline [49]. And, even when more systematically-acquired evidence is available, this does not necessarily mean that it will be readily accepted or adopted by practitioners [25,79].
This raises two related questions. The first concerns how rigorous knowledge about the effectiveness of software engineering procedures might be derived (that is, how can we identify what works or doesn't work, and under what conditions?). The second is, once we have such knowledge, how can it most usefully be employed in educating students?
In many disciplines, the major source of such knowledge is practice-related research, which is usually derived from 'field studies' of the effects that arise from the use of some intervention. (For software engineering, the interventions might be the introduction of innovative technologies or processes, such as the use of agile practices.) In software engineering research, there has been increasing use of empirical studies as a means of obtaining knowledge about software engineering practice. A comparison of the characteristics of papers submitted to, and accepted by, the ICSE conferences in 2002 and 2016 shows a significant increase in the reporting of empirical studies and the use of empirical models [87]. In particular, while no papers reporting empirical studies were accepted in 2002, this category made up 30% of the accepted papers in 2016.
Researchers have also adopted the evidence-based paradigm as a means of aggregating the knowledge available from a set of 'primary' studies that investigate a given topic, based upon the use of the systematic review as its main tool [51]. This in turn has helped to create a growing knowledge base of research findings about software engineering procedures that should potentially be able to inform teaching (and hence, implicitly, inform practitioners). In [51], the authors suggested that adopting evidence-based software engineering (EBSE) would potentially provide:
-"A common goal for individual researchers and research groups to ensure that their research is directed to the requirements of industry and other stakeholder groups."
-"A means by which industry practitioners can make rational decisions about technology adoption."
For the study reported here we consider teachers and students to be additional stakeholders. Teachers can be regarded as being direct beneficiaries, as such knowledge can lend appropriate authority to reinforce teaching about software engineering topics. We view students as being indirect stakeholders, largely benefiting through the material presented by their teachers, rather than through direct use of the findings from systematic reviews.
To set this paper into context, we explain here how it originated and how it relates to other analyses that we have published. As experienced teachers, we wondered whether knowledge derived from the use of EBSE might be used in support of our teaching about software development practices. We envisaged that this support could take a number of forms, but our main expectation was that systematic reviews might provide some authoritative support for the use of particular practices, or at least an indication of when these were likely to be effective (or otherwise). In addition, we expected that we might obtain some examples from experience about how or when to adopt new technologies.
In order to identify the extent and forms of knowledge about practice that was available, we originally undertook a study of a sample of the systematic reviews that were available up to the middle of 2011, selected on the basis that their topics related to practice, with our findings being reported in [20]. Although that study identified a set of potentially useful systematic reviews, in trying to use these to inform our teaching, we realised that they rarely presented their findings in a readily-usable form. So, beginning in 2016, we undertook a further study (reported here) that extended the earlier one in two ways. Firstly, we included systematic reviews published to the end of 2015, so including more reviews that were undertaken when their form had become more established. Secondly, we have performed a more comprehensive process of selection and analysis, requiring that a review should not only cover a topic relevant to practice, but also provide topic-related findings that were supported by its analysis of the available data.
While conducting this review, the problems encountered in identifying both relevant information about the processes followed in the reviews, as well as about their findings, led us to use our material to analyse and report on the ways that systematic reviews in software engineering were being reported [18] before writing a summary of our findings (this paper). Our aim was to persuade authors and reviewers of the urgent need to improve the quality of published reviews.
A separate question that arose was how extensively practitioners formed the participants in the primary studies used in our set of 49 systematic reviews, and to what extent those studies were conducted in an industry setting ('field studies'). Addressing this involved further data extraction, with the outcomes reported in [19]. We do discuss some of the findings from this analysis later, as they provide useful supplemental information about the context of the knowledge available from the systematic reviews.
We begin by examining the evidence-based paradigm and the way that this has been employed by other disciplines. We then examine how its use has been adopted in software engineering; identify the forms of knowledge systematic reviews can provide; describe the design and conduct of our own study; and report our findings. We also examine the ways that other disciplines have used such knowledge to inform their teaching, and what we might learn from their experiences.

The evidence-based paradigm and software engineering
Use of the evidence-based paradigm originated in what has become known as Evidence-Based Medicine (EBM), by which a medical practitioner can draw upon the findings and recommendations from systematic reviews to aid them in making decisions about how to treat individual patients. Some of its success, in terms of its widespread adoption, derives from the nature of the clinical studies used in the reviews. While these are human-centric, the participants are usually recipients of the treatment being studied, and so any variation in the outcomes is likely to occur mainly because of physiological differences among the participants. This, together with the extensive use of Randomised Controlled Trials (RCTs), which are field experiments with rigorous controls, allows the findings from a set of primary studies to be synthesised using statistical meta-analysis [35]. Use of such forms of analysis makes it possible to assign a high level of confidence to the outcomes.
The evidence-based paradigm has also been successfully adapted to the needs of other 'social' disciplines (in which humans interact with each other), including management, education and psychology as well as to more general social and health-related fields [10,73,13]. For these, other forms of synthesis that may be more appropriate to particular forms and mixes of primary studies have been developed. An overview of the forms that are potentially useful for software engineering is provided in [23], and in addition, a form of synthesis that can aggregate qualitative and quantitative evidence has been proposed for use in software engineering research [61].
While the term evidence-based software engineering (EBSE) is often used in analogy with evidence-based medicine (EBM), this can lead to inflated expectations. Rather than RCTs, empirical research in software engineering employs a mix of primary study forms that is actually more typical of the social sciences. In addition, the 'treatment' used in software engineering studies usually involves participants in actively performing creative tasks related to software development, rather than being passive recipients. Since these tasks are likely to differ in detail between studies, this makes it more difficult to synthesise the data using forms such as statistical meta-analysis. (Comparison with a number of other disciplines using systematic reviews suggests that the discipline most similar to software engineering is that of Nursing and Midwifery [17], which helps to highlight the 'social' nature of software engineering, where humans both interact with each other, and also with (or via) technology.)
For many disciplines, systematic reviews are apt to be commissioned by policy-makers and research agencies, and hence the topics studied are likely to be ones considered to be of strategic importance to that discipline and its practitioners. In addition, the task of searching for primary studies will often be performed by trained librarians [13]. In contrast, for systematic reviews on software engineering topics:
-coverage of key topics is uneven (see Appendix B) and the choice of topics appears to be almost entirely researcher-driven, with little to indicate that professional bodies, research agencies or industry have so far taken much interest in identifying suitable topics;
-the quality of reviews is apt to be uneven, particularly with regard to the rigour with which the primary studies are selected, categorised and synthesised [78];
-reporting is apt to be poorly structured and findings are not presented clearly [18];
-many studies use unnecessarily weak forms of synthesis [23].
Together, these influence the form and quality of the available knowledge.

The nature of Software Engineering Knowledge
In this section we consider what forms of knowledge useful for teaching about software engineering practice can be provided from systematic reviews.
3.1. The nature of the knowledge provided from systematic reviews
The knowledge provided from any systematic review can be expected to be organised around the research question that the review is seeking to answer, as well as around whether this question is concerned with issues related to research or to practice. Three important aspects of this knowledge are: the form in which the findings are presented; the strength of evidence supporting these findings; and how useful they are.
In terms of usefulness for teaching, our examination of the reviews we selected showed that systematic reviews commonly provide knowledge about practice in three different forms (and obviously, the findings of any review may consist of a mix of these).
The first form is knowledge that has been derived from the experiences of others, presented as lessons about particular software engineering activities. The investigation of the effects of user participation in software development reported in [1] offers a good example of a topic where presenting qualitative knowledge about the experiences of others may well be the most useful form of knowledge to provide. Pedagogically, this can be viewed as providing broader knowledge about software engineering activities than can usually be provided in the classroom, or through practical exercises.
A rather different way of presenting knowledge that has been derived from experience is to provide a list of factors that should be considered when undertaking some task or adopting a technique. A good example is provided by [84], where the authors identify the factors that can make for the effective adoption of global software development practices. This type of knowledge can provide more directly useable guidance, possibly in the form of checklists, and hence can usefully be used to supplement classroom teaching about a given topic.
The third way to present knowledge is largely concerned with providing guidance about choices between different techniques. Such knowledge is more quantitative in its nature, and may well involve ranking the different options in some way. A good example of using such a form is provided in the review by Dieste and Juristo [27] that assesses the effectiveness of different requirements elicitation techniques. From a pedagogical perspective, where the findings are organised in this form, they can be used to provide an authoritative basis for choosing to use particular practices.
The usefulness of any systematic review is also dependent upon the provenance for its findings-that is, how far we can be confident that the original primary studies are reliable and relevant. One reason for systematic reviewers to perform a quality check on the primary studies when performing a systematic review is to help make some assessment of their reliability, in order to inform the process of synthesis. If they conclude that a primary study was conducted well, this provides some reassurance that including its findings in a synthesis procedure will help with producing sound findings from that process. This in turn provides scope for assessing the strength of evidence supporting the findings from a review [28].
The issue of the relevance of the findings from the primary studies used in a review is more challenging. In [42] this is defined as the "potential impact the research has on both academia and industry", and the authors observe that the long maturation period for technology makes it "infeasible to use the actual uptake of research results by industry" as an evaluation tool. They propose an evaluation model for relevance that is based upon "potential for impact" and that uses four aspects: subjects; context; scale; and research method. Unfortunately, the reports of systematic reviews rarely provide much detail about these characteristics of the primary studies, and particularly about their context and scale, so we were not able to apply this model retrospectively in our analysis.
From the perspective of the teacher wishing to use the findings as supplemental material, the first aspect should require little more than explaining to students about the nature of a systematic review, in order for them to understand the nature of the evidence. An appreciation of the second (and to some extent the third) aspect may require a rather fuller explanation of the forms and limitations of empirical software engineering studies. However, since few software engineering systematic reviews provide any information about the strength of evidence, our own experiences suggest that this explanation need not be particularly detailed or extensive.

Categorising software engineering knowledge
Categorising and organising software engineering knowledge has been the goal of a number of initiatives. ACM and IEEE have jointly sponsored two that are relevant to this paper.
-The Software Engineering Body of Knowledge (SWEBoK) [14].
-The software engineering undergraduate curriculum guidelines (SE2014), and within this, the Software Engineering Education Knowledge (SEEK) categorisation of relevant knowledge [5].
The first of these is largely concerned with identifying the topics that collectively comprise the activities that make up software engineering practices, and where possible, identifying good sources of material related to these. So it can be considered to provide an expert interpretation of the nature of software engineering itself.
The second is concerned with identifying what an undergraduate studying software engineering should know, and hence the SEEK has been used in this paper to categorise the systematic reviews identified within the tertiary study. Even where a student is not studying software engineering as the major element of a degree, these are still topics that they need to be aware of, although perhaps in less detail than would be appropriate for a specialist course. As a framework, while the SEEK can appear to be organised around technology issues rather than 'social' issues, this impression is misleading. Table 1 lists the major Knowledge Areas (KAs) used to structure the 2014 version of the SEEK. Each Knowledge Area is organised as a set of Knowledge Units (KUs) and both in these and in the guidelines there is quite extensive emphasis upon the importance of more 'social' aspects of software engineering such as the human interactions that occur in agile development and groupwork. Also, as emphasised in the Curriculum Guidelines, the Knowledge Areas are not meant to be templates for modules.

Research Method
In order to answer the question posed in the title of this paper, we divided this into two separate, but linked, research questions, as follows.
RQ1: Which systematic reviews published up to the end of 2015 produced findings that were relevant to teaching about practice in software engineering?
RQ2: What guidance did each systematic review provide that could help a student (or practitioner) to understand how to make an effective choice or use of a technology or practice?
To answer RQ1, we conducted a systematic mapping study of published systematic reviews (a tertiary study). We then used the Knowledge Areas from the SEEK to categorise those that were selected as being relevant. To answer RQ2, we analysed the outcomes from each of the systematic reviews that we included, in order to identify relevant findings and explicit recommendations. As a point of clarification regarding RQ2, we did expect that for students, the process of understanding this guidance would usually be mediated by a teacher. Indeed, for both students and practitioners, we expected that the findings of a review would mainly provide 'help' by identifying those circumstances where a technique or practice might be most effectively employed (or where it would be inappropriate to employ it).
In the rest of this section, we explain our choices for the procedures required to answer these two questions, and then the following section describes how these procedures were implemented.

Scope of the study
For a systematic review, the aim should be to find all of the primary studies that can provide findings relevant to the topic of the review, in order to avoid bias. Because a mapping study has the purpose of creating a 'map' of the knowledge available about a topic, rather than synthesising its inputs, it does not usually need to be quite as comprehensive. Our aim, as posed in the title and RQ1, is concerned with establishing whether teaching about practice could be supported by the findings of systematic reviews. We therefore considered that our question could be answered from a suitably large sample of systematic reviews.
We also restricted the scope of our study to those systematic reviews for which the findings were published in journals. The page constraints of conference proceedings often mean that reports of systematic reviews have to omit important details. Additionally, while many systematic reviews are first reported in conference proceedings, it is quite common for a later and fuller version to also be published as a journal paper. Since we were concerned with finding those systematic reviews that were reported in sufficient detail to be of use in making decisions and choices, we felt that it was appropriate to constrain our study to reviews published in journals. It was also considered that this would make our final selection more readily accessible for teachers, students and practitioners.
For the period to the end of 2009, we selected the journal papers from three existing 'broad' tertiary studies to form our set of candidate systematic reviews [48,52,24]. These studies used a mix of manual and electronic searching to achieve a comprehensive degree of coverage for that period. As no equivalent sources were available for the period January 2010 to end 2015 and the number of published systematic reviews was rapidly increasing, we searched five major software engineering journals for those systematic reviews published in this later period. These were IEEE Transactions on Software Engineering, Empirical Software Engineering, Information & Software Technology, Journal of Systems & Software, and Software Practice & Experience.
Our choice of these journals was made on the basis that these were major publishers of systematic reviews addressing software engineering practices. One of the journals (Information & Software Technology) also had a special section for systematic reviews.

The inclusion/exclusion criteria
We required that any reviews included in our study should address a topic relevant to practice (rather than research) and that these topics should also be relevant to 'introductory'-as opposed to 'advanced postgraduate'-teaching. (Since our model for this was based on the SEEK, it could be considered to cover anything that would be material for an undergraduate degree programme.) In addition the review needed to provide some knowledge about practice that was explicitly supported by a synthesis of the findings of the primary studies. The resulting inclusion/exclusion criteria for the study are described in Table 2.
Table 2. The inclusion and exclusion criteria adopted for this study
Inc-1 The paper is published in a journal and either included in the three broad tertiary studies or in one of the five journals (depending on publication date).
Inc-2 The topic of the paper is related to practice and is considered appropriate for use with introductory teaching of SE, as defined by the SEEK.
Inc-3 The paper contains findings and/or recommendations that are explicitly supported by the reviewers' analysis.
Excl-1 Systematic reviews addressing research trends.
Excl-2 Systematic reviews addressing issues related to research methodology.
Excl-3 Mapping studies with no synthesis of data.
Excl-4 Systematic reviews that address topics not considered to be relevant for introductory teaching of SE.
To be included, a systematic review needed to meet all of the inclusion criteria, while it could be excluded if it met any of the exclusion criteria. Using the SEEK gave us a reasonably clear measure of the set of topics that we considered appropriate for answering our research question. In particular, even where a review might meet all of the inclusion criteria, if we considered its topic as inappropriate for an introductory course, it could still be excluded (Excl-4). Typically such reviews were on relatively advanced topics that combined different aspects of software engineering, such as "security in process-aware information systems" [56].
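The selection rule just described can be sketched as a simple predicate. This is a minimal illustration only: the criterion labels mirror Table 2, but the example paper's ratings are hypothetical.

```python
# A sketch of the selection rule described above: a review is retained
# only if it meets ALL inclusion criteria and NO exclusion criterion
# applies. Criterion labels mirror Table 2; the ratings are invented.
def select(inclusion, exclusion):
    return all(inclusion.values()) and not any(exclusion.values())

paper = {
    "inc": {"Inc-1": True, "Inc-2": True, "Inc-3": True},
    "exc": {"Excl-1": False, "Excl-2": False, "Excl-3": False, "Excl-4": True},
}
print(select(paper["inc"], paper["exc"]))  # False: rejected via Excl-4
```

The example paper meets all three inclusion criteria but is still excluded because it triggers Excl-4, matching the behaviour described above for advanced-topic reviews.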

Searching for systematic reviews
Our decisions about scope, as described above, meant that the searching process was relatively straightforward. The set of 120 papers from the three broad tertiary studies were listed in the reports, and for the journals we employed a manual search of index sections. We complemented the manual search by using an electronic search to check for any systematic reviews that might have been missed (not all systematic reviews have titles that explicitly identify them as being reviews).

Quality Assessment
Quality assessment of systematic reviews is commonly performed by using the DARE criteria (from the Database of Abstracts of Reviews of Effects) that were originally devised for use in clinical medicine. In its current form, the DARE assessment is based upon the following five questions.
1. Are the review's inclusion and exclusion criteria described and appropriate?
2. Is the literature search likely to have covered all relevant studies?
3. Did the reviewers assess the quality/validity of the included studies?
4. Were basic data/studies adequately described?
5. Were the included studies synthesised?
For this study we adopted the use of DARE as a means of providing an assessment of how thoroughly each systematic review had been performed, and hence some indication of how reliable the findings from it might be. (This was also employed in the three broad tertiary studies.) In doing so we also adopted the widely-used convention of scoring each question using a three-point scale: yes (1); partly (0.5); no (0), with the maximum score then being 5.0. Table 3 explains how the scoring was interpreted for the DARE criteria in the case of this study.
For each of the DARE questions, a score of 'no' was awarded where there was an absence of information (apart from 'search coverage' where we had defined a lower bound). Likewise, a 'yes' indicated that the description or related operations for that criterion exceeded some threshold. A score of 'partly' indicated that, while something was provided, it might only be for some of the primary studies (say), or that it was provided in some aggregate form. Hence a rating of 'partly' could be interpreted as 'present but incomplete'.
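The scoring convention described above is simple enough to express in a few lines. The following is an illustrative sketch; the example ratings are hypothetical, not taken from any of the reviews in our set.

```python
# Sketch of the DARE scoring convention: each of the five questions is
# rated yes/partly/no, and the ratings are summed to give a score out
# of 5.0. The example ratings below are hypothetical.
SCORES = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def dare_score(ratings):
    assert len(ratings) == 5, "DARE uses exactly five questions"
    return sum(SCORES[r] for r in ratings)

# e.g. clear criteria, partial search coverage, no quality assessment,
# adequately described studies, partial synthesis:
print(dare_score(["yes", "partly", "no", "yes", "partly"]))  # 3.0
```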
We should also note that DARE is only concerned with the systematic review process and whether these activities have been performed, rather than how well they have been done. However, until we have better reporting of systematic reviews performed in software engineering, it does not seem practical to employ some of the other forms of assessment discussed in [28] and [53].

Data extraction
Our inclusion criteria, as summarised in Table 2, required that we should be able to identify findings and recommendations for any systematic review that was to be included. This stemmed from a concern that, to be of use, a study had to present results that end-users could readily employ.
(A fragment of Table 3, giving the scoring interpretation for the synthesis question, appeared here: yes - the authors have performed a meta-analysis or used another form of synthesis for all the data of the study; partly - synthesis has been performed for some of the data from some of the primary studies; no - no explicit synthesis has been performed, as in a mapping study.)
For the purpose of data extraction, we used the following descriptions.
-A finding provides knowledge about the topic that an end-user might find useful in order to gain understanding of the topic. However, it is not of such a nature, or accompanied by such a degree of confidence, as to be able to act as the source of explicit advice about good or undesirable practice related to that topic.
-A recommendation provides an operationalisation of a finding that provides deeper understanding and that can be taken into consideration when making decisions about practice. So, where possible, a recommendation should be accompanied by some measure of its strength, derived from the evidence available from the systematic review.
Because the presence of these could only be determined with certainty at the stage of data extraction, we accepted that some decisions about exclusion would occur during data extraction. The data extracted from each study is itemised in Table 4.
Table 4. Core data extraction from systematic reviews
1. Bibliographic information (title, authors, publication details).
2. Our scores for the DARE criteria (as interpreted in Table 3).
3. Data about any quality assessment performed in the systematic review for the primary studies, including details about any checklist used for this.
4. Details of how the quality scores from Item 3 were actually used in the systematic review.
5. The size and nature of the body of evidence used in the review (numbers and types of primary study).
6. The context relating to the body of evidence: details of participant types, period covered by the searching, search engines used, details of any manual searches, use of snowballing, number of studies retained at each stage of inclusion/exclusion.
7. Any findings that are reported, or that could be derived from the later sections of the paper.
8. Any recommendations reported, or that could be derived.

Conduct of the Study
The study was conducted according to the plan, and this section provides some details about the procedures followed as well as the outcomes.
Because there was some overlap between the set of systematic reviews selected for our earlier study [20] and the set used here, for brevity when comparing the two studies we refer to that earlier study as EPTS1 (Education and Practice Tertiary Study 1), and to this study as EPTS2.

Study Identification
As the set of papers found by the three broad tertiary studies was already determined, searching was only necessary for the papers from the five journals published in the period 2010-2015. The manual search process was conducted by one of the authors (DB) and involved reading through the contents pages of the five journals examining titles of papers, and where necessary, also inspecting the abstracts.
To complement the manual search, an electronic search was also performed by an independent researcher. This was undertaken in two stages. In the first of these, covering the period 2010-2014, the Scopus digital library was used to perform a forward citation analysis of six papers that discussed the principles of EBSE and systematic reviews (listed in Table 5). This was performed in April 2016. The papers identified as being systematic reviews or mapping studies and published in the five journals were compared with the papers that had been found by manual search. However, this identified a large number of false positives, and for papers published in 2015, this problem became much greater. So for the second stage, covering 2015, Scopus was searched using the terms: TITLE-ABS-KEY ("systematic literature review" OR "systematic review" OR "systematic mapping study" OR "mapping study") AND DOCTYPE (ar OR re) AND PUBYEAR = 2015 AND (LIMIT TO (SUBJAREA, "COMP")). The results from this were then filtered to select studies from each of the five journals, and the papers that were identified as being mapping studies and systematic reviews were compared with those found by the manual search. This second search took place in May 2016.
The manual search identified 140 papers and the electronic search added a further 16, giving a total of 156 systematic reviews from searching the journals. All studies were allocated an index number, those from the broad tertiary studies being numbered #1-120, and those from the journals #121-276.
Our sources are described in Table 6. For ease of reference, we have labelled these as Source-set1 and Source-set2. We have also indicated the number of papers obtained from each of these sources.

The Inclusion-Exclusion process
The process of inclusion/exclusion was performed in two stages. This was because the relevance of the topic could be fairly easily determined from the title and abstract, whereas determining the availability of appropriate findings and recommendations (Inc-3) required reading the complete paper.
In the first stage, two criteria were applied: whether a study was a systematic review published in a journal (Inc-1), and whether it addressed a potentially relevant topic (Inc-2). The studies that had earlier been included in EPTS1, published in the period up to mid-2011 and described in [20], had already been identified as meeting the second criterion, and so the only action required was to remove those published in conferences. Hence a full selection process was only performed for the studies with index values #146-276, which were those published from mid-2011 onwards and hence had not been used in EPTS1. This was performed by all four authors, working in different pairings that were allocated on a random basis. The only exceptions were the papers for which two of us (DB and PB) were authors, which were assessed by the other two reviewers. If the reviewers were unable to agree on exclusion of a paper, it was retained for the second stage. Using Fleiss' Kappa [31] to assess the level of rater agreement for this first stage, as we were using multiple raters, produced a score of 0.490, which indicates moderate agreement, falling into the band of values usually considered acceptable ("fair to good") [8].
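As an aside, Fleiss' Kappa generalises pairwise agreement measures to settings like ours, where each paper is rated by a fixed number of (possibly different) raters, comparing the observed per-paper agreement with the agreement expected by chance from the category marginals. The following is a minimal sketch of the computation; the ratings matrix is purely illustrative, not our study data.

```python
# Minimal sketch of Fleiss' kappa for include/exclude decisions.
# Each row is one paper; columns are the categories (include, exclude);
# a cell holds how many of that paper's raters chose that category.

def fleiss_kappa(counts):
    """counts[i][j]: number of raters placing paper i in category j.
    Every paper must be rated by the same number of raters."""
    n_papers = len(counts)
    n_raters = sum(counts[0])
    # Mean observed agreement across papers.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_papers
    # Chance agreement from the marginal category proportions.
    n_cats = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_e = sum((t / (n_papers * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative data: five papers, two raters each.
ratings = [[2, 0], [0, 2], [1, 1], [2, 0], [0, 2]]
print(round(fleiss_kappa(ratings), 3))  # → 0.6
```

A value of 1.0 indicates perfect agreement and 0.0 agreement no better than chance, so the study's 0.490 sits in the moderate band.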
The second stage was combined with the process of data extraction, which was based upon the data extraction model described in Table 4. This was applied to all of the reviews identified from the first stage, and all data extractions were performed by two members of the team, working independently, who then resolved any differences to produce an agreed dataset for a review.
Studies were only retained at this stage if we could identify clear findings and/or recommendations that could be linked to the data extracted as part of the systematic review (criterion Inc-3). Some papers originally included in EPTS1 were excluded in this stage, on the basis that they had inadequate findings or recommendations. Figure 1 provides an overview of the overall process and resulting numbers. We have referred to the reviews selected from Source-set1 as Dataset1, and those from Source-set2 as Dataset2.

Because identification of the findings and recommendations from the systematic reviews was often a complex process (elements of these were apt to be spread around the final sections of a paper), we performed a further check upon the reliability of our interpretation. For each systematic review we tried to contact the designated corresponding author by e-mail and asked them to comment on our interpretation of the outcomes. Where this was no longer a valid address, we then tried contacting any of the authors for whom we could find a suitable e-mail address. In only two cases were we unable to trace any of the authors. We received 27 responses, all of which were generally in agreement with our interpretation, with 16 of them suggesting minor changes of wording.
The final set used in this study (EPTS2) contained 49 reports of systematic reviews, listed in Appendix A. Because the data for one systematic review was used for two analyses (#54 and #118), both of which met the inclusion criteria, there were actually 48 sets of primary studies.

Quality Assessment
An assessment against the DARE criteria, using the interpretations provided in Table 3, was performed as part of the process of data extraction, using the same randomly allocated pairings of reviewers.

Categorisation against the SEEK
To categorise the systematic reviews against the SEEK, two of the reviewers (DB and PB) performed another analysis of the reviews after all of the data extractions had been completed. Our rationale for doing this as a separate analysis, rather than as part of data extraction, was largely to ensure greater consistency of interpretation, which we considered would be more easily achieved if the whole set of studies was categorised in a single process.
For each study we determined both the most appropriate Knowledge Area (summarised in Table 1) and what we considered to be a suitable assignment to the more detailed Knowledge Unit within it. While for some studies the most appropriate KA and KU values were relatively obvious, many required quite extensive discussion to determine an appropriate allocation since, inevitably, the topic of a systematic review and its findings may well span more than one KA or KU. Indeed, the nature of the findings may be more important than the topic of the review in determining how it should most appropriately be categorised.

Further Data Extraction
As noted in Section 1, we have performed further analyses of the 49 systematic reviews. These are reported in [18] and [19] and involved some additional data extractions. These were performed by two of us (DB and PB) and are summarised in Table 7. Some of this supplemental information is included in Appendix B. We should note that for both of these, the process of study selection was as reported here. So, while they investigated further questions about reporting and provenance, their analyses were limited to providing answers related to studies about software engineering practice.

Table 7. Additional data extraction from systematic reviews

9. Whether and how any quality scores derived for the primary studies were used (if at all).
10. The form(s) of synthesis used in the study, and whether these classifications were made by the authors of the systematic review or by us. Categories used were: meta-analysis, narrative synthesis, meta-ethnography, grounded theory, cross-case analysis, thematic analysis, vote counting, and 'other'. The definitions of these were taken from [23].
11. The forms of primary studies used; where the primary studies were performed; who conducted them; and who formed the participants (students or industry practitioners) or what sources of data were used (industry or artificial).

The findings-What knowledge is available?
To present the outcomes from the process described in the previous sections, we begin by providing an overview of all of the studies. We also look at some of the supplementary information about these, with particular regard to such aspects as provenance. We then look at the studies in more detail and, in particular, present the findings and recommendations that were extracted for each one. These are grouped under the different SEEK headings, enabling us also to comment on the extent of the available knowledge for each heading.

Summary of the Systematic Reviews
As the total number of studies is quite large, we have presented the summary of the findings for the two datasets separately in Tables 8 and 9. This is largely a convenience for presentation, although it also helps distinguish the reviews that were undertaken when the practices for systematic reviews in software engineering were less well established. Both are described using the same format. Each entry is described in terms of its index number (#1-#276) as used in this study, the period covered by the search in the systematic review, its topic, and reference. The tables also provide some of the key information about each review: the SEEK Knowledge Area (KA) it has been assigned to (the keys we use were provided in Table 1); the DARE score we derived for the study; the number of primary studies that we could identify as being either explicitly or implicitly conducted in an industry setting or making use of industry participants; and the total number of primary studies.

The issue of the provenance of the primary studies is discussed in more detail in [19], where we have categorised the context for the primary studies used in each systematic review as far as we were able, based upon the available information. Two key points from this are worth repeating here. The first is that for those systematic reviews where we could not determine whether some of the primary studies were explicitly or implicitly conducted in an industrial setting, it is highly probable that many of these were actually industry-related, but the lack of detail meant that we simply could not tell. The second point is that what was considered to be an acceptable primary study, in terms of the inclusion/exclusion criteria used in the review and the way that these were interpreted, did vary quite considerably. Some reviews included a number of non-empirical reports among the primary studies, as well as papers that were classified as 'opinion', 'experience' and even 'theory'.
So while other characteristics such as DARE scores might usefully be compared across a set of systematic reviews, it is definitely not appropriate to make comparisons between the numbers for each type of primary study as reported by different systematic reviews.
The answer to RQ1 (which systematic reviews produced findings relevant to teaching about practice?) is provided by the entries in Tables 8 and 9. Overall, as indicated, we were able to identify 49 systematic reviews (from 276) that contained findings considered to be of use in teaching about software engineering. In the tables, the systematic reviews have been grouped under the SEEK Knowledge Areas, which also highlights the uneven distribution of reviews across the KAs. Table 10 gives the counts of the reviews categorised under each KA. The large proportion categorised as PRO arises in part because much of what we do in software engineering involves processes. Many of the systematic reviews can be described as investigating 'best practice', where this may relate to testing, design etc., and these ended up being categorised as PRO wherever we concluded that the emphasis was more upon practice than upon the technology involved.

The answer to RQ2 (what guidance did each systematic review provide?) is contained in the fuller descriptions of the findings and recommendations, together with their context, provided in Appendix B. As discussed earlier, we observed that systematic reviews provide guidance in a number of forms, largely depending upon the research question being addressed by the review. In Table 11 we identify those reviews providing each of the three types of guidance (experience, lists of factors, and comparisons). Inevitably, the findings of reviews do not always fall exactly into one of these categories, and so we have only included the 35 systematic reviews where we collectively felt that the findings mainly fitted one category. What Table 11 does show, though, is that few reviews provided comparative findings.
VAV was the only KA for which more than one systematic review (3 from 5) provided comparative findings, largely because these were comparing testing practices that produced deterministic outcomes concerned with whether or not a test was successful. (Many aspects of software engineering, such as analysis and design, address 'ill-structured' problems [83], and so rarely provide true-false results when comparisons are being made.)

In this subsection we present the material that helps answer RQ2 (guidance "that could help a student (or practitioner) to understand how to make an effective choice or use of a technology or practice"). For each review, we excluded any findings that were related to research issues or future developments. (Almost every systematic review identifies a need for more and better primary studies.) Where possible, we have taken the wording for the findings and recommendations directly from the systematic reviews. One consequence of this is that the findings from different systematic reviews are apt to be formulated at different levels of granularity. However, given the heterogeneity of the reviews, we considered it impractical to present the findings in a uniform manner.
In reporting the findings, we have also provided information about their provenance, wherever this was available. This information is provided to aid the reader to make some assessment of the confidence that they might choose to place in the findings. However, as noted earlier, the variation in different reviews between the way that primary study types were interpreted, as well as in the inclusion/exclusion criteria used, means that this information should only be treated as indicative.
The details for each systematic review are provided in Appendix B. For each review we provide the following information (where available):

1. The main SEEK knowledge area and knowledge unit identified as appropriate.
2. The title of the systematic review.
3. Citation details.
4. The DARE score, reported on a scale of 0-5.
5. Any information available that might provide an assessment of the strength of evidence for the findings. Where possible, we report this for each finding.
6. The number of primary studies. Where possible, we included the following additional information:
- The count of primary studies that we could explicitly identify as being conducted in an industry setting.
- The count of primary studies that were implicitly conducted in an industry setting, based upon comments in the text.
- The count of primary studies conducted in an 'academic' setting (such as experiments that used student participants).
7. The form(s) of synthesis used in the study, noting that some did use more than one form to answer different research questions. (We did not attempt to classify the forms of synthesis used in the earlier studies (Dataset1).)
8. The findings from the study.
9. The recommendations from the study.
10. Information about any response from the authors to our request for them to check the accuracy of our extraction of the findings and recommendations.

We have grouped the reviews according to their assignment to SEEK Knowledge Areas. For each review, we suggest the most relevant Knowledge Area and Knowledge Unit, accepting that many reviews do not fit neatly into the SEEK model. We have also noted where there are Knowledge Units (other than those dealing with issues such as 'concepts') for which there are no systematic reviews, in order to help illustrate the overall degree of coverage.

Findings-Fundamentals (FND)
Perhaps not surprisingly, there is only one systematic review categorised under this heading. The key details for this are provided in Table 13. The reason for including this review under FND was that we felt it best fitted the Knowledge Unit engineering economics for software. (This was the only heading for this KA that did not address 'foundations'.) In this review, the conclusions about the reasons for adopting SPI (Software Process Improvement) largely reinforce the claims made in the literature.

Findings-Professional Practice (PRF)
We classified four systematic reviews under this heading, described in Tables 14, 15, 16 and 17. Two of these (#54 and #118) used the same dataset, but performed quite different analyses of the material. We were also unable to determine a specific Knowledge Unit for those two analyses, due to the wide span of issues that they address. There were no systematic reviews directly addressing the KUs communications skills or professionalism, although some other systematic reviews did indirectly address issues related to team and group communication.
The first two reviews (which share a dataset) address issues around what motivates software engineers and provide details of factors considered relevant. Study #135 is also related to staff (de)motivation, providing a set of related recommendations. The remaining study addresses the role that group dynamics plays when participating in open source development.

Findings-Software Verification and Validation (VAV)
There were seven reviews included under this heading. These are summarised in Tables 18-24. These reviews provide a set of findings that span three of the four Knowledge Units making up the VAV Knowledge Area. We have no reviews for one KU, problem analysis and reporting.
These reviews span a range of issues. Most are concerned with techniques for selecting or evaluating tests (such as those used for regression testing) and provide rankings of different approaches that are likely to be directly applicable to practice.

Findings-Software Design (DES)
Software engineering can be considered as very much a 'design' discipline, with 'design thinking' permeating many activities, including of course, software design. However, the creative element involved in designing also means that this Knowledge Area forms a significant challenge for empirical studies. There are only three systematic reviews in this group, summarised in Tables 25-27, although they do address three separate Knowledge Units. KUs with no contributions are design concepts, human-computer interaction design and design evaluation.
None of the reviews offer very strong conclusions, and in the only one that offered more specific guidance about design choices (#130), it was noted that these were based upon a low strength of evidence.

Findings-Modelling & Analysis (MAA)
The four systematic reviews under this heading are all classified as belonging to the same Knowledge Unit (types of models). Given that the other two KUs address foundations and fundamentals, this is perhaps not surprising. Tables 28-31 provide a summary of these reviews.
The reviews span a range of issues including model reliability (#126) and observations about fault prediction (#155). Collectively they do provide helpful guidance about some specific models that are used by software engineers.

Findings-Requirements Analysis & Specification (REQ)
The three systematic reviews addressing requirements, described in Tables 32-34, cover two of the four Knowledge Units making up the REQ Knowledge Area. In particular, we have no reviews that address the KU requirements validation.
All three reviews provide useful insight into the approaches used in requirements engineering; review #134 in particular provides a useful rating of different elicitation techniques.

Findings-Software Quality (QUA)
There is only one systematic review categorised under this heading. The key details for this are provided in Table 35. This was categorised against the KU product assurance, and there were no reviews covering the KU process assurance. The review does offer useful insight into the relative merits of a range of object-oriented measures.

Findings-Software Process (PRO)
By far the largest set of reviews fall into this Knowledge Area (which does form something of a 'catch-all'). We have grouped these by Knowledge Unit (KU), although we should note that only three of the five knowledge units were covered. There were no reviews classified as configuration management (PRO.cm) or evolution processes and activities (PRO.evo). Tables 36 onward provide the details for this set of reviews.
With so many reviews being classified as belonging to this KA, it is difficult to provide a concise and general summary of what is useful for practice and teaching. Many of the reviews classified against project planning and tracking offer quite specific and detailed advice that is highly relevant to both teaching and practice. In contrast, for process concepts many of the findings tend to be more in the nature of observations, with process implementation coming somewhere between these. Overall though, this set of reviews does provide a fertile source of experience for others to draw upon.

Discussion
We first consider what the outcomes from our study tell us about the knowledge available from this set of systematic reviews, and what the limitations on this knowledge are. We go on to consider how this knowledge might be used to inform teaching (and hence in the longer term, practice) by looking at how such knowledge is used in other disciplines, and hence what lessons might be learned. We then consider the threats to validity for this study, since we need to determine how trustworthy our findings are, both in terms of the selection of the systematic reviews, and also the findings and recommendations that we extracted from them. Finally, we consider how such knowledge can be gathered more effectively and completely in the future, and in particular, how it might be possible to avoid having to do this retrospectively (and laboriously), as in this tertiary study.

How good is the knowledge available from systematic reviews?
In answering RQ1, we can identify 49 (out of 276) systematic reviews that provide knowledge about software engineering practice and hence might be used to support teaching about software engineering. The systematic reviews that we identified also span a range of topics when matched against the SEEK, although they are not evenly distributed between the Knowledge Areas. The extent, quality, and form of the knowledge is also unevenly distributed, with some reviews offering findings that give quite useful information about practice, while others are rather less specific.
In addition, few reviews provided any indication of the strength of evidence available to support their findings from the primary studies. Examination of the 49 reviews shows that only two of them (#130 and #239) made use of the GRADE approach to assess the strength of evidence for their findings [36], as recommended in [28]. A few of the others (#008, #022, #039, #197, #215) did also make assessments through unspecified means. Where provided, such assessments tend to indicate a strength of evidence for recommendations as being 'low' (#130 and #239) or 'modest' (#008). However, as noted in the revised guidelines on conducting systematic reviews in software engineering [53], empirical software engineers "must often make do with much weaker forms of study" (than those working in other disciplines).
It is also worth noting that some of the more qualitative reviews, such as those identifying "factors relevant to the adoption of X", are unsuited to the use of an approach such as GRADE. A number of these did provide tables that listed and enumerated the primary studies that identified a particular factor as being significant, with examples of this occurring in #54, #161 and #205.
To address this question further, we identified the set of systematic reviews that we consider offer both useful and usable guidance about practice. To select these, each author was asked to rate each review, using the information presented in Appendix B, assigning one of the following values to it:
- 'y' if the review was one that could be readily used as an example when teaching;
- 'p' for reviews that might be used;
- 'x' if the review should not be used as an example.
In performing the rating, each author was asked to consider the following three factors.

1. The usefulness of the review: whether its outcomes relate to a reasonably 'mainstream' topic that might be included in an introductory course on software engineering.
2. The usability of the review's findings: whether these can provide some element of guidance about what a software engineer might be advised to do in practice.
3. The quality of the review, largely based upon the DARE score. It was suggested that a score of ≥ 3.5 would be acceptable, while also bearing in mind that earlier reviews often had less conventional reporting structures.

Each review was considered on its own merit, and there was no constraint upon how many reviews could be given a particular rating. (The number of 'y' ratings employed ranged between 10 and 17.) Since our teaching experiences stemmed from teaching different courses and we had different interests within software engineering, we did not expect to obtain close agreement from this process. A 'score' for each review was therefore computed by assigning a value of 1.0 to each 'y', 0.5 to each 'p', and 0.0 to each 'x', and then summing the four values. Table 12 shows the index values for the top-scoring reviews that emerged from this process. We have also indicated the type of knowledge provided by these reviews, where a single value was available, and provided a summary of their findings together with a reference to the table in the appendix where further details can be found. While not too much weight should be placed upon this relatively informal exercise, it is interesting to note the predominance of reviews categorised as VAV and PRO, as well as of reviews with more 'structured' findings in the form of lists or comparisons.
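The scoring step is simple arithmetic and can be sketched as follows; the example ratings and the helper name `review_score` are hypothetical, not taken from the study's data.

```python
# Sketch of the review-usefulness score: four authors each rate a review
# 'y', 'p' or 'x'; the ratings are weighted 1.0, 0.5 and 0.0 and summed.
# The example ratings below are hypothetical, not the study's actual data.

WEIGHTS = {'y': 1.0, 'p': 0.5, 'x': 0.0}

def review_score(ratings):
    """ratings: one character per author, e.g. 'yypx' for four authors."""
    return sum(WEIGHTS[r] for r in ratings)

print(review_score('yyyy'))  # → 4.0 (maximum possible score)
print(review_score('ypxp'))  # → 2.0
```

A review rated 'y' by every author thus scores 4.0, which is why the top-scoring entries in Table 12 cluster toward that upper bound.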
It does also indicate that, while all 49 reviews were considered relevant enough to be included in the tertiary study, few of them achieved this quite basic quality threshold for the three criteria, with 'good' studies available for only a few Knowledge Areas.
So, to answer the question posed in the heading for the sub-section (and RQ2), we can conclude that while suitable evidence-based material is becoming available for use by teachers, only a rather disappointingly small proportion of systematic reviews appear to have findings that can readily be used.
However, there is one quite important caveat that should be mentioned here. In the above exercise we only considered direct use of this material in teaching on introductory courses, enhancing what is covered in the textbooks. There are however other ways of using this knowledge, such as in course design (for example, using the findings on the unsuitability of design patterns for use by novices to determine how this topic would be covered in a course). There is also scope to use the findings differently on more advanced courses, including postgraduate ones, or with individual student projects. All of the 49 reviews are viewed as having findings that are potentially useful, but these may need to be used in different ways. We address this issue further in the next sub-section.

Using the findings to support teaching and practice
Having identified a set of systematic reviews that contain knowledge useful for teaching and practice, we face the question of how to use this material. To help answer this we looked at how other disciplines make use of such material. Studies in education and healthcare have investigated how students and practitioners understand and engage with the findings from empirical research. This is relevant for software engineering, since using the material from this study would require familiarity with evaluation practices and empirical studies.
In education, a rapid evidence review investigated what is known about effective approaches to school and teacher engagement with evidence [65]. The report points out that knowledge mobilisation, the process of making research findings more accessible and usable, requires a supportive infrastructure, including collaborations between researchers and teaching professionals, intermediaries to translate evidence into tools, and professional bodies that provide leadership on the use of evidence in education. The review also suggests that evidence needs to be contextualised and presented in clear and structured summaries of effective approaches. This point is also made by Goldacre, who emphasises the need for better support for the dissemination of research findings, as well as by others in relation to evidence-based healthcare, where structured abstracts and plain language summaries are advocated [40,80].
As well as learning the skills necessary to acquire, appraise and apply evidence, students and trainees can also benefit from acquiring an awareness of ways to use this knowledge to bring about change at the organisational level [60]. A fuller discussion of this is beyond the scope of this paper; however, desirable skills might include being able to identify where changes to guidelines or to established practice are needed, and where change would be worthwhile.
The importance of leadership in enhancing engagement with, and use of, research findings is also a key message from a recent study on evidence-informed teaching practice, published by the UK's Department of Education [22].
Viewed overall though, there seems to be little guidance available on how to provide advice for teachers about using empirical material such as the knowledge-set from these systematic reviews to support the way that software engineering is taught. Clearly, as such knowledge accumulates, this will present an increasingly important pedagogical research question to be pursued.

Limitations of this study
We can identify a number of limitations upon the outcomes from our tertiary study that stem from the way that we performed the various elements of the study. We discuss these here, together with any factors that may help to alleviate their effects. 1. One limitation is the way that we selected the secondary studies (Dataset1 + Dataset2). Since we were performing a mapping study, we did not attempt to find all of the systematic reviews that were published during the period covered by our tertiary study, and confined ourselves to those reviews identified in the three broad tertiary studies and then the five software engineering journals, while explicitly excluding any studies published as conference papers. We did however conduct a broad electronic search as a check that we were not missing any significant source of systematic reviews, and we should observe that eight of the 11 reviews included in Dataset1 were published in the five journals that we used in the later part of the search.
1. Since we were investigating the use of systematic reviews in teaching, there was the possibility that relevant reviews could be found in educational journals related to software topics. A check of the papers published in ACM Transactions on Computing Education (TOCE) and IEEE Transactions on Education (ToE) for the period 2004-2018 inclusive identified only five systematic reviews. All of these addressed pedagogical knowledge rather than 'topic' knowledge, and we could not identify any papers related to the use of evidence-based material in teaching.

2. In our original research protocol we selected the end of 2015 as the cut-off date for inclusion. Because the processes of inclusion/exclusion and data extraction were complicated by the heterogeneous nature of the selected set of systematic reviews, and as changes in circumstances also meant that two members of the team would not be available for this task, we felt that we could not ensure that any extension would be consistent with the original study, particularly regarding the interpretation of Inc-2 and Inc-3. As explained in §1, we also performed and published two other analyses on this dataset, further delaying the production of this paper. There is therefore the possibility that, in the time between our cut-off date and the submission of this paper, there have been some changes in the way that systematic reviews are reported, and obviously, new topics will have been covered. Informally, based upon our experiences over this period reviewing systematic reviews, as well as some informal monitoring of journal contents, we have not observed any developments that would have significantly affected our findings. It is also possible that the balance of systematic reviews across the SEEK KAs might have changed. However, topics such as design and requirements elicitation continue to present some real challenges to conducting rigorous primary studies [34], limiting the scope for performing systematic reviews in those KAs.

3. When calculating the DARE score for a review, our definition in Table 3 does not address the question of whether or not the search conducted by the reviewers was adequate for the purpose of the systematic review. While it would be desirable to make such an assessment, we did not feel that our knowledge of the research areas related to the review topics would allow us to do this in a consistent manner.

4. The selection process that we used to identify relevant systematic reviews did require an element of human interpretation, including for inclusion criterion Inc-2 (relevance to introductory teaching). We drew upon our experience of teaching software engineering to determine whether a review addressed a suitable topic, as well as using randomly allocated pairs of team members for all aspects of this part of the process, and discussing our decisions.

5. Further interpretation was required for the purpose of identifying the findings and recommendations embodied in a review (criterion Inc-3). The quality of reporting did not always assist with this [18], so as a check, we sought to consult the original authors wherever possible. We received responses from approximately half of these, with none suggesting more than minor rewording or clarification, which suggests that we managed to perform this task fairly well.

6. Our supplementary data extractions were performed by two of the authors, partly to ensure consistency of interpretation. For our interpretation of the synthesis methods adopted in the 49 systematic reviews, we were able to check a proportion of our decisions against a baseline study [23].

7. For categorisation of the studies against the SEEK we again used two members of the team, to provide consistency in our allocations. Since many systematic reviews span different knowledge areas, this is very much an issue of interpretation, and we would certainly advise anyone seeking knowledge about a topic to check whether it appears as an element in other studies (particularly those categorised as PRO).

8. Our informal assessment of how 'useful' reviews were (summarised in Table 12) used a simple ranking procedure, as described in Section 7.1. However, our individual assessments, as indicated by the different numbers of 'y' rankings used by each assessor, were inevitably influenced both by our own teaching experience (as we note) and possibly also by our familiarity with the topics of specific reviews.

9. We were unable to obtain assessments of the strength of evidence for the findings from most of the reviews. Where an assessment was available, the findings were generally rated as being based on evidence of low or moderate strength. Hence there may be some variation in consistency between different elements of our overall dataset, particularly as regards the confidence that we can place in the findings from each systematic review.
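For readers unfamiliar with DARE-style quality scoring, its arithmetic can be sketched as follows. This is a minimal illustration only: the criteria actually used in this study are those defined in Table 3 (not reproduced here), so the criterion names below are hypothetical; the yes/partly/no → 1/0.5/0 scoring convention shown is the one commonly used with DARE-based assessments in software engineering tertiary studies.

```python
# Illustrative sketch of DARE-style scoring: each quality criterion is
# answered yes/partly/no, mapped to 1/0.5/0, and the per-criterion
# scores are summed to give the review's overall quality score.
SCORES = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def dare_score(answers: dict) -> float:
    """Sum the per-criterion scores for one systematic review."""
    return sum(SCORES[a.lower()] for a in answers.values())

# Hypothetical assessment of a single review (criterion names invented):
review = {
    "inclusion/exclusion criteria reported": "yes",
    "quality of primary studies assessed": "partly",
    "primary studies adequately described": "no",
}
print(dare_score(review))  # -> 1.5
```

A review scoring above a chosen threshold (e.g. half the maximum) would then be treated as having acceptable quality; the threshold itself is a judgement made by the tertiary-study team.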

The way ahead?
Conducting a tertiary study of this form requires quite extensive interpretation of the reported findings from a heterogeneous set of systematic reviews. So an obvious question is whether this knowledge, assuming it is considered useful to the community, can be extracted from the reports of systematic reviews by other (and better) means in the future. In particular, it would be better to avoid the need for studies such as this one that involve retrospective analysis, both because the distance from the original study means that much of the knowledge about how it was done may no longer be available, and because the original systematic reviewers can be expected to possess greater expertise about the topic of a review, as well as being better placed to assess the quality of the evidence [34].
A relatively simple and efficacious mechanism for enabling this does exist, and is already used in other disciplines [55]. In healthcare research, where the needs of policy-making may sit alongside those of practice, this consists of requiring that a systematic review and its findings are reported as a set of documents of different lengths and levels of abstraction, in order to meet different needs. The Canadian Health Services Research Foundation describes this as a 1-3-25 format, consisting of: a one-page summary of 'take-home' messages; a three-page executive summary; and a more detailed report of up to 25 pages. Like any such mechanism this is not infallible of course, and as Oliver and Dickson comment: "some teams were better than others at producing a policy-friendly report" [67].
Adapting this model to the needs of software engineering appears to be quite feasible. At its simplest, it would consist of requiring, as a condition of publication, that authors also provide a one-page summary of their findings, worded in a form that makes them readily accessible to practitioners and students, and including an estimate of the strength of the supporting evidence. Appendix A provides two examples of a one-page summary to illustrate this concept. The first is a summary of this tertiary study, while the second is a summary of a systematic review from our set of 49, for which one of us was an author (#154). When used for healthcare reviews, the single page often consists of a brief summary of the purpose of a study followed by a set of bullet points that summarise the key findings, and we have largely adopted this model. However, for teaching purposes this may need to be supplemented by a more effective visual structure, such as the one proposed for evidence briefings [21], and our choice of layout has also been influenced by that model.
There are obviously a number of practical issues to address in creating such a mechanism (including obtaining the cooperation of journal editors). It would require reporting guidelines for authors (we already have a set of these in [18]); a means of checking that the summary was appropriate; and (preferably) some central means of indexing the summaries. But in exchange, adopting such a system would make it more likely that future reviews have their findings translated for practice by the people most familiar with the material. Prospective reviewers would also be able to check more easily whether there was an existing systematic review addressing their planned topic.

Reflections and Conclusions
Our tertiary mapping study identified 49 systematic reviews, published in the period 2004-2015, that contained findings and recommendations considered to be useful for teaching about software engineering (RQ1). Within these, we were able to identify a smaller number that did provide guidance and information that could be used to help make "effective choices" (RQ2).
However, it is evident that useful findings are available from only a small proportion of the published systematic reviews that we surveyed. There may be many reasons why this is so: one may simply be that, in software engineering, the systematic review has so far served mainly as a tool to aid research, and as a useful training exercise and preparation for PhD students.
This underlying emphasis upon research may also explain many of the quality issues that have been identified regarding the conduct and reporting of systematic reviews in software engineering. Some may well arise because there is no requirement to report to an external sponsor; others because the reviews are sometimes conducted by relatively inexperienced researchers. In contrast, other disciplines tend to use information specialists to undertake much of the work involved in searching for and selecting material [13].
Following on from these conclusions, empirical researchers and others might wish to consider how researchers can better provide information about the outcomes from systematic reviews, so that this is of greater use to others. From this study, and from the other analyses we have performed upon our data, we can suggest three mechanisms that could contribute towards achieving this aim.

1. Providing better reporting of the conduct of a systematic review. In our analysis of reporting quality [18] we identify 12 lessons about reporting, and suggest a checklist that should be used by reviewers (and authors). Better reporting can help to establish the provenance of the findings from a review, and so help justify its publication.

2. Facilitating better reporting of the findings from a study. In part this overlaps with item 1 above, in that the reporting of a review should make its findings clear. This was the case for fewer than one in five of the 276 systematic reviews that we examined, and even for the 49 included in the final set, we often found it difficult to extract the findings and recommendations, as these were sometimes spread over different sections of a paper. In addition, as discussed in the previous section, making the provision of a summary of findings a prerequisite for publication would also help to make the findings more widely and readily available to others. This is clearly a concern that is shared by other disciplines, hence the emphasis upon such mechanisms as the 1-3-25 model.

3. Creating the means to provide effective curation of the knowledge about and from systematic reviews, particularly as the number of these increases. As a newcomer to the use of systematic reviews, software engineering has so far not embraced the idea of creating anything equivalent to the Cochrane and Campbell collaborations that oversee and facilitate the conduct of systematic reviews in clinical medicine and social science respectively. These bodies play a number of roles, including providing public information about relevant findings from systematic reviews.

We see the first two mechanisms as needing to be adopted in collaboration with those journals that publish systematic reviews, and all three may need the involvement of the professional bodies. What is clear from our findings, though, is that without such interventions, systematic reviews in software engineering will very likely remain a tool used mainly for academic research rather than, as in other disciplines, forming a valuable (and valued) source of knowledge for software developers, teachers, and researchers.

A. Examples of a one-page summary
What support do systematic reviews provide for evidence-informed teaching about software engineering practice? -- Implications & Messages

Implications
Systematic reviews provide a rigorous way of gathering together evidence obtained from empirical studies. Since 2004, systematic reviews have been used quite extensively by software engineering researchers to examine a range of software engineering practices and the use of different technologies.
The findings from a systematic review provide objective and unbiased knowledge about using a practice that can underpin advice to practitioners, teachers and students, and help them assess the likely benefits of adopting it in a particular context.

Key Messages
• Systematic reviews can provide useful guidance for practice, and for teaching about practice, that can take a range of forms, including:
  - a digest of the experiences of others (for example, related to adopting a new practice such as agile development);
  - a checklist of the factors that should be considered when thinking of adopting a new practice or technique;
  - comparisons between different options, such as occur when identifying the most dependably effective practice to use for requirements elicitation.
• Much of the guidance and knowledge provided by the systematic reviews was derived from primary studies that involved observing how practising software engineers performed tasks 'in the field'.
• Researchers need to provide their findings in a more 'end-user-friendly' form (such as by using a one-page summary like this one) that also explains what the implications of the findings are. This will help teachers, students and practitioners to identify those messages that are useful to them.
• A characteristic of software engineering is that, unlike other disciplines, topics for study using a systematic review are chosen by researchers themselves, rather than being selected to meet the needs of practitioners, policy-makers or funding agencies.

• There is a need to provide readily-available indexing of the findings from systematic reviews, to assist end-users with finding material that they need. This would also help researchers to identify where new systematic reviews, or updating of existing ones, would be useful. We suggest that this is a role that professional bodies such as the ACM could assist with, working in collaboration with journal editors.

Characteristics of our systematic review
From the 276 candidate systematic reviews published up to the end of 2015, we selected 49 that provide knowledge that we considered useful for teaching and practice. For each of these we describe:
• the topic;
• the number of primary studies used (and the types of these, when known);
• how the outcomes from the primary studies were synthesised;
• the key findings relevant to teaching and practice.

Implications
Object-oriented design patterns offer a mechanism for transferring experience about useful design structures (knowledge schemas). Our study sought to determine how extensively the popular GoF (Gang of Four) patterns have been studied empirically, and what might be learned from these studies. It also looked at the consequences that might arise from using patterns when designing software applications.
Activities such as software design pose a challenge for empirical studies because of their creative nature. Partly because of this, only a small number of studies involving design patterns were available. In turn, these could only provide limited guidance about the usefulness of the relatively few patterns that have been the subject of multiple studies, and were unable to provide clear guidance about when it is appropriate to make use of specific design patterns.

Key Messages
With regard to the effective use of OO design patterns:
• There is reasonably good support for the claim that using patterns can provide a vocabulary that improves communication between developers and maintainers, at least when the way that the patterns have been used in the design is well-documented.

• There is no support for claims that using patterns helps novices learn how to design applications.
• It appears likely that the successful use of patterns is highly dependent upon both the nature of individual patterns and the experience of the developers concerned. Simply using patterns does not ensure good design, they have to be used appropriately.
And for the studies themselves:
• The primary studies that were available mainly focused upon studying the ease with which applications created using patterns could be understood and modified; only a few examined issues related to the use of patterns to create new software.
• Many of the experimental studies used students as participants, which may well be inappropriate, and overall, the variations observed in the findings may arise from the complications of having a large number of confounding factors.
• We recommend that future empirical studies focus upon studying the use of specific patterns, and avoid making use of student participants or asking participants to perform small-scale tasks. We also suggest that case studies may be more suitable vehicles for exploring the complex cognitive issues involved in using patterns.

Characteristics of our systematic review
Our study identified 10 papers (from 611 candidates) that described 11 experimental studies about the use of the OO design patterns described by the GoF. A further seven informal observational studies were used to help interpret their findings. We noted that:
• Only Composite, Observer and Visitor had been studied fairly extensively.
• Few other patterns had been studied in more than two primary studies.

B. The findings and recommendations from the reviews

Findings
1. Organisations adopted CMM-based SPI mainly to improve product quality and project performance, but also to improve process management.
2. Satisfying customers was not a common reason for adopting CMM-based SPI.
3. The two most common process-related reasons for adopting SPI were to make processes more visible and measurable.

Recommendations
None.

Author Response
The authors observed that meeting 'customer demands' in the form of a contractual requirement was a fairly major reason for adoption, rather than 'customer satisfaction'. They also observed that providing assurance for customers through high ratings was a legitimate reason for recommending the adoption of CMM(I).

Findings
"Publications reviewed suggest that male IT workers are more likely to leave an organisation than their female counterparts. Younger employees also appear more inclined to leave (mainly due to lower job satisfaction) compared to their older counterparts. Importantly, higher educated IT professionals are more likely to leave a company because of low job satisfaction. Additionally, married IT practitioners, as well as those with a lower organisational tenure, have a lower tendency to leave an organisation. IT managers can use these insights to assist with their recruitment decisions and employee retention initiatives."

Recommendations
1. To overcome role ambiguity and role conflict, managers should:
  a. communicate clearly and provide clear and precise information about what they expect from their IT professionals.
  b. make sure that their personnel have the required training and knowledge to carry out their jobs well.
  c. allow their IT professionals to know the intent of and reasons for doing a specific task.
  d. better design and define tasks so that the start and end of each task is clear.
  e. clearly define the sequence in which sub-tasks are carried out.
  f. determine task priorities associated with the job.
2. To overcome perceived workload demands, managers should maintain an awareness of the workloads of their highly valued IT professionals. Direct face-to-face communication has been reported as the most effective means of overcoming this problem.
3. IT managers should be conscious of the benefits of enhanced employee autonomy, because lack of autonomy can lead to a turnover decision through work exhaustion. Managers should provide IT professionals with enough autonomy and flexibility to reduce the exhaustion they might feel because of the structure of their work, and should design IT roles that offer enough freedom for IT professionals to be innovative and pursue their own thoughts and ideas.
(plus five other recommendations, omitted for reasons of space)

Author Response
None.

Findings
The Software Architecture Change Characterization Scheme (SACCS) was developed as a result of the review and can (could?) be used to assist developers and maintainers in assessing the potential impact of a proposed change and deciding whether it is feasible to implement the change. Where the change is crucial, the scheme will (could?) help generate consensus on how to approach change implementation and provide an indication of the difficulty.

Recommendations
None.

Author Response
The authors reviewed our analysis and agreed with it.

Findings
3. In larger systems where concern scattering and tangling is expected to be widespread, introducing aspects is likely to significantly reduce the number of lines of code.
4. AOP has a positive effect on modularity (but context of use should be carefully assessed).
5. AOP has the potential to develop evolvable and maintainable software.

Recommendations
None.

Author Response
None.

Findings
1. Source-code-based solutions identify dependencies through code constructs such as function calls and shared variables. Approaches that use this concrete evidence have a high degree of accuracy in the dependencies they identify, which makes them very reliable and very attractive to practitioners, as the resulting information is very tangible. However, they are less suited to analysing runtime system behaviour.
2. Solutions using diagrammatic and semi-formal descriptions are more appealing to practitioners following architecture-driven approaches. Practitioners find these solutions useful for describing dependency information at an architecture level. However, for efficient application of these solutions, it is necessary to keep the system requirements, design, and implementation up to date and synchronised.
3. Solutions using run-time and configuration information are applicable in practice due to two main characteristics. First, these solutions are non-intrusive with respect to the development activities. Often, in a research setting, the overhead and maintenance cost of an infrastructure to collect data for dependency analysis is overlooked, whereas practitioners are more concerned about the cost and overhead of maintaining reliable and up-to-date instrumentation of their system. This is even more important in heterogeneous situations where multi-vendor components are used and instrumentation cannot be inserted into the system because of security, licensing, lack of knowledge, or other technical constraints. Second, although these solutions are limited by their coverage and links to the system source code, practitioners consider them valid approximations, especially for problem-driven approaches.

Recommendations
None.

Author Response
None.

Findings
2. In terms of context, there is no evidence to suggest that the maturity of systems or the language used is related to predictive performance.
3. It may be more difficult to build reliable prediction models for some application domains (e.g. embedded systems).
4. The independent variables used by predictive models that work well seem to be sets of metrics.
5. Models that use KLOC perform no worse than those where only single sets of other static code metrics are used.
6. The spam-filtering technique based on source code performs relatively well.
7. Models that perform well tend to use simple, easy-to-use modelling techniques such as Naïve Bayes or Logistic Regression. More complex modelling techniques such as SVM tend to be used by models which perform relatively less well.
8. Successful models tend to be trained on large datasets which have a relatively high proportion of faulty units.
9. Successful models tend to use a large range of metrics to which feature selection was applied.
10. For successful models, default parameters for the modelling technique were adjusted to ensure the technique would perform effectively.

Recommendations
None.

Author Response
Agreed with our extracted findings.

Findings
3. Time-series analysis/statistical process control also shows good results in identifying sharp shifts in process performance, as well as shifts due to changes in the process.
4. To be able to give a recommendation on the predictive accuracy of regression for software productivity, the model should be built on a subset of data points and then used to predict the remaining data points. Thereafter, the difference between predicted and actual values should be observed and measured.

Recommendations
1. When using univariate models it is important to be aware of high variances and difficulties when comparing productivities. Hence it is important to carefully document the context, to be able to compare between products. Comparison should not be based on the productivity value alone, and it is recommended that a scatter diagram be produced based on inputs and outputs to assure comparability of projects with respect to size.
2. When comparing projects, it should be made clear what the output and input consist of; for example, which lines are included in LOC measures.
3. When possible, use multivariate analysis when data is available, as throughout the software process many outputs are produced. Otherwise, productivity is biased towards one measure (e.g. LOC).
4. Managers need to be aware of validity threats present in the measures when conducting a comparison. Data should be interpreted with care and with awareness of possible bias and noise in the data arising from measurement error.
5. No generic prediction model can be recommended, as studies do not clearly agree on what the predictors for software productivity are. In fact, the predictors might differ between contexts. Hence companies need to identify and test predictors relevant to their own context.

Author Response
The authors identify the following additional finding. 5. Data envelopment analysis is promising, as it supports multivariate productivity measures and allows identification of reference projects to which inefficient projects should be compared. This helps with identifying projects from which one can learn, and that are similar, so that evidence may be transferable.

Findings
No evidence of widespread adoption and impact of SPSM research on industry.

Recommendations
When using software process simulation models for scientific purposes, it is necessary to ensure that the appropriate steps with respect to model validity checking have been conducted, and not to rely upon a single simulation run.

Author Response
We have used a slight rewording of the recommendation suggested by the authors.

Findings
A set of 38 best practices has been collected and classified into five main areas: method, supportive tool, procedure, documentation, and user best practices.

Recommendations
The paper has identified a set of best practices to support and inform designers and assessors for software process assessment.

Author Response
We have used some rewording suggested by the authors.