Guidelines for using empirical studies in software engineering education

Software engineering education is under constant pressure to provide students with industry-relevant knowledge and skills. Educators must address issues beyond exercises and theories that can be directly rehearsed in small settings. Industry training has similar requirements regarding relevance, as companies seek to keep their workforce up to date with technological advances. Real-life software development often deals with large, software-intensive systems and is influenced by the complex effects of teamwork and distributed software development, which are hard to demonstrate in an educational environment. A way to experience such effects and to increase the relevance of software engineering education is to apply empirical studies in teaching. In this paper, we show how different types of empirical studies can be used for educational purposes in software engineering. We give examples illustrating how to utilize empirical studies, discuss challenges, and derive an initial guideline that supports teachers in including empirical studies in software engineering courses. Furthermore, we give examples that show how empirical studies contribute to high-quality learning outcomes, to student motivation, and to the awareness of the advantages of applying software engineering principles. Having awareness, experience, and understanding of the actions required, students are more likely to apply such principles under real-life constraints in their working life.

part of the trigger material provided to students. Experiments can also be used as part of project-based learning (Blumenfeld et al., 1991), where students actively explore real-world challenges and problems. Instructors can introduce experiments when important decision-making and knowledge acquisition needs emerge. In order to support the design of constructive education with experiments embedded, as well as to support experimentation within more traditional teaching, teachers would benefit from guidelines or sets of ready-made experiment templates that they could use either when planning or dynamically during

findings. An important conclusion drawn from the overview is that successful observation of a phenomenon as part of an empirical study should not be an end in itself. Rather, students should have enough time to get familiar with the ideas and concepts associated with the phenomenon.

Empirical Studies in Software Engineering Education

In software engineering, experimentation was established in the 1980s. Basili et al. (1986) were among the first to present a framework and process for experimentation. Since then, software engineering experiments in classroom settings have become more common. However, the focus of most such experiments has been to gain research knowledge, with students participating as research subjects. Less attention has been paid to using empirical studies with an educational purpose in mind, where the experiment has an explicit didactic or experiential role. Few curricula are available that include the execution of empirical studies as an integral part of a lecture (e.g.,

(2007) finds that students' learning process is improved and that including carefully designed experiments in software engineering courses increases their motivation. A large majority (91%) of students who participated as subjects in the experiments found them useful, and the number of high passes increased by 41% after introducing experiments.

While many articles report on empirical studies using student subjects, and some articles report on the educational benefits of such studies for students, few papers address empirical studies as an overall strategy for software engineering education. In particular, there is a lack of guidance for using empirical studies in software engineering education in cases where students may not only be research subjects but could also be involved in carrying out the studies. An overview that discusses different types of empirical studies, their suitability for education, and the challenges involved in their execution is missing.

3 RESEARCH APPROACH

The goal of this paper is to develop guidelines that help teachers integrate empirical instruments in software engineering education.
The guidelines are based on a reflective analysis of our experiences with teaching courses that use empirical elements to support learning objectives. A reflective approach has been recognised by many educational researchers as a prerequisite for effective teaching (e.g., Hatton and Smith (1995), Cochran-Smith (2003), Jones and Jones (2013)). Reflective practice, with roots in the works of Dewey (1935) and Schön (1983), calls for continuous learning through deliberate reflection in and on action. Using empirical instruments in software engineering education is a way to encourage students to reflect, but teachers should do the same. This paper represents one outcome of reflection-on-action: we analyse materials, assignments, notes, course syllabi, schedules and structures, evaluation data, and recollections of important factors in a number of our own courses, and derive guidelines that we believe would help teachers implement similar courses.

Our approach is mainly qualitative and has proceeded from gathering a list of study types, through analysis of materials and experiences relevant to each study type, to the guideline proposed in this paper. Here, analysis refers to the categorisation of materials and the identification of connections and relationships between categories. Our main goal of developing the guideline helped to scope our investigation, and we thus left out material which did not serve this goal. We began by sifting through parts of the published literature on software engineering education and methods in order to shape a first outline of a taxonomy of study types. In particular, we were influenced by Höst (2002) and Carver et al. (2003) when considering software engineering education, and by Shull et al. (2008) and Kitchenham et al. (2015) when considering the methodological aspects.
Our search was purposive rather than systematic, as we sought to construct a taxonomy (see Section 4) for use in the guidelines rather than to represent the state of the art in the scientific literature.

After constructing the taxonomy, we analysed qualitative data from our own courses and arranged it according to five categories: (1) learning goals, purposes, challenges, and validity; (2) establishing context and goals, and determining a study type; (3) motivating students; (4) scheduling; (5) other considerations. We summarised the qualitative data in each category by removing the details specific to our courses and generalising the insights so that they can be applied more broadly. We then constructed the guideline by cross-referencing the categories so that the purpose, challenges, and validity concerns relevant for each study type are shown. The result is given in Section 5.

Finally, we revisited the material from our courses and picked examples that illustrate how we tackled some of the choices teachers face when using empirical instruments for education. We also addressed the specific question of evaluating our teaching by providing data from formal as well as informal evaluation (see Section 6). This serves as a first validation of the guidelines.

The software engineering literature includes a number of empirical studies with students, and often these studies were conducted in an educational setting. In this section, we give an overview of (empirical) study types utilised in software engineering education. We list common instruments from empirical software engineering and provide examples of how these instruments can be applied to teaching. The overall goal of this section is to summarise different study types that can be used in software engineering education.

The summary supports the development of an initial common taxonomy that categorises study types.

Case studies help to answer explanatory questions of the type "How?" or "Why?" They should be based on an articulated theory regarding the phenomenon of interest (Yin, 2009).

For instance, if the efficiency of a particular method is subject to investigation, one experiment group is assigned to solve a problem with the "new" method, while another group works on the same task, but using another method. Results are then compared, e.g., to accept or reject a hypothesis.

Model (Tuckman, 1965), and in a complementing experiment, students could experience the effects of
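To make the comparison concrete, the analysis step of such a two-group experiment can be sketched as follows. This is a minimal illustration only: the helper function and all completion times are hypothetical, not data from any course described in this paper.

```python
from statistics import mean, variance
from math import sqrt

def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples (unequal variances allowed)."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1)
    se = sqrt(va / na + vb / nb)                   # standard error of the mean difference
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical task-completion times (minutes): "new" method vs. control method.
new_method = [42, 38, 45, 40, 37, 41]
control = [55, 49, 52, 58, 50, 54]

t = welch_t(new_method, control)
print(f"t = {t:.2f}")  # a large |t| is evidence against the null hypothesis of equal means
```

In a course setting, students would compare |t| against a critical value (or compute a p-value) to decide whether to reject the hypothesis that both methods perform equally.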

Continuous experimentation refers to a constant series of experiments to test the value of software capabilities, such as features, early in the design process (Fagerholm et al., 2014a, 2017).

For researchers, continuous experimentation helps to better understand processes, methods, techniques, tools, and organizational constraints regarding building the "right" software.
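As an illustration, the analysis step of a simple two-variant feature experiment might be sketched as follows. This is a minimal sketch with hypothetical telemetry counts; the two-proportion z-test is a standard statistic used here for illustration, not a method prescribed by the cited works.

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing the conversion rates of two feature variants."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)              # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))     # pooled standard error
    return (p_b - p_a) / se

# Hypothetical telemetry: users who completed a task with the old vs. new feature.
z = two_proportion_z(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
print(f"z = {z:.2f}")  # |z| > 1.96 indicates significance at the 5% level
```

Students running such an experiment would use the resulting statistic as evidence for or against the product-related assumption under test.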

From the teaching perspective, continuous experimentation helps students understand the connection between software development techniques and business. Since such experiments must begin by analysing product-related assumptions, students naturally come into contact with the product's business model. They must then make the link between such assumptions and the corresponding technical implementation and devise an experiment which allows them to refute or support the highest-priority assumption, yielding evidence for a product-related decision. Continuous experimentation can thus foster awareness of relevant criteria for software beyond cost, reliability, and effort, e.g., usability, usefulness, success (e.g., contribution to a higher-level organizational goal), and scalability (e.g., monetization from a significant number of users).

calibrating the model to a specific context can be done efficiently.

Simulation may be a suitable teaching aid in many situations, but it should be used only when a valid model can be obtained. Otherwise, there is a risk that students observe effects that are not realistic, and thus incorrect learning might occur. Well-researched models with extensive validation are necessary.

Other disciplines, such as mechanical engineering or molecular chemistry, already use simulation to analyse technologies and processes and thereby reduce the need for real experiments. In software engineering, this trend is still focused on product aspects, such as understanding the dynamic behaviour of control software. However, simulation has already been applied successfully for understanding and analysing software processes as well as for educational purposes. Process simulation can be combined with real software engineering experiments (for example, by using empirical data to calibrate a model or by comparing such data with simulation results) or used on its own.

• Both can be done in parallel (e.g., to broaden the scope of the experiment)

From the research perspective, software process simulation can be seen as an additional, efficient

simultaneously perform the study. The student is thus a participant-observer in such studies.
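As a toy illustration of process simulation for teaching, a Brooks'-law-style effect of communication overhead on team progress can be sketched as below. This is not a validated model: the productivity and overhead parameters are invented, and a real course would use a well-researched, calibrated model instead.

```python
def simulate_project(team_size, total_tasks, productivity=1.0, overhead=0.06, weeks=200):
    """Toy process model: weekly progress shrinks with pairwise communication
    overhead, so adding people eventually slows the project down."""
    pairs = team_size * (team_size - 1) / 2          # communication links in the team
    weekly = max(team_size * productivity - overhead * pairs, 0.0)
    done = 0.0
    for week in range(1, weeks + 1):
        done += weekly
        if done >= total_tasks:
            return week                               # week in which the project finishes
    return None                                       # never finishes within the horizon

# Students can explore how team size affects completion time.
for n in (4, 8, 16, 32):
    print(n, simulate_project(team_size=n, total_tasks=400))
```

With these invented parameters, a team of 16 finishes faster than a team of 8, while a team of 32 is slower again, letting students observe a non-linear staffing effect without running a real project.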

Individual studies depend on the setting in which they are carried out, e.g., requirements for a semester

In many cases, they must participate in OSS and build products on OSS platforms in order to access customers who are already using them. Learning how to function in this context, e.g., working in a self-organising virtual team, requires particular knowledge and skills.

Therefore, OSS projects are fruitful ground for setting up a sophisticated teaching environment. Individual

In this section, we develop an experience-based guideline to integrate empirical studies with software engineering courses. We base the guideline on experiences gathered from our own software engineering courses, categorised from several perspectives. We first generalise common purposes, challenges, and validity considerations. These serve to determine the appropriateness of a particular study in a given context. We discuss appropriateness from two perspectives: (1) teaching at universities and (2) industry training. Finally, we share our experiences, discussing several aspects to be considered when integrating empirical studies with software engineering courses, e.g., motivation, scheduling, and effort.

To summarise the aforementioned kinds of empirical studies, we create the taxonomy presented in Tables 2-4. In this initial taxonomy, we include different purposes, challenges, and validity constraints to support the categorisation of study types and the analysis of appropriateness in certain contexts. We identified a total of ten purposes, describing major learning goals associated with empirical studies in software engineering teaching that we consider important (Table 2). Complementing the purposes, we identified eight challenges that should be taken into account when designing empirical studies for educational purposes (Table 3).

Table 2. Summary of learning goals and purposes for empirical studies in education.

P01 Learn to formulate a research problem. Students face a (real-world) problem that needs investigation. Therefore, the learning goals are to:
• Capture the problem.
• Formulate hypotheses regarding users or customers and their behaviour.
Due to the complexity of realistic, real-world settings, this task is demanding, e.g., when formulating a problem in a scientifically sound way while keeping the (industry) partners' needs in mind.

P02 Learn to collect relevant data. Collecting data in realistic settings is a demanding task, as data is usually scattered across different sources. The learning goal is to develop a meaningful data collection strategy that includes data from multiple sources within a setting, optionally backed up by further external data (from outside the given setting).

P03 Learn to analyse real-life data. Real-world data is often incomplete or confidential, thus hampering analysis. The learning goal is to develop a data analysis strategy that overcomes limited data.

P04 Learn to draw conclusions. Based on the collected and analysed data, the overall learning goal is to draw conclusions. Thus, in the (realistic) setting, students need to learn to:
• Gather empirical evidence on which conclusions are based.
• Test theories and/or conventional wisdom based on evidence.
• Draw conclusions from (limited) data and develop a strategy to utilise findings in practice.
The purpose is to gather findings or evidence, and to analyse the findings for relevance in the respective setting. Eventually, findings must contribute to solving the original problem; thus, another learning goal is to develop transfer strategies that support the utilisation of findings in practice.

P05 Learn to experience and solve a specific problem. A major purpose is to let people experience certain situations and develop situation-specific solution strategies or approaches. This leads to:
• Experience regarding the problem-solution relation, e.g., understanding the relationship between user behaviour and software design.
• Increased knowledge about a problem (domain).
• Increased knowledge about technology and methods.
• Increased knowledge about potential/feasible solutions and/or solution patterns.
The skills addressed by this learning goal are basic prerequisites for developing solutions in general: they address a specific problem, but also allow for developing transferable knowledge that can be applied to different contexts.

P06 Develop a software artefact. In software engineering, software artefacts, especially prototypes, serve the (early) analysis of a specific problem, as prototypes allow for implementing and demonstrating solution strategies. The learning goal thus comprises:
• Create a (software) prototype to demonstrate a solution approach/strategy (feasibility study).
• Create artefacts to elaborate the advantages and disadvantages of potential solution approaches/strategies (comparative study).
• Create artefacts to establish (quick) communication and feedback loops.
Software artefacts in general, and prototypes in particular, serve the elaboration of a problem and help to understand potential solutions. That is, such artefacts pave the way to the final solution.

P07 Coaching. Another learning goal is to make stakeholders familiar with new methods and tools. Hence, the utilisation of the new methods/tools needs to be trained, i.e., the necessary skills must be developed and practised.

P08 Change of culture. Continuous experimentation comprises a number of the other learning goals. However, continuous experimentation is more of a general organisational question than a project-specific endeavour. Therefore, utilising continuous experimentation also implies a cultural change toward experimentation in the implementing organisation.

P09 Learn about the impact. Specific behaviour or decisions impact a system and/or a team, e.g., changing requirements or fluctuation in team composition. Therefore, it is important to learn about the effects that certain behaviour and decisions have in large and/or dynamic contexts.

P10 Learn about long-term effects. Seemingly "local" decisions might cause "global" effects. Thus, it is important to know about the long-term and/or snowballing effects caused by single decisions, e.g., a shortcut in the architecture leads to increased maintenance cost (technical debt).

C01 Finding or creating relevant cases. The major challenge is to find and define proper and relevant cases, which bears some risks:
• A case may become irrelevant while conducting a study (e.g., changing environment, changing context parameters).
• A study might go in an unexpected direction (learning curve and, in response, focus shift).
• A relevant case must be narrowed down to the participating subjects, e.g., students have different skills and goals than professionals.
Cases must be balanced, e.g., learning goals must be achieved regardless of whether the original case loses its relevance (procedural over technical knowledge), or students need to finish a thesis regardless of whether industry partners can apply the study findings.

C02 No guaranteed outcome. Even if a problem is found, there is no guarantee that a study will lead to an outcome. Furthermore, the immediate applicability of the outcomes is not guaranteed, which means extra work for industry to transfer results into product development.

C03 Time constraints. Apart from the appropriateness of the actual problem, time constraints limit the study. Time constraints can occur as:
• Limitations dictated by the curriculum/course schedule.
• Limitations dictated by industry schedules, e.g., product development cycles.
• Limitations dictated by individual schedules, e.g., students who are about to finish their studies.
Therefore, time constraints, together with resource limitations, define the basic parameters that affect the study objects (problem, potential/achievable solutions, completeness of results, validity of outcomes, and so forth).

C04 Resource limitations. Studies require resources and, thus, the availability of resources limits the study. Resource limitations can occur as:
• Availability of (the right) students, e.g., if a study requires students with a specific skill profile.
• Motivation of students to participate in a study (personal vs. study goals).
• Availability of industry resources (personnel tied to a study).
• Options to adequately integrate the study with (running) company processes.
Availability in particular is a critical factor. For instance, while one experiment consumes resources once, repetition and replication require a long-term commitment regarding resource availability, which implies significant investments of time and/or money. In order to make resources available, participating partners need to receive a sufficient benefit, which is often hard to define in empirical studies.

C05 Limited access to data. Although it is one of the purposes in terms of learning goals, defining adequate hypotheses and variables that can be investigated in a course is challenging. Proper measurements must be defined, taking into account that not all data may be available, e.g., confidential data. Access to user data is especially challenging (a way out could be utilising OSS projects), as this data is usually strictly confidential.

C06 Built-in bias. A special problem is bias. Each particular setting comes with an inherent set of biases, e.g.:
• Students' special skills affect the study, and students who are trained in advance of the study affect the outcomes.
• Too much or too little context knowledge of the subjects affects the study.
• Competing goals of the participants (especially students vs. practitioners) affect the study, e.g., students might try to optimise a study to achieve better grades while compromising the study goals.
Empirical studies suffer from certain limitations, and in the context of teaching, special attention needs to be paid to bias and threats to validity.

C07 Communication. Empirical investigations create knowledge, data, and potentially software artefacts. Therefore, results need to be quickly communicated to the participants. Quick feedback helps, e.g., to determine the relevance of results, the appropriateness of the instrument, and necessary adjustments. Thus, fast feedback loops are necessary.

C08 Creating a simulation model. For simulation-based research/teaching, setting up a simulation model is a demanding task, which consumes time and resources and thus generates cost. The entire domain under consideration must be captured to create a model that allows for generating useful data.

Manuscript to be reviewed
Computer Science

Making contact with industry, and the prospects of finding a job or starting a company, foster students' motivation to actively participate in courses, and may contribute to higher motivation, engagement, and

Nevertheless, apart from all the potentially positive motivating drivers, a major driver for students is to get the best possible grade. Also, the number of credits must reflect the effort required to conduct the study. For students, the amount of time required to receive a credit point is an important consideration.

Since empirical studies are demanding in terms of effort, and credits form the compensation, software engineering courses that include empirical studies must adequately "remunerate" the students for their efforts.

Having defined the goals and acquired (motivated) students and, optionally, partners from industry, the challenges C03 and C04 (Table 3)

Table 7. Further study selection criteria for different study types. Each study type is ranked relative to the others on three levels and may span more than one level (LO: low, ME: medium, HI: high).

Degree of execution control

Individual Studies
Motivation to participate in a study
Motivation created by the study

teachers. Besides the coaching, teachers monitor the correct application of empirical methods to collect and analyse data.

Planning the study, and justifying the study plan under all time constraints, needs to be done carefully and requires the commitment of all participants to ensure the availability of personnel and resources.

Apart from the criteria already discussed, we wish to highlight some further criteria that may influence the selection of study types for educational purposes. First, in Table 7, we summarize well-known criteria from the literature (e.g., Wohlin et al. (2012)) and further criteria that we consider relevant for study type selection. The table includes an experience-based rating for the criteria. However, this rating has to be considered a subjective recommendation, as it is hard to precisely define, e.g., the degree of motivation or student satisfaction.

We note that the knowledge and skill level of students should also be taken into account when selecting and tailoring an empirical instrument for teaching. In Table 8, we provide an experience-based assessment of how different study types can be adjusted for different levels of students. Two student levels are

Table 8. Adjusting study types to student levels. Length of study is indicative and based here on European standards.

Experiment
Simple experiments with few variables. Experiment design given.
More complex multivariate experiments. Own experiment design.

Case Study
Limited topics, restricted to a chosen context, few informants. Little or no generalisation. Exploratory, descriptive, or intrinsic case studies.
Topics related to well-specified software engineering areas. Some generalisation. Limitations of generalisation fully analysed. All case study types.

Continuous Experimentation
Rudimentary practice with synthetic scenarios. Focus on understanding basic steps such as identifying assumptions, creating hypotheses, and collecting data.
More advanced scenarios or limited real-life experiments. Focus on drawing conclusions from data and understanding limitations.

Simulation
Using ready-made simulation models and given data to explore topics through simulation.
Exploring the effect of changes in models using given data or how ready-made models behave with student-collected data. Some exploration with creating simulation models.

Individual Studies
Focus on finding and summarising existing research.
Focus on answering specific research problems by applying existing research and own data collection. No requirement of scientifically novel results.
structure both without and with empirical instruments, we can present a number of experiences and a comparison.

Formal Evaluation In Table 9, we present the comparison based on the formal course evaluations

Informal Evaluation Besides the formal, faculty-driven evaluation, we also performed two informal feedback rounds in the course instance in which we adopted the empirical instruments. We asked the students to write a one-minute paper that contained the following three questions, to be answered in a few words:

1 Note that in Table 9, smaller scores are better.

Table 10. Summarized evaluation of the one-minute papers (winter 2011/2012, TUM).

Positive Aspects
Structure of the topics and the class; combination of theory and practice; projects in teams (atmosphere); self-motivation due to presentations; continuous evaluation and determination of the final grades.

Negative Aspects
Tough schedule; tailoring of the tasks for the practical sessions was not always optimal; students signed off just because of the examination procedure.

Informal
"Thank you, this was the lecture I learned the most." "Super class, and I loved those many samples from practice." Table 10 shows the summarized results from the informal evaluation: The structure of the class, the were separated (each was located in a separate room), and each group was allowed to use only 685 one communication channel (e-mail and Skype respectively). After the task had been presented 686 to them, the groups were immediately separated to avoid any direct communication, and for each 687 2Since we informed the students about the "experimental" character of this special course in advance, the students did not complain, but welcomed the opportunity to give the feedback to improve their own class.
3 In the Star Trek franchise, the Kobayashi Maru is a leadership test with a no-win scenario; see https://en.wikipedia.org/wiki/Kobayashi_Maru.

group, a researcher monitored compliance with the experiment rules. As the students did not have the chance to initially find some agreements, the projects were failures-by-design. The students immediately started to work (they had only 90 minutes to develop working software), yet nobody came up with the idea of negotiating a communication protocol first. Therefore, after the deadline, no group could show any working software. In a closing feedback session, we revealed the nature of the experiment and discussed the observations.

Formal Evaluation In Table 11, we present the comparison based on the formal course evaluations conducted by the Faculty of Informatics. Although we have only one evaluated instance of this course, we use the same structure as in Table 9 to present the data. The evaluation shows this course to be on approximately the same level as the improved software process modelling course.

feedback rounds in the course. We asked the students to write a one-minute paper (see above). Since the outcomes are actually the same as already presented in Table 10, we only present the informal comments (third question) in Table 12.

The first thesis project investigated a software prototype game and applied usability and user experience evaluation methods to determine whether it fulfilled two sets of criteria: the entertainment value of the game, and the ability to tag photos as a side effect of playing the game. The game itself was implemented by a student team in cooperation with a company, and the thesis writer was part of the implementation team. In this thesis, the game constituted the case, and four sources of evidence were used: user interviews, in-game data collection, a questionnaire, and observations from a think-aloud playing session.
The thesis can be characterised as an intrinsic case study (Stake, 1995).

From the teaching perspective, we experienced that choosing a real-world example rather than an artificial toy example proved successful. For example, the experiment outcome from Kuhrmann et al. (2013a) was a fully implemented process, of which the process owner stated that he did not expect the student groups to create "such a comprehensive solution in this little time." Another goal ("let students experience the consequences of their decisions") was also achieved. For instance, in the course on software process modelling, while implementing the process in a workshop session, we could observe a certain learning curve. One team had a complete design, but selected an inappropriate modelling concept.

Later, the team had to refactor the implementation, which was an annoying and time-consuming task, but one that increased their awareness of the consequences of certain design decisions. Furthermore, students also experienced how difficult it is to transform informal information or tacit knowledge into process models.

The students could also see how difficult it is for individuals to formulate their behaviour in a rule-oriented

obtaining the cases, the students themselves learned to be self-directed in their work and gained significant domain knowledge. As thesis supervisors, we found that there was some additional effort in introducing case study methodology to students: methodology courses do not fully prepare students to actually carry out a study of their own, which is to be expected. However, being embedded in the project and receiving feedback from the project environment and its stakeholders meant that it was easy to convince students of the necessity of a structured approach. Once students were up to speed, the extra supervision effort was compensated by more autonomous work on the students' part.

The guideline presented in this paper has not been systematically tested in different learning environments.
Instead, it represents a starting point based on reflection grounded in teaching practice. We consider the limitations of the study in terms of qualitative criteria for validity (cf. Creswell, 2009).

Internal validity concerns the congruence between findings and reality. In this study, internal validity then concerns how credible the guidelines are in light of the realities of software engineering education.
As that reality is constantly changing, the match between guidelines and teaching can never be perfect.
Our study has applied triangulation to increase the internal validity of the results. We have utilised several types of teaching in different modes, in different universities, and with different teachers, to obtain a richer set of experiences from which to draw guidelines.

External validity refers to the extent to which findings can be applied to other situations. As our aim is not theory testing, external validity in this article is about enhancing, as far as possible, the transferability of the results. We argue that the guideline developed herein covers a wide range of teaching and learning situations, and thus can be applied widely in graduate and undergraduate education in software engineering. We have attempted to elucidate the limitations of applying the guideline by mapping study types differently.

There is a lack of guidance on how to use empirical studies in software engineering education. In order to address this gap, this paper provides an overview of different types of empirical studies, their suitability for use in education, and challenges with respect to their execution. We analysed our own teaching and the different studies that we applied as part of it, and reported on selected studies from existing literature.
Rather than having students conduct pure research, we opt for including different empirical instruments in software engineering courses as a means to stimulate learning.

The present paper provides an initial systematisation of empirical instruments from the educational perspective. We derived a set of purposes and challenges relevant for selecting a particular study type.
Furthermore, we also discussed validity constraints regarding the results of course-integrated studies.
Based on our experiences, we assigned the different purposes, challenges, and validity constraints to the different study types, and we provided further discussion on motivation and scheduling issues. We also defined a set of further study selection criteria to provide an initial guideline that helps teachers to select and include empirical studies in their courses. We believe the guideline could be used in a wide variety of settings. We note that the guideline is limited in that it considers a limited number of study types and learning outcomes, namely those that the authors have experience with as teaching aids and study purposes. It may not be suitable in situations where significantly different study types or learning outcomes are called for. Since, to the best of our knowledge, no comparable guidelines exist, we cordially invite teachers and researchers to discuss and improve on this proposal. In particular, future work could focus on applying the guideline in different kinds of software engineering courses and programs, both within academic university education and in industry training. The purposes, challenges, and constraints presented here could thus be further validated, refined, and perhaps extended.

Another particular consideration is how to perform student assessment when using empirical studies for educational purposes, particularly when group work is involved. What should be assessed, how should assessment be performed fairly when many students are involved, and how should, e.g., knowledge of empirical methods, domain knowledge, procedural knowledge, and the quality of outcomes be balanced in the assessment? We believe that the purposes and validity considerations in Table 2 and Table 4 could serve as a starting point for creating rubrics that are relevant for this type of teaching.

Finally, further studies are needed to test the effectiveness of courses using the proposed approaches.