Reimagining the machine learning life cycle to improve educational outcomes of students

Significance

Given the rapid proliferation of machine learning (ML) technologies in education, we investigate the extent to which the development of ML technologies supports holistic education principles and goals. We present findings from a cross-disciplinary interview study of education researchers, examining whether and how the stated or implied educational objectives of ML4Ed research papers are aligned with the ML problem formulation, objectives, and interpretation of results. Our findings shed light on two translational challenges: 1) integrating educational goals into the formulation of technical ML problems and 2) translating the predictions of ML methods into real-world interventions. We use these insights to propose an extended ML life cycle, which may also apply to the use of ML in other domains.

Thus, MOOC designers can "use data to figure out who needs to be contacted and when is the best time to contact," and intervene with automated "nudges" to maintain student engagement. For undergraduates at a large public university*, a targeted nudge and chatbot system was most effective at encouraging students to solve "specific problems that were very administrative in nature," such as "submitting the FAFSA, paying off your balance, registering for courses, and finding the parking office." However, according to P10, "it didn't have a big impact on motivating students to choose a major faster or making some of these bigger, [...] more significant decisions."

Another automated intervention is the intelligent tutoring system, in which ML models are applied to identify student errors and misconceptions. Two participants noted that these automated interventions have been successful in helping students learn specific structured tasks such as solving algebraic equations or correcting spelling and grammar. However, P11 emphasized that the success of this tool in correcting specific algebra errors "doesn't mean it's going to make a kid love math."

While these automated ML-driven interventions alone are not effective at achieving broader learning goals such as motivation and inspiration, they may still serve these ends by fitting into a learning ecosystem composed of teachers, peers, and advisors.

For example, the automated feedback that a student receives from an intelligent tutoring system can be part of a multi-pronged approach that includes self-study activities and references such as homework and textbooks, in addition to the aforementioned peer, teacher, and advisor support.

P11: "If they don't understand the feedback, or if the feedback doesn't seem to help them make progress, then they might go to the teacher or they might go to a peer as part of an ecosystem of resources that they can draw from [...] I can see it as part of a balanced breakfast."

One reason why intelligent tutoring systems need to be part of an ecosystem to go beyond encouraging specific behaviors is that broader learning goals are more complex and require more context than the algorithm is equipped to handle. Teachers are thus better equipped to contextualize the behaviors and feedback with outside materials and broader learning goals. The output of ML systems serves as one tool that teachers can use to support students.

As noted above, interventions focused on feedback and error correction also cannot replace a full teaching ecosystem because they do not address the underlying need for student inspiration and motivation. P11 points out that these intelligent tutoring systems can't "make kids love math" or address why some kids "don't want to do computer science." Perhaps this also points to a lack of attention in the machine learning community to the tasks of promoting student motivation and inspiration.

Under the current state of the art, the additional context and personal interactions from teachers and academic advisors are essential to filling in the learning ecosystem surrounding highly structured ML tasks.

While predictive models in automated tutoring systems can be successful as teaching tools for specific tasks, P11 warns that student outcomes may not improve if "instead of a teaching tool, it becomes the teacher." In this case, the usage of the automated tutoring system has overstepped the limited intervention scope that was intended to improve outcomes. P11 identifies two main reasons underfunded schools are more susceptible to this overstepping. First, in underfunded schools, administrators or parents might "think the teachers don't know what they're doing, or the teachers are overworked, which they are, and so way more is automated in those environments than should be" (P11). Second, higher-income schools often possess the resources to use predictive automated tutoring tools as "part of a balanced breakfast" in tandem with teachers, peers, and other human resources. Successful deployment in high-performing or well-resourced schools may lead to "schools in struggling districts [thinking] it's a silver bullet, and it's not used in that same way" (P11). Without ample teaching resources surrounding the automated tutoring tool, under-resourced schools end up applying the tutoring tool in place of other needed interventions, which does not lead to the same learning outcomes for the student.
In a similar vein to automated tutoring systems, automated grading algorithms can also be susceptible to misuse, with under-resourced classrooms being particularly vulnerable. Deploying grading algorithms for standardized testing can facilitate an extreme version of teaching to the test, according to P05. While human graders and rubrics for writing assessments on standardized tests are already formulaic to some extent, automated grading systems are susceptible to strategic gaming that is misaligned with learning goals. P11 comments that, in order to appease the automated grader, students may change aspects of their inputs "without understanding why, [...] except that this is what categorizes them as correct versus incorrect." As a specific example, P11 found the automated essay grader for the GRE to be particularly brittle in ways that an English teacher would otherwise likely catch. Without adequate teacher support, such usage of an automated grading tool may not help students develop independent writing skills beyond catering to a specific formula.

Since cost-cutting is a powerful driving force behind the adoption of ML4Ed technologies in schools, P05 and P11 worry that students from disadvantaged backgrounds will be disproportionately exposed to technologies that do not serve their learning goals.

There are several extensions to our methodology that could lead to further insights in education and beyond. Interviews with education researchers produce a unique set of discussions that is distinct from interviews with students or teachers.

Thus, further fieldwork with direct stakeholders would provide a different and valuable dimension of insight. It may also be illuminating to supplement the qualitative interview study with quantitative statistics from a larger set of applied ML papers.

For other domains such as healthcare and criminal justice, this methodology may be applied directly to examine ML research papers and perhaps uncover subtle tensions that were less prevalent in education. Finally, in the critical evaluation of research practices, another avenue to explore is the degree of cross-disciplinary collaboration.

Supplementary Materials and Methods
We include the following supplemental information describing our materials and methods, with file names and descriptions of external supplemental files.
1. Demographic and occupational data for interview study participants: Table S1 lists all participants' self-reported occupations, research areas, genders, and races/ethnicities.
2. Pre-interview survey (SI Dataset S1): This document contains a blank pre-interview survey that was filled out by all interview participants prior to the interview.

• Table S2 summarizes the education topic keywords for the 20 papers that were presented in the pre-interview survey as candidates for discussion in interviews.

• Table S3 summarizes the education topic keywords for the 9 papers that were actually discussed in interviews.

• Figure S1 lists all codes used, as well as the number of interviews in which each code appears.

SI Dataset S1 (pre_survey.pdf)
This document contains a blank pre-interview survey that was filled out by all interview participants prior to the interview.

SI Dataset S2 (list_papers.xlsx)
This spreadsheet provides a list of all papers presented to interview participants in the pre-interview survey, including title, year of publication, full abstract, conference, paper authors, and education topic keywords (labeled by the authors).

The spreadsheet also includes participant responses to indicators of interest and the participants who discussed each paper in an interview.

Table S3. Summary of education topic keywords among the 9 ML4Ed papers actually discussed in interviews.

This document contains a summary of participants' responses to selected pre-interview survey questions, including length of time in current position, experiences in the education sector, and the number of interview participants who marked each paper as