Outcomes assessment pitfalls: challenges to quantifying knowledge gain in a sex education game [version 1; peer review: 1 not approved]

Background: We describe challenges associated with incorporating knowledge assessment into an educational game on a sensitive topic and discuss possible reasons for, and solutions to, these challenges.
Methods: The My Future Family Game (MFF) is a tool for collecting data about family planning intentions. The game was expanded to include information about human anatomy and sexual reproduction. To assess the efficacy of the game as a tool for teaching sexual education, we designed a pre-post study with assessments before and after the game, which was deployed in three schools in and around Chennai, India, in the summer of 2018.
Results: The pre-post process did not effectively assess knowledge gain and made the game less enjoyable. Although all participants completed the pre-test because it was required to access the main game, many did not complete the post-test. As a result, the post-test scores are of limited use in assessing the efficacy of the intervention as an educational tool. This deployment demonstrated that pre-post testing has to be integrated in a way that motivates players to improve their scores in the post-test. The pre-test results did provide useful information about players' knowledge of human anatomy and mechanisms of human reproduction prior to gameplay and validated the tool as a means of data collection.
Conclusion: Adding outcomes assessment required asking players questions about sexual anatomy and function with little or no introduction. This process undermined elements of the initial game design and made the process less enjoyable for participants. Understanding these failures has been a vital step in the process of iterative game design. Modifications were made to the pre-post test process for future deployments so that the process of assessment does not diminish enthusiasm for gameplay or enjoyment.


Introduction
The acceptance of games as useful and effective tools for collecting data, educating players, and achieving positive behavior change is growing due to an increase in rigor in the deployment and assessment of applied games (Coovert et al., 2017; Zammitto, 2009). Embedding outcomes assessment within the game itself is often described as an important design principle in building games, largely because most games incorporate some form of player feedback and metrics as part of gameplay (Ifenthaler et al., 2012; Van Staalduinen & de Freitas, 2011). There are situations, however, in which such assessment is quite difficult.
The My Future Family (MFF) game was initially developed as a tool for collecting information about family planning intentions among adolescents in Mysore, India in 2017. The original idea was to gather information about desired family size and spacing, influencers of the decision-making process, and other data points. Focus group participant feedback during early-stage planning was crucial to the success of the project. Researchers determined that although sex education is included in the standard curriculum for adolescents, many young people do not have basic knowledge about human reproduction (Bertozzi et al., 2018). Including this information in the game would strongly motivate adolescents to play, and was supported by parents and educators as a way of communicating sensitive information.
The first beta of the game was successfully tested on 480 adolescents in summer of 2017 and proved to be a very effective tool for gathering information from a population about which little accurate information is available from other sources (Bertozzi et al., 2018).
Discussion of human anatomy and behavior regarding sex and reproduction is problematic in India (Ismail et al., 2015). Many adolescents receive very little information from their parents or teachers due to cultural taboos (Khubchandani et al., 2014). In designing the MFF game, we were very careful to introduce explicit material slowly and through a process in which it was revealed in context. The game was constructed so that at each point where explicit material was available, the player was asked whether or not they wanted to see it. If they agreed, the material was presented in a context that made sense based on the information being gathered.
For example, when players were asked when they planned to start dating a possible partner, they were provided with information about the anatomy of the opposite sex (see footnote ii). When they were asked about the age at which they planned to marry, after consenting (Figure 1), they were given information about how intercourse works via the animation in Figure 2.
Post-game questionnaires and interviews demonstrated that the game was well-accepted by student players. Analysis of the pilot deployment suggested that the game could function not only as a method of data collection about family planning intentions, but also as a means of assessing preexisting knowledge of and educating players about sex and reproduction (Bertozzi et al., 2018). This created a challenge for the development team because in order to see if the game was effective at teaching adolescents about sex, we needed to know how much they knew about it before they played the game. Given cultural taboos around discussions of sexuality, it was difficult to do so without undermining some of the care that had gone into introducing the topic in the game. The process of designing and deploying a knowledge assessment with the MFF game encountered several pitfalls.

School selection and gameplay protocol
The second deployment of the MFF game was in Chennai, India as part of research conducted by Dr. Swathi Padankatti and her team from the International Alliance for the Prevention of AIDS (IAPA) in collaboration with the U.S.-based game development team (Dr. Bertozzi's group at Quinnipiac University) and Dr. Aparna Sridhar at UCLA's School of Medicine. Dr. Padankatti and her team identified three schools willing to participate in the study that could provide a total of 419 student players. Schools were selected based on the research team's pre-existing relationships with administrators, with whom they had previously worked on AIDS education initiatives.
Footnote ii: Due to cultural taboos in India, which would have made it impossible to deploy the game at all, same-sex marriage was not an option in the game.
For game deployment, a set of 30 Android tablets and headsets was set up in a school classroom, and groups of students in the target age group successively cycled through to play the game and discuss their experience. Groups were not segregated by sex. To ensure comfort and privacy, students were able to move freely around the room to find their preferred space to sit and play. Upon beginning the game, players were first asked to indicate their sex and age, after which the pre-game quiz was triggered prior to initiating the main game. The post-game quiz, with exactly the same structure and questions as the pre-game quiz, appeared after the game was completed. Given that this was the first field deployment of the revised game, the deployment team reported issues to the U.S. development team after each play session. The issues were collected and organized into topics to be addressed in future revisions of the game.
In addition to collecting data about family planning intentions, the research and development teams had two additional goals: to overcome challenges identified during the first deployment, and to assess knowledge gain via a pre-post testing framework.
Original and modified versions of the MFF game are available here: https://osf.io/gtfu5/wiki/home/ (Bertozzi-Villa et al., 2020). The APKs can be installed on any Android tablet or phone.

Challenges from first deployment
Following the initial deployment of the game, issues with the original study protocol were identified and addressed for this second deployment.
During the early stages of pilot deployment, teachers stayed in the room during gameplay. They often gave stern instructions on how to behave and ordered students to follow the instructions of the researchers. We realized that this made it impossible for students to experience playing the game as play. With their teachers present, it felt more like a test that they were required to take. To encourage a sense of play, the research protocol was modified early in the first deployment to ask instructors to leave the room during gameplay. Additionally, language was added to the introductory scripts encouraging students to play the MFF game as a game: they should only do the parts of it that they wanted to, and could stop playing at any time. This protocol was extended into the second deployment.
A usability issue encountered during the first deployment was that players lacked familiarity with the drag and drop interface commonly used on smartphones and tablets. The design team determined that the pre-test was the perfect opportunity to teach players how to use drag and drop so that they would be prepared for it when they reached the game.
The third main issue identified in the pilot involved the post-game questionnaires. These were paper forms filled out by students after playing the game, asking students to qualitatively self-assess knowledge gain and provide feedback on the process of gameplay. While these questionnaires provided valuable feedback and indicated high rates of self-assessed knowledge gain, they were not an efficient data-collection strategy. Because forms were filled out on paper, response rates were low and it was not possible to link student feedback to specific players. Positive self-assessment of knowledge gain was encouraging, but it is not a rigorous method for determining game efficacy. The absence of an evaluative framework for the game was the primary motivation for development of the pre-post testing process.

Development of pre-post assessment
The key educational content of each milestone of the game is outlined in Table 1.
To assess knowledge of this content while training students on a drag-and-drop interface, the pre- and post-game tests were designed to show male and female figures in outline, with internal organs visible. A series of 14 anatomy questions covering the full scope of in-game content was presented in a sequence of views. Players answered each question by dragging a word representing a concept (usually with an animation to help explain it) to the correct location on an image (Figure 3). The structure and content of the pre- and post-game tests are identical.
The assessment was designed to correlate with the way information is delivered in the different milestones in the game. In the first deployment we noted that players had a difficult time understanding where different organs were located in the body and what their functions were. In the assessments, we were careful to depict both male and female bodies as a whole at the start of the assessment. The view then zooms in to just the abdomens of the male and female bodies. We added the whole person views in the top right and left of the screen so that players could understand which view of the body was presented to them. It is very difficult to understand how organs are laid out relative to other organs. For example, in the female body, it can be difficult to show the positions of the three apertures of the urethra, vagina, and rectum relative to one another. The additional views were added to minimize this confusion.
Our hope was that the layout of the assessment prior to play would prepare players to approach the anatomy section of the game where they have to drag and drop each body part to its correct location (Figure 4). During the first deployment of the game, it was clear that some players did not understand the difference between the front and side views of the anatomical drawings. In addition to adding the side views in the upper right and left corners of the assessment, we also incorporated them into the minigame. These views update as each organ is dragged into the correct location in the front view.

Analysis
During gameplay, tablets kept timestamped records of every user input. Data on pre- and post-test responses were saved in .csv format for statistical analysis. These datasets include information on the tablet used, the school in which the game was deployed, the self-reported gender of the player, and a unique user ID for each run-through of the game. The pre/post data contain no other personalized student data.
Overall pre- and post-test scores, as well as the percent of students who responded correctly to each question, were calculated from individual responses. On the post-test, players who responded "not sure" to every question were logged as having a "null" post-test. Score differences between groups were assessed via two-sample t-tests, and pre- to post-test score changes were assessed via one-sample t-tests.
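The scoring steps described above can be sketched as follows. This is a minimal Python illustration and not the authors' analysis code (which was written in R). The question identifiers, the answer key, the "Not Sure" label, and the choice to treat a missing response as "not sure" are all assumptions made for the example, as is the use of Welch's form of the two-sample t statistic, since the text does not say which variant was used.

```python
from math import sqrt
from statistics import mean, stdev

NOT_SURE = "Not Sure"  # assumed label for the "not sure" response tile


def percent_correct(responses, answer_key):
    """Percent of questions answered correctly (0-100)."""
    correct = sum(1 for q, a in answer_key.items() if responses.get(q) == a)
    return 100.0 * correct / len(answer_key)


def is_null_post_test(responses, answer_key):
    """True when every question is unanswered or answered "not sure".

    Treating a missing response as "not sure" is an assumption of this sketch.
    """
    return all(responses.get(q, NOT_SURE) == NOT_SURE for q in answer_key)


def one_sample_t(diffs, mu=0.0):
    """t statistic and degrees of freedom for H0: mean(diffs) == mu."""
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)
    return (mean(diffs) - mu) / se, n - 1


def welch_two_sample_t(a, b):
    """Welch t statistic for the difference in means of two groups."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)


# Hypothetical three-question answer key and one player's responses.
answer_key = {"q1": "Penis", "q2": "Uterus", "q3": "Ovaries"}
pre = {"q1": "Penis", "q2": NOT_SURE, "q3": "Vagina"}
post = {"q1": NOT_SURE, "q2": NOT_SURE, "q3": NOT_SURE}

pre_score = percent_correct(pre, answer_key)
post_is_null = is_null_post_test(post, answer_key)
```

In this sketch, a player flagged by `is_null_post_test` would be excluded before pre-to-post differences are passed to `one_sample_t`, mirroring the exclusion of "null" post-tests described above.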

Ethics and consent
The study design was approved by the Institutional Review Board of the Sundaram Medical Foundation, Dr. Rangarajan Memorial Hospital, Chennai, India (IRB # IEC-09/1/2018).
Informed verbal consent was obtained from the principals of the three participating schools following consultation and a gameplay demonstration with each one. Consent was not obtained from student participants. The board deemed oral consent would suffice for the principals, and as the game covers topics which are part of the curriculum, participants' consent was not needed.

Results
The goal of this analysis was to test whether embedding the game within a pre/post assessment would accurately measure how much players had learned over the course of the game. The assessment did not work as planned, for reasons outlined in detail below. We did determine that the pre-test is a useful way to gather information about student knowledge of male and female anatomy and some sexual functions. The results show that the assessment tool can identify which schools perform better and which specific topics are better understood.

Pitfall one - Pre-test made the game feel like a test
As described in the methods, considerable care was taken in the game design to encourage a sense of play and remove the pressure associated with an examination. By introducing a pre-test, however, we recreated the circumstances under which the experience of play was potentially undermined. Students were invited into the room to play a game. However, after they were welcomed to the game, they were presented with an assessment. The deployment team reported that some students were concerned that they did not have the "right" answer and wanted to be able to go back and correct their previous answers during the pre-test. Given that the Indian system of education relies heavily on test scores and strongly rewards those who test well, these students appeared very motivated to "do well" as soon as they realized it was an assessment.
To counter this, the researchers repeatedly told students that they should just answer what they knew and then go on to the game, but the pre-test clearly affected the experience. We learned that in future deployments, we need to add more context and less pressure to the pre-test to ensure players understand that they will not be criticized or penalized for not knowing the answers.

Pitfall two - Language issues with terms for sexual parts and functions
To be as accessible as possible to players at any reading comprehension level, the game includes as little text as possible and communicates most information through graphics, audio and animation. This is especially important when discussing information about sexuality because these terms may not be familiar to students. However, the inclusion of the pre-post assessment introduced a great deal of technical vocabulary in English before the gameplay began. All of the schools included in the study had instruction in English, but it was unclear if terms like testicles, ovaries, urine and feces were well understood by players (Figure 5). Although education about sexual functions is technically part of the educational curriculum for all students in India, the content is not actually taught in many schools due to cultural reluctance to discuss sexuality. In the results that follow, it is possible that some of the variation in scores on the pre-test is related to differences in knowledge of terms rather than differences in knowledge of sexual/anatomical functions.

Pitfall three - Low participation in the post-test
While the transfer of the post-test questionnaire into a digital framework did allow for personalized tracking of results, there were challenges to collecting post-game information.
Qualitative feedback from the deployment team indicated that when students came to the end of the game and saw the same screen they had seen earlier for the pre-test, many simply dragged the tiles to "Not Sure" because it was the fastest way to get to the final screen. Others simply put down the tablet, which meant that researchers had to exit the player from that game session (with no responses to the post-test questions) to reset the tablet for the next group of students. Because we do not know exactly what happened in all the cases where there appear to be random answers to the post-test, we cannot determine how many students actually answered the questions intentionally. We failed to provide players with a compelling reason to want to engage in the final assessment, which will be corrected in the next deployment.

Assessment of pre-test scores
A total of 419 students in three schools completed the pre-test and main game. The schools were selected based on scheduling availability and willingness to participate. The researchers from the IAPA had previously worked with these schools on AIDS education initiatives. Across all schools, the mean pre-test score was 33.5% (SE 1.15%), with substantial variation between schools. In particular, students at School 2 (who had sexual education as a formal part of their curriculum) performed significantly better than students at Schools 1 or 3 (two-tailed t-test, p<0.001). Pre-test scores were not significantly different between male and female students at any school (Figure 6).
Across all schools, students scored slightly better on pre-test questions relating to the anatomy of their own sex compared to the opposite sex, but this effect was not statistically significant (Figure 7). For female anatomy questions, 34.4% (SE 1.86%) of female respondents answered correctly, compared to 30.6% (SE 1.6%) of male respondents (two-tailed t-test p=0.13). For male anatomy questions, 33.3% (SE 1.89%) of female respondents answered correctly, compared to 36.3% (SE 1.8%) of male respondents (two-tailed t-test p=0.25).
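As a rough consistency check on the comparisons above, the reported means and standard errors can be combined into an approximate two-sided p-value. The sketch below is an illustration in Python, not the authors' R analysis. It assumes the two groups are independent and substitutes a normal approximation for the t distribution, which is reasonable only because the groups are large. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_from_summary(mean_a, se_a, mean_b, se_b):
    """Approximate two-sided p-value for a difference of two means.

    Combines the two standard errors in quadrature and uses a normal
    approximation in place of the t distribution (an assumption of
    this sketch, suitable only for large samples).
    """
    z = (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)
    return z, erfc(abs(z) / sqrt(2))


# Percent correct and standard errors as reported in the text.
z_f, p_f = two_sided_p_from_summary(34.4, 1.86, 30.6, 1.6)  # female anatomy
z_m, p_m = two_sided_p_from_summary(33.3, 1.89, 36.3, 1.8)  # male anatomy

print(f"female anatomy questions: z = {z_f:.2f}, approximate p = {p_f:.2f}")
print(f"male anatomy questions:   z = {z_m:.2f}, approximate p = {p_m:.2f}")
```

Under these assumptions the approximate p-values come out near 0.12 and 0.25, in line with the reported values of 0.13 and 0.25 once rounding of the inputs and the normal approximation are taken into account.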
As shown in Figure 8, the only questions for which a majority of responses were correct were "Where is urine excreted from a male?" (53.0% correct) and "Where does a lining build up to prepare for pregnancy?" (50.6% correct). For eight of the remaining 12 questions, the correct answer received a plurality of responses, but not a majority. The four questions for which the most frequent response was not the correct answer were "Where sperm exit the body?" (plurality answer "Not Sure", 29.8%), "Where menstrual blood is excreted?" (plurality answer "Not Sure", 27.7%), "The organ that becomes erect before intercourse?" (plurality answer "Vagina", 33.7%), and "Where urine is excreted from a female?" (plurality answer "Vagina", 35.6%).
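The distinction drawn above between a majority and a plurality of correct responses can be made concrete with a short sketch. The function name and the response counts below are hypothetical and chosen only for illustration. They are not taken from the study data.

```python
from collections import Counter


def classify_question(responses, correct_answer):
    """Label a question by how the correct answer ranks among all responses.

    Returns (label, share_correct). A tie for most frequent answer that
    includes the correct answer is labeled "plurality correct" here,
    which is a choice made for this sketch.
    """
    counts = Counter(responses)
    total = len(responses)
    top_answer, top_count = counts.most_common(1)[0]
    share_correct = counts[correct_answer] / total
    if share_correct > 0.5:
        label = "majority correct"
    elif counts[correct_answer] == top_count:
        label = "plurality correct"
    else:
        label = f"plurality incorrect ({top_answer})"
    return label, share_correct


# Hypothetical responses to one question from 100 players.
responses = ["Urethra"] * 30 + ["Vagina"] * 36 + ["Not Sure"] * 34
label, share = classify_question(responses, "Urethra")
print(label, f"{100 * share:.1f}% correct")
```

With these made-up counts, the correct answer is neither the majority nor the plurality response, which is the pattern reported for the four questions listed above.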

Pre-post test assessment
As described above, assessment of knowledge gain was complicated by the large number of students who did not complete the post-test or who rushed through it, answering "not sure" to all questions (173 students, 41.3% of total). We refer to this group as having a "null" post-test. While it is not possible to assess knowledge gain among those with a null post-test, among the 246 (58.7%) students who did attempt the post-test we find on average a 6.27-point score gain between pre- and post-tests (95% CI 3.8-8.75, one-sample t-test, Figure 9).
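The participation figures and confidence interval above can be checked with simple arithmetic. The Python sketch below uses only numbers reported in the text, apart from the critical value of about 1.97 for a t distribution with 245 degrees of freedom, which is an approximation introduced here to back out a standard error. That derived standard error is not a value reported by the authors.

```python
total_students = 419
null_post = 173
attempted_post = total_students - null_post

pct_null = 100 * null_post / total_students
pct_attempted = 100 * attempted_post / total_students

# Reported 95% confidence interval for the mean score gain.
ci_low, ci_high = 3.8, 8.75
midpoint = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# Assumed approximate critical value for df = attempted_post - 1 = 245.
T_CRIT_APPROX = 1.97
approx_se = half_width / T_CRIT_APPROX

print(f"null post-tests: {pct_null:.1f}%, attempted: {pct_attempted:.1f}%")
print(f"CI midpoint: {midpoint:.3f}, implied SE of about {approx_se:.2f}")
```

The interval is symmetric about the reported 6.27-point gain to within rounding, and the implied standard error of roughly 1.3 points should be read as an estimate under the stated assumption only.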
A question-by-question breakdown of pre- vs post-test results among those who attempted the post-test shows the largest knowledge gain around topics of intercourse, egg storage, and sperm movement (Figure 10).

Discussion
Our initial response to the results was dismay. It appeared that the game was not a useful tool for teaching players because overall there was very little change between pre- and post-test results. Discussions with the deployment teams and more detailed analysis of the results produced a more nuanced understanding of what happened. When the data for players who did complete both the pre- and post-tests were analyzed separately, there was a modest but notable increase in knowledge. Additionally, we determined that the pre-test was a useful tool for assessing prior knowledge.
We learned a great deal about the difficulty of creating effective pre-post assessments for a game that includes sensitive topics. Adolescents offered a game of this type are already nervous and excited about it. The process of setting up a context in which their current knowledge is assessed needs to be approached carefully. We encountered several pitfalls that complicated the assessment process and affected the validity of the assessment data. We are able to conclude that using a game to assess current knowledge of reproductive anatomy and processes can be very effective. In order to assess knowledge gain after gameplay, students need to be motivated to fully engage in the post-test assessment. For future deployments of the game, we plan to change the deployment protocol to address the issues discussed in this report and better integrate the pre-post testing process into the overall experience.
It is standard practice in applied game development to seamlessly integrate assessment into the existing structure of the game (Klopfer et al., 2018; Serrano-Laguna et al., 2018). As we have shown, this is difficult in a game that deals with a sensitive topic. Our plan going forward is to address this challenge openly in the introduction to the game experience. After players open the tablet, an animated character will appear and explain that what follows is a game about sexuality and that this is a difficult topic for many people to talk about. After normalizing the idea of embarrassment, the character will then introduce the idea that knowledge is power and that the game will help players learn about things that are important to their future. Then the pre-post test will be presented as a challenge: "Let's see how much you know now, and then see if, after you play the game, you know all the answers to things you didn't know before." Hopefully, with this context, we will avoid the pitfalls of our Chennai deployment.

Conclusion
This deployment demonstrated that a game-based tool can be an effective means of gathering information. We learned that many adolescents in these schools lack basic knowledge of human anatomy and sexuality. This is especially notable given that the students chosen had already received baseline training in HIV prevention and are likely better informed than other students. The deployment also provided us with important information for improving the tool.

Data availability
Underlying data
Open Science Framework: Outcomes Assessment Pitfalls: Challenges to Quantifying Knowledge Gain in a Sex Education Game. https://doi.org/10.17605/OSF.IO/WMHCD (Bertozzi-Villa et al., 2020).
This project contains the following underlying data:
prepost.csv (Questions and responses to all pre- and post-tests administered, along with timestamps and other metadata)

Extended data
Pre- and post-test data were analyzed and visualized using R version 3.6.0. All code is available from GitHub (https://github.com/bertozzivill/india-family-planning) and archived with Zenodo (http://doi.org/10.5281/zenodo.3822455 (Bertozzi-Villa, 2020)). Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Software availability
An installable and playable version of the game and all data used for analysis is publicly available at Open Science Framework, as described below.