Abstract
Research with data from smart learning systems must protect user identities: releasing personally identifiable information (PII) poses a significant risk to participants and is a barrier to analyzing data and creating open datasets. Massive open online courses (MOOCs) are a good example of learning systems where PII concerns may hamper data analysis, the well-being of users, and system innovation. PII is particularly hard to locate and clean in unstructured data because of variations in formatting, texts, and assignments. In particular, identifying and removing students’ names has proven difficult. This study examines the potential of large, pre-trained language models to de-identify MOOC data and compares the performance of these models against human annotations. On a validation set, a pre-trained language model fine-tuned using spaCy default hyperparameters achieved 97% recall of student names, including partial matches, and 30% precision. On a larger, unseen test set (n = 3,077), the model achieved 93% recall and 24% precision. The majority of the false positives leading to lower precision in the test set were known names belonging to authors and/or lecturers. The results of the ensemble approach used here show considerable promise for a difficult de-identification task and indicate that automated de-identification is likely mature enough for use on some education datasets. Clearing PII from smart learning systems would protect learners within those systems and allow the release of large datasets that could be analyzed for insights to advance innovation within smart learning systems.
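The recall and precision figures above count partial matches as hits. The abstract does not specify the scoring procedure, so the following is only a minimal sketch of one plausible reading: a gold name span counts as found if any predicted span overlaps it, and a predicted span counts as correct if it overlaps any gold span. The function names, the character-offset representation, and the toy spans are illustrative assumptions, not the authors' code or data.

```python
from typing import List, Tuple

# Assumed representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def partial_match_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Compute (recall, precision), treating any overlap as a match."""
    # Gold spans that at least one prediction touches.
    found = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    # Predicted spans that touch at least one gold span.
    correct = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    recall = found / len(gold) if gold else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


# Toy example (made-up offsets): two annotated names, four model predictions.
gold = [(0, 10), (25, 30)]
predicted = [(0, 5), (12, 18), (40, 47), (50, 55)]
recall, precision = partial_match_scores(gold, predicted)
```

Under this reading, a model that flags many non-student names (for example, authors or lecturers) keeps high recall while precision drops, which matches the pattern the abstract reports.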
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Holmes, L., Crossley, S., Hayes, N., Kuehl, D., Trumbore, A., Gutu-Robu, G. (2023). De-Identification of Student Writing in Technologically Mediated Educational Settings. In: Dascalu, M., Marti, P., Pozzi, F. (eds) Polyphonic Construction of Smart Learning Ecosystems. SLERD 2022. Smart Innovation, Systems and Technologies, vol 908. Springer, Singapore. https://doi.org/10.1007/978-981-19-5240-1_12
DOI: https://doi.org/10.1007/978-981-19-5240-1_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5239-5
Online ISBN: 978-981-19-5240-1
eBook Packages: Intelligent Technologies and Robotics (R0)