De-Identification of Student Writing in Technologically Mediated Educational Settings

Holmes, Langdon; Crossley, Scott; Hayes, Nick; Kuehl, Dylan; Trumbore, Anne; Gutu-Robu, Gabriel

doi:10.1007/978-981-19-5240-1_12

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 908)

Included in the following conference series:

Conference on Smart Learning Ecosystems and Regional Development

Abstract

When conducting research with data from smart learning systems, user identities must be protected: the release of personally identifiable information (PII) poses a significant risk to participants and is a barrier to analyzing data and creating open datasets. Massive open online courses (MOOCs) are a good example of learning systems where PII concerns may hamper data analysis, the well-being of users, and system innovation. PII is particularly hard to locate and clean because of the variations in formatting, texts, and assignments found in unstructured data. In particular, identifying and removing students’ names has proven difficult. This study examines the potential of large, pre-trained language models to de-identify MOOC data and compares the performance of these models with human annotations. On a validation set, a pre-trained language model fine-tuned using spaCy default hyperparameters achieved 97% recall of student names, including partial matches, and 30% precision. On a larger, unseen test set (n = 3,077), the model achieved 93% recall and 24% precision. The majority of the false positives, which lowered precision in the test set, were known names belonging to authors and/or lecturers. The results of the ensemble approach used here show considerable promise for a difficult de-identification task and indicate that automated de-identification is likely mature enough for use on some education datasets. Clearing PII from smart learning systems would protect learners within those systems and allow the release of large datasets that could be analyzed for insights that advance innovation in smart learning systems.
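The recall and precision figures above count partial matches as hits. The abstract does not state the exact matching rule, so the following is only a minimal sketch that assumes a predicted name span counts as a match whenever it overlaps a human-annotated span by at least one character. The function names and the toy character offsets are hypothetical and are not taken from the chapter.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def partial_match_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Return (precision, recall), treating any overlap as a match (an assumption)."""
    # A gold (human-annotated) name is recalled if any prediction overlaps it.
    recalled = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    # A prediction is a true positive if it overlaps any gold name.
    correct = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = recalled / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: two annotated student names and four model predictions,
# two of which (for example, a lecturer's name) match no annotated student name.
gold_spans = [(0, 10), (40, 52)]
predicted_spans = [(0, 5), (20, 30), (40, 52), (60, 70)]

precision, recall = partial_match_scores(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every annotated name is at least partly covered, so recall is perfect, while the two spurious predictions halve precision. This mirrors the pattern reported in the abstract, where high recall is paired with low precision because false positives are tolerated more readily than missed student names.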

Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Langdon Holmes, Scott Crossley, Nick Hayes & Dylan Kuehl
University of Virginia, Charlottesville, VA, USA
Anne Trumbore
University Politehnica of Bucharest, Bucharest, Romania
Gabriel Gutu-Robu

Corresponding author

Correspondence to Scott Crossley.

Editor information

Editors and Affiliations

Department of Computer Science, University Politehnica of Bucharest, Bucharest, Romania
Mihai Dascalu
Dipartimento di Scienze Sociali, Politiche e Cognitive, University of Siena, Siena, Italy
Patrizia Marti
Istituto Tecnologie Didattiche, Consiglio Nazionale delle Ricerche, Genova, Italy
Francesca Pozzi

About this paper

Cite this paper

Holmes, L., Crossley, S., Hayes, N., Kuehl, D., Trumbore, A., Gutu-Robu, G. (2023). De-Identification of Student Writing in Technologically Mediated Educational Settings. In: Dascalu, M., Marti, P., Pozzi, F. (eds) Polyphonic Construction of Smart Learning Ecosystems. SLERD 2022. Smart Innovation, Systems and Technologies, vol 908. Springer, Singapore. https://doi.org/10.1007/978-981-19-5240-1_12

DOI: https://doi.org/10.1007/978-981-19-5240-1_12
Published: 28 September 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5239-5
Online ISBN: 978-981-19-5240-1
eBook Packages: Intelligent Technologies and Robotics (R0)
