De-Identification of Student Writing in Technologically Mediated Educational Settings

  • Conference paper
Polyphonic Construction of Smart Learning Ecosystems (SLERD 2022)

Abstract

When conducting research with data from smart learning systems, researchers must protect user identities: the release of personally identifiable information (PII) poses a significant risk to participants and creates a barrier to analyzing data and creating open datasets. Massive open online courses (MOOCs) are a good example of learning systems where PII concerns may hamper data analysis, user well-being, and system innovation. PII is particularly hard to locate and clean in unstructured data because of variations in formatting, texts, and assignments. In particular, identifying and removing students’ names has proven difficult. This study examines the potential of large, pre-trained language models to de-identify MOOC data and compares the performance of these models with human annotations. On a validation set, a pre-trained language model fine-tuned using spaCy default hyperparameters achieved 97% recall of student names, including partial matches, and 30% precision. On a larger, unseen test set (n = 3,077), the model achieved 93% recall and 24% precision. The majority of the false positives leading to lower precision in the test set were known names belonging to authors and/or lecturers. The results of the ensemble approach used here show considerable promise for a difficult de-identification task and indicate that automated de-identification is likely mature enough for use on some education datasets. Clearing PII from smart learning systems would protect learners within those systems and allow the release of large datasets that could be analyzed for insights that advance innovation in smart learning systems.
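The abstract reports recall and precision with partial matches counted, but does not describe the scoring procedure. The following is a minimal sketch of one plausible reading, assuming that names are represented as character-offset spans and that any overlap between a predicted span and an annotated span counts as a match. The function names and the toy spans are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def partial_match_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Return (precision, recall), counting any overlap as a match.

    Assumption: this overlap rule is an illustrative stand-in for the
    "partial matches" mentioned in the abstract, not the authors' scorer.
    """
    # Predicted spans that touch at least one annotated name.
    matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    # Annotated names that are touched by at least one prediction.
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): two annotated names, four predicted spans.
gold = [(0, 10), (40, 52)]
predicted = [(0, 4), (20, 30), (60, 70), (80, 85)]

precision, recall = partial_match_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, a model tuned for de-identification can accept many false positives (low precision) as long as very few student names are missed (high recall), which matches the trade-off the abstract reports.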

Author information

Corresponding author

Correspondence to Scott Crossley.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Holmes, L., Crossley, S., Hayes, N., Kuehl, D., Trumbore, A., Gutu-Robu, G. (2023). De-Identification of Student Writing in Technologically Mediated Educational Settings. In: Dascalu, M., Marti, P., Pozzi, F. (eds) Polyphonic Construction of Smart Learning Ecosystems. SLERD 2022. Smart Innovation, Systems and Technologies, vol 908. Springer, Singapore. https://doi.org/10.1007/978-981-19-5240-1_12
