Abstract
Institutions collect massive learning traces but they may not disclose it for privacy issues. Synthetic data generation opens new opportunities for research in education. In this paper we present a generative model for educational data that can preserve the privacy of participants, and an evaluation framework for comparing synthetic data generators. We show how naive pseudonymization can lead to re-identification threats and suggest techniques to guarantee privacy. We evaluate our method on existing massive educational open datasets.
J.-J. Vie and T. Rigaux—Equal contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Acs, G., Castelluccia, C., Chen, R.: Differentially private histogram publishing through lossy compression. In: 2012 IEEE 12th International Conference on Data Mining, pp. 1–10. IEEE (2012)
Berendt, B., Littlejohn, A., Blakemore, M.: AI in education: learner choice and fundamental rights. Learn. Media Technol. 45(3), 312–324 (2020)
Cablé, B., Guin, N., Lefevre, M.: An authoring tool for semi-automatic generation of self-assessment exercises. In: Lane, H.C., Yacef, K., Mostow, J., Pavlik, P. (eds.) AIED 2013. LNCS (LNAI), vol. 7926, pp. 679–682. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39112-5_87
Chen, R., Acs, G., Castelluccia, C.: Differentially private sequential data publication via variable-length N-grams. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 638–649 (2012)
Choffin, B., Popineau, F., Bourda, Y., Vie, J.J.: DAS3H: modeling student learning and forgetting for optimally scheduling distributed practice of skills. arXiv preprint arXiv:1905.06873 (2019)
De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3(1), 1–5 (2013)
Denis, P.: Probabilistic inference using generators: the statues algorithm. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2020. AISC, vol. 1229, pp. 133–154. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52246-9_10
Dorodchi, M., Al-Hossami, E., Benedict, A., Demeter, E.: Using synthetic data generators to promote open science in higher education learning analytics. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 4672–4675. IEEE (2019)
Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
Gervet, T., Koedinger, K., Schneider, J., Mitchell, T., et al.: When is deep learning the best approach to knowledge tracing? J. Educ. Data Min. 12(3), 31–54 (2020)
Heffernan, N.T., Heffernan, C.L.: The ASSISTments ecosystem: building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. Int. J. Artif. Intell. Educ. 24(4), 470–497 (2014)
Holmes, W., Iniesto, F., Sharples, M., Scanlon, E.: ETHICS in AIED: who cares? An EC-TEL workshop. In: EC-TEL 2019 Fourteenth European Conference on Technology Enhanced Learning (2019). https://oro.open.ac.uk/67263/
Jordon, J., et al.: Hide-and-seek privacy challenge. arXiv preprint arXiv:2007.12087 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lee, J., Clifton, C.: How much is enough? Choosing \(\varepsilon \) for differential privacy. In: Lai, X., Zhou, J., Li, H. (eds.) ISC 2011. LNCS, vol. 7001, pp. 325–340. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24861-0_22
Leinonen, J., Ihantola, P., Hellas, A.: Preventing keystroke based identification in open data sets. In: Proceedings of the Fourth ACM Conference on Learning@Scale, pp. 101–109 (2017)
Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: L-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. from Data (TKDD) 1(1), 3-es (2007)
Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 111–125. IEEE (2008)
Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399–410 (2016). https://doi.org/10.1109/DSAA.2016.49
Pavlik, P.I., Jr., Cen, H., Koedinger, K.R.: Performance factors analysis-a new alternative to knowledge tracing (2009, online submission)
Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Piech, C., et al.: Deep knowledge tracing. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Ping, H., Stoyanovich, J., Howe, B.: DataSynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–5 (2017)
Rasch, G.: On general laws and the meaning of measurement in psychology. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 4: Contributions to Biology and Problems of Medicine, pp. 321–333. University of California Press, Berkeley (1961). https://projecteuclid.org/euclid.bsmsp/1200512895
Rocher, L., Hendrickx, J.M., De Montjoye, Y.A.: Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10(1), 1–9 (2019)
Settles, B., Brust, C., Gustafson, E., Hagiwara, M., Madnani, N.: Second language acquisition modeling. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–65 (2018)
Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. IEEE (2017)
Van Lehn, K.: Two pseudo-students: applications of machine learning to formative evaluation. Technical report, Carnegie-Mellon University, Pittsburgh, PA, Department of Psychology (1990)
VanLehn, K., Ohlsson, S., Nason, R.: Applications of simulated students: an exploration. J. Artif. Intell. Educ. 5, 135 (1994)
Wilson, K.H., Karklin, Y., Han, B., Ekanadham, C.: Back to the basics: Bayesian extensions of IRT outperform neural networks for proficiency estimation. In: International Educational Data Mining Society. ERIC (2016)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 1–41 (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vie, JJ., Rigaux, T., Minn, S. (2022). Privacy-Preserving Synthetic Educational Data Generation. In: Hilliger, I., Muñoz-Merino, P.J., De Laet, T., Ortega-Arranz, A., Farrell, T. (eds) Educating for a New Future: Making Sense of Technology-Enhanced Learning Adoption. EC-TEL 2022. Lecture Notes in Computer Science, vol 13450. Springer, Cham. https://doi.org/10.1007/978-3-031-16290-9_29
Download citation
DOI: https://doi.org/10.1007/978-3-031-16290-9_29
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16289-3
Online ISBN: 978-3-031-16290-9
eBook Packages: Computer ScienceComputer Science (R0)