Abstract
Advances in data science have given us affordable storage and efficient algorithms for querying large volumes of data. Health records make up a significant part of this data; they are pivotal for healthcare providers and can be used to improve our well-being. Clinical notes in electronic health records are one such category: written as free text, they capture a patient’s complete medical information at different stages of care. These unstructured notes therefore record events from a patient’s admission to discharge, which can prove significant for future medical decisions. However, because the notes also contain sensitive information about the patient and the attending medical professionals, they cannot be shared publicly. This privacy issue has hindered timely discoveries from this wealth of untapped information. In this work, we therefore aim to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and to analyze their utility rigorously, using different metrics and at different levels. Experimental results support the applicability of our generated data: it achieves more than \(80\%\) accuracy in different practical classification problems and matches (or outperforms) the original text data.
Index Terms
- Differentially Private Medical Texts Generation Using Generative Neural Networks