ABSTRACT
The Cancer Registry of Norway (CRN) collects information on cancer patients by receiving cancer messages from different medical entities (e.g., medical labs, hospitals) in Norway. Such messages are validated by an automated cancer registry system, GURI. Its correct operation is crucial since it lays the foundation for cancer research and provides critical cancer-related statistics to its stakeholders. Constructing a cyber-cyber digital twin (CCDT) for GURI can facilitate various experiments and advanced analyses of GURI's operational state without requiring intensive interactions with the real system. However, GURI constantly evolves due to novel medical diagnostics and treatments, technological advances, etc. Accordingly, the CCDT should evolve as well to stay synchronized with GURI. A key challenge in achieving such synchronization is that evolving the CCDT requires abundant data labelled by the new GURI. To tackle this challenge, we propose EvoCLINICAL, which treats the CCDT developed for the previous version of GURI as a pretrained model and fine-tunes it with a dataset labelled by querying the new GURI version. EvoCLINICAL employs a genetic algorithm to select an optimal subset of cancer messages from a candidate dataset and queries GURI with it. We evaluate EvoCLINICAL on three evolution processes. The precision, recall, and F1 score are all greater than 91%, demonstrating the effectiveness of EvoCLINICAL. Furthermore, we replace the active learning part of EvoCLINICAL with random selection to study the contribution of transfer learning to the overall performance of EvoCLINICAL. Results show that employing active learning in EvoCLINICAL consistently increases its performance.
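The subset-selection step can be pictured with a minimal sketch. The abstract does not state the fitness function, encoding, or genetic operators, so everything below is an assumption made for illustration: `scores` is a hypothetical per-message informativeness score, fitness is simply their sum over the chosen subset, and `select_subset` is a hypothetical helper name, not part of EvoCLINICAL.

```python
import random


def select_subset(scores, k, generations=50, pop_size=30,
                  mutation_rate=0.2, seed=0):
    """Pick k message indices with a simple genetic algorithm.

    Assumption: fitness is the sum of hypothetical per-message scores.
    The fitness actually used by EvoCLINICAL is not given in the abstract.
    """
    rng = random.Random(seed)
    n = len(scores)

    def fitness(subset):
        return sum(scores[i] for i in subset)

    def crossover(a, b):
        # The child draws k distinct indices from the union of both parents.
        pool = sorted(set(a) | set(b))
        return frozenset(rng.sample(pool, k))

    def mutate(subset):
        # With some probability, swap one selected index for an unselected one.
        if k == n or rng.random() >= mutation_rate:
            return subset
        removed = rng.choice(sorted(subset))
        added = rng.choice([i for i in range(n) if i not in subset])
        return (subset - {removed}) | {added}

    population = [frozenset(rng.sample(range(n), k)) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist truncation selection: the fitter half survives as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return sorted(max(population, key=fitness))


# Usage: choose 3 of 6 candidate messages to send to the new GURI version.
scores = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7]
chosen = select_subset(scores, k=3)
```

In this reading, the selected messages would be labelled by querying the new GURI version and then used to fine-tune the previous CCDT. Replacing `select_subset` with a uniform random draw of `k` indices corresponds to the random-selection baseline mentioned in the abstract.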
Index Terms
- EvoCLINICAL: Evolving Cyber-Cyber Digital Twin with Active Transfer Learning for Automated Cancer Registry System