ABSTRACT
Biomedical text mining is becoming increasingly important. Recently, biomedical pre-trained language models such as BioBERT and SciBERT, which capture biomedical knowledge from text, have achieved promising results on biomedical NLP tasks. However, most biomedical pre-trained language models rely on the traditional masked language model (MLM) pre-training strategy, which cannot fully capture the semantic relations in context. Learning biomedical knowledge with language models is especially challenging in the Chinese biomedical domain, owing to the scarcity of training resources and the complexity and diversity of Chinese medical terminology. To this end, we propose MedBERT-adv, which uses a biomedical knowledge infusion method that effectively complements BERT-like models. Instead of relying on time-consuming annotation by medical experts or inaccurate automatic annotation, we use the article structure of Baidu Encyclopedia as a weakly supervised signal, taking each medical term and its category as labels for pre-training. We further apply adversarial training strategies such as FGM when fine-tuning on downstream tasks to improve the performance of MedBERT-adv. We evaluate MedBERT-adv on eight NLP tasks from the Chinese biomedical benchmark CBLUE, where it improves the average score by 1.8% over four baseline models, demonstrating the effectiveness of MedBERT-adv for Chinese biomedical text mining.
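To make the FGM fine-tuning strategy concrete, the sketch below shows the standard Fast Gradient Method pattern for BERT-style models in PyTorch: perturb the word-embedding weights along the gradient direction, accumulate gradients from a second, adversarial forward pass, then restore the original weights. This is a minimal illustration, not the paper's exact implementation; the HuggingFace-style model interface (returning `.loss`), the embedding parameter substring `word_embeddings`, and the perturbation size `epsilon=1.0` are all assumed for the example.

```python
# Minimal FGM sketch (Miyato et al.) for BERT-style fine-tuning in PyTorch.
# Hyperparameters and parameter names below are illustrative assumptions.
import torch

class FGM:
    """Perturbs the word-embedding weights along the gradient direction,
    then restores them after the adversarial backward pass."""

    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon    # perturbation radius (assumed value)
        self.emb_name = emb_name  # substring identifying embedding parameters
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # r_adv = epsilon * g / ||g||
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

def train_step(model, batch, optimizer, fgm):
    """One fine-tuning step: clean backward pass, adversarial backward
    pass on perturbed embeddings, then a single optimizer update."""
    loss = model(**batch).loss
    loss.backward()                  # gradients on the clean input
    fgm.attack()                     # add r_adv to the embedding weights
    model(**batch).loss.backward()   # accumulate adversarial gradients
    fgm.restore()                    # put the original embeddings back
    optimizer.step()
    optimizer.zero_grad()
```

Because the perturbation is applied to the embedding weights rather than the discrete tokens, FGM adds only one extra forward/backward pass per step, which is why it is a common choice for adversarial fine-tuning of BERT-like models.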
REFERENCES
- Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
- Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019).
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2015).
- T. Guan, H. Zan, X. Zhou, H. Xu, and K. Zhang. 2020. CMeIE: Construction and evaluation of Chinese medical information extraction dataset. In Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October 14–18, 2020, Proceedings, Part I.
- Yun He, Ziwei Zhu, Yin Zhang, Qin Chen, and James Caverlee. 2020. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition. arXiv preprint arXiv:2010.03746 (2020).
- Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019).
- Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.197
- Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
- Yuxiao Liang and Pengtao Xie. 2020. Identifying radiological findings related to COVID-19 from medical literature. arXiv preprint arXiv:2004.01862 (2020).
- Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994 (2020).
- Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2019. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2019).
- Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2021. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725 (2021).
- Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474 (2019).
- Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. 2019. Adversarial training for free! arXiv preprint arXiv:1904.12843 (2019).
- H. Zan, W. Li, K. Zhang, Y. Ye, and Z. Sui. 2021. Building a pediatric medical corpus: Word segmentation and named entity annotation. In Chinese Lexical Semantics.
- Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. 2019. You only propagate once: Accelerating adversarial training via maximal principle. arXiv preprint arXiv:1905.00877 (2019).
- Ningyu Zhang, Mosha Chen, Zhen Bi, Xiaozhuan Liang, Lei Li, Xin Shang, Kangping Yin, Chuanqi Tan, Jian Xu, Fei Huang, Luo Si, Yuan Ni, Guotong Xie, Zhifang Sui, Baobao Chang, Hui Zong, Zheng Yuan, Linfeng Li, Jun Yan, Hongying Zan, Kunli Zhang, Buzhou Tang, and Qingcai Chen. 2021. CBLUE: A Chinese biomedical language understanding evaluation benchmark. arXiv preprint arXiv:2106.08087 (2021).
- Ningyu Zhang, Qianghuai Jia, Kangping Yin, Liang Dong, Feng Gao, and Nengwei Hua. 2020. Conceptualized representation learning for Chinese biomedical text mining. arXiv preprint arXiv:2008.10813 (2020).
- Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764 (2020).
- Hui Zong, Jinxuan Yang, Zeyu Zhang, Zuofeng Li, and Xiaoyan Zhang. 2021. Semantic categorization of Chinese eligibility criteria in clinical trials using machine learning methods. BMC Medical Informatics and Decision Making 21, 1 (2021), 128. https://doi.org/10.1186/s12911-021-01487-w