A crowd-efficient learning approach for NER based on online encyclopedia

Li, Maolong; Li, Zhixu; Yang, Qiang; Chen, Zhigang; Zhao, Pengpeng; Zhao, Lei

doi:10.1007/s11280-019-00736-3

Published: 02 December 2019

World Wide Web, Volume 23, pages 453–470 (2020)

Maolong Li¹, Zhixu Li¹ (ORCID: orcid.org/0000-0003-2355-288X), Qiang Yang², Zhigang Chen³,⁴, Pengpeng Zhao¹ & Lei Zhao¹

Abstract

Named Entity Recognition (NER) is a core task of NLP. State-of-the-art supervised NER models rely heavily on large amounts of high-quality annotated data, which is expensive to obtain. Various methods have been proposed to reduce this reliance on large training sets, but with limited effect. In this paper, we propose a crowd-efficient learning approach for supervised NER that makes full use of online encyclopedia pages. We first define three criteria (representativeness, informativeness, diversity) to select a much smaller set of samples for crowd labeling. We then propose a data augmentation method that uses the structured knowledge of the online encyclopedia to generate substantially more training data. After training the model on the augmented sample set, we re-select new samples for crowd labeling to refine the model. We repeat the training and selection procedure until the model cannot be further improved or its performance meets our requirement. Our empirical study on several real data collections shows that our approach can reduce manual annotation by 50% while achieving almost the same NER performance as the fully trained model.
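The abstract names the three selection criteria and the stopping rule but does not define them, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each candidate sample has a feature vector and a predicted label distribution, scores representativeness as mean cosine similarity to the pool, informativeness as prediction entropy, and diversity as dissimilarity to samples already chosen, and sums them with equal weights. The names `select_batch`, `iterate`, `select`, and `train_and_score` are hypothetical; crowd labeling, encyclopedia-based augmentation, and model training are hidden behind the `train_and_score` callback.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(vectors: Sequence[Sequence[float]],
                 label_probs: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Greedily pick up to k sample indices (assumed equal-weight scoring)."""
    n = len(vectors)
    # Representativeness: mean similarity to the rest of the pool.
    rep = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    # Informativeness: uncertainty of the current model on the sample.
    info = [entropy(p) for p in label_probs]
    chosen: List[int] = []
    while len(chosen) < min(k, n):
        best_i, best_score = -1, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            # Diversity: penalize samples close to ones already chosen.
            div = 1.0 - max((cosine(vectors[i], vectors[j]) for j in chosen), default=0.0)
            score = rep[i] + info[i] + div
            if score > best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
    return chosen


def iterate(pool: List[int],
            select: Callable[[List[int]], List[int]],
            train_and_score: Callable[[List[int]], float],
            target: float) -> Tuple[List[int], float]:
    """Select, label and train repeatedly until the score stops
    improving or reaches the target (hypothetical callbacks)."""
    labeled: List[int] = []
    best = float("-inf")
    while pool:
        batch = select(pool)
        if not batch:
            break
        pool = [i for i in pool if i not in batch]
        labeled = labeled + batch
        score = train_and_score(labeled)  # labeling, augmentation, training
        if score <= best:
            break  # no further improvement
        best = score
        if best >= target:
            break  # performance requirement met
    return labeled, best
```

With two identical uncertain samples and one distinct confident sample, `select_batch` picks one of the duplicates and then the distinct sample, because the diversity term cancels the advantage of the second duplicate. The weighting is an arbitrary choice here; a product or tuned weights would be equally defensible under the same assumptions.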

Acknowledgments

This research is partially supported by the Natural Science Foundation of Jiangsu Province (No. BK20191420), the National Natural Science Foundation of China (Grant No. 61632016, 61572336, 61572335, 61772356), the Natural Science Research Project of Jiangsu Higher Education Institution (No. 17KJA520003, 18KJA520010), and the Open Program of Neusoft Corporation (No. SKLSAOP1801).

Author information

Authors and Affiliations

Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China
Maolong Li, Zhixu Li, Pengpeng Zhao & Lei Zhao
King Abdullah University of Science and Technology, Jeddah, Saudi Arabia
Qiang Yang
IFLYTEK Research, Suzhou, China
Zhigang Chen
State Key Laboratory of Cognitive Intelligence, IFLYTEK, Hefei, China
Zhigang Chen

Corresponding author

Correspondence to Zhixu Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Trust, Privacy, and Security in Crowdsourcing Computing

Guest Editors: An Liu, Guanfeng Liu, Mehmet A. Orgun, and Qing Li

About this article

Cite this article

Li, M., Li, Z., Yang, Q. et al. A crowd-efficient learning approach for NER based on online encyclopedia. World Wide Web 23, 453–470 (2020). https://doi.org/10.1007/s11280-019-00736-3

Received: 01 May 2019
Revised: 17 August 2019
Accepted: 22 September 2019
Published: 02 December 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s11280-019-00736-3
