
Fintech Key-Phrase: A New Chinese Financial High-Tech Dataset Accelerating Expression-Level Information Retrieval

Published: 22 November 2023

Abstract

Expression-level information extraction is a challenging task in natural language processing (NLP) that aims to retrieve crucial semantic information from linguistic documents. However, there is a lack of up-to-date data resources for accelerating expression-level information extraction, particularly in the Chinese financial high-technology field. To address this gap, we introduce Fintech Key-Phrase, a human-annotated key-phrase dataset for the Chinese financial high-technology domain. The dataset comprises over 12K paragraphs along with annotated domain-specific key-phrases. We extract publicly released Chinese management’s discussion and analysis (CMD&A) reports from the well-known Chinese Research Data Services platform (CNRDS) and then filter the reports related to high technology. The high-technology key-phrases are annotated following pre-defined annotation guidelines to ensure annotation quality. To better understand the limitations and challenges of the proposed dataset, we conduct comprehensive noise-evaluation experiments on Fintech Key-Phrase, including an annotation consistency assessment and an absolute annotation quality evaluation. To demonstrate the usefulness of Fintech Key-Phrase for retrieving valuable information in the Chinese financial high-technology field, we evaluate several strong information retrieval systems as representative baselines and report the corresponding performance statistics. Additionally, we apply ChatGPT-based text augmentation to the Fintech Key-Phrase dataset. Extensive comparative experiments demonstrate that the augmented dataset significantly improves the coverage and accuracy of key-phrase extraction in the finance and high-tech domains. We believe this dataset can facilitate scientific research and exploration in the Chinese financial high-technology field. We have made the Fintech Key-Phrase dataset and the experimental code of the adopted baselines accessible on GitHub: https://github.com/albert-jin/Fintech-Key-Phrase. To encourage newcomers to participate in financial high-tech information retrieval research, we have developed a series of tools, including an open website1 and corresponding real-time information retrieval APIs.2
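As a hedged illustration of the annotation consistency assessment mentioned above (the abstract does not specify the exact agreement metric, label scheme, or data format, so those are assumptions here), the following Python sketch computes a token-level Cohen's kappa between two hypothetical annotators' BIO-style key-phrase labels.

```python
# A minimal, illustrative sketch -- not the authors' released evaluation code.
# Assumption: key-phrases are marked with BIO-style token tags ("B" = begins a
# key-phrase, "I" = inside one, "O" = outside); the dataset's actual label
# scheme and file format may differ.
from sklearn.metrics import cohen_kappa_score

# Hypothetical token-level labels from two annotators for the same paragraph.
annotator_a = ["O", "B", "I", "I", "O", "O", "B", "I", "O", "O"]
annotator_b = ["O", "B", "I", "O", "O", "O", "B", "I", "O", "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Token-level Cohen's kappa: {kappa:.3f}")  # values near 1 indicate strong agreement
```

In practice, such an agreement score would be computed over all doubly annotated paragraphs rather than a single toy example.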

REFERENCES

  1. [1] Vaswani Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, 6000–6010.Google ScholarGoogle Scholar
  2. [2] Benikova Darina, Biemann Chris, and Reznicek Marc. 2014. NoSta-D named entity annotation for german: Guidelines and dataset. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), Reykjavik, Iceland, 2524–2531. http://www.lrec-conf.org/proceedings/lrec2014/pdf/276_Paper.pdfGoogle ScholarGoogle Scholar
  3. [3] Cui Yiming, Che Wanxiang, Liu Ting, Qin Bing, and Yang Ziqing. 2021. Pre-training with whole word masking for chinese BERT. IEEE Transactions on Audio, Speech and Language Processing 29 (2021), 3504–3514. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Dai Haixing, Liu Zhengliang, Liao Wenxiong, Huang Xiaoke, Cao Yihan, Wu Zihao, Zhao Lin, Xu Shaochen, Liu Wei, Liu Ninghao, Li Sheng, Zhu Dajiang, Cai Hongmin, Sun Lichao, Li Quanzheng, Shen Dinggang, Liu Tianming, and Li Xiang. 2023. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv:2302.13007. Retrieved from https://arxiv.org/abs/2302.13007Google ScholarGoogle Scholar
  5. [5] Deng Jiankang, Guo Jia, Xue Niannan, and Zafeiriou Stefanos. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 46854694. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Devlin Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Eddy Sean R.. 1996. Hidden Markov models. Current Opinion in Structural Biology 6, 3 (1996), 361365. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  8. [8] Fei Hao, Ren Yafeng, and Ji Donghong. 2020. Dispatched attention with multi-task learning for nested mention recognition. Information Sciences 513 (2020), 241251. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Forney Jr. G. D.. 1973. The viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278. Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Gao Jianqi, Yu Hang, and Zhang Shuang. 2022. Joint event causality extraction using dual-channel enhanced neural network. Knowledge-Based Systems 258 (2022), 109935. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. [11] Gao Jun, Zhao Huan, Yu Changlong, and Xu Ruifeng. 2023. Exploring the Feasibility of ChatGPT for Event Extraction. arXiv:2303.03836. Retrieved from https://arxiv.org/abs/2303.03836Google ScholarGoogle Scholar
  12. [12] Ghosh Soumitra, Ekbal Asif, and Bhattacharyya Pushpak. 2021. A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes. Cognitive Computation 14, 1 (2021), 110129. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Ghosh Soumitra, Roy Swarup, Ekbal Asif, and Bhattacharyya Pushpak. 2022. CARES: CAuse recognition for emotion in suicide notes. In Proceedings of the Advances in Information Retrieval. Hagen Matthias, Verberne Suzan, Macdonald Craig, Seifert Christin, Balog Krisztian, Nørvåg Kjetil, and Setty Vinay (Eds.), Springer International Publishing, Cham, 128136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] Glorot and Bengio Yoshua. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Bengio Yoshua and LeCun Yann (Eds.), Vol. 9, PMLR, 249256.Google ScholarGoogle Scholar
  15. [15] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2015. Delving Deep into Rectifiers: Surpassing Human-level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV). 1026–1034. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Hochreiter Sepp and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9 (1997), 1735–80. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Huang Zhiheng, Xu Wei, and Yu Kai. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991. Retrieved from https://arxiv.org/abs/1508.01991Google ScholarGoogle Scholar
  18. [18] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Global vectors for word representation. In Proceedings of the EMNLP. 15321543.Google ScholarGoogle Scholar
  19. [19] Jiao Wenxiang, Wang Wenxuan, Huang Jen tse, Wang Xing, and Tu Zhaopeng. 2023. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv:2301.08745. Retrieved from https://arxiv.org/abs/2301.08745Google ScholarGoogle Scholar
  20. [20] Jin Weiqiang, Zhao Biao, Yu Hang, Tao Xi, Yin Ruiping, and Liu Guizhong. 2022. Improving embedded knowledge graph multi-hop question answering by introducing relational chain reasoning. Data Mining and Knowledge Discovery 37, 1 (11 Nov 2022), 255–288. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] Jin Weiqiang, Zhao Biao, Zhang Liwen, Liu Chenxing, and Yu Hang. 2023. Back to common sense: Oxford dictionary descriptive knowledge augmentation for aspect-based sentiment analysis. Information Processing and Management 60, 3 (2023), 103260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Lafferty John D., McCallum Andrew, and Pereira Fernando C. N.. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, 282–289.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Lee Chih-Hen, Chiang Yi-Shyuan, and Wang Chuan-Ju. 2022. INForex: Interactive news digest for forex investors. In Proceedings of the Advances in Information Retrieval. Hagen Matthias, Verberne Suzan, Macdonald Craig, Seifert Christin, Balog Krisztian, Nørvåg Kjetil, and Setty Vinay (Eds.), Springer International Publishing, Cham, 300304. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. [24] Li Yuan, Fang Biaoyan, He Jiayuan, Yoshikawa Hiyori, Akhondi Saber A., Druckenbrodt Christian, Thorne Camilo, Zhai Zenan, Afzal Zubair, Cohn Trevor, Baldwin Timothy, and Verspoor Karin. 2022. The ChEMU 2022 evaluation campaign: Information extraction in chemical patents. In Proceedings of the Advances in Information Retrieval. Hagen Matthias, Verberne Suzan, Macdonald Craig, Seifert Christin, Balog Krisztian, Nørvåg Kjetil, and Setty Vinay (Eds.), Springer International Publishing, Cham, 400407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. [25] Liu Hongzhe, Wang Ningwei, Li Xuewei, Xu Cheng, and Li Yaze. 2022. BFF R-CNN: Balanced feature fusion for object detection. IEICE TRANSACTIONS on Information and Systems 105, 8 (2022), 14721480.Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Liu Lemao, Zhang Haisong, Jiang Haiyun, Li Yangming, Zhao Enbo, Xu Kun, Song Linfeng, Zheng Suncong, Zhou Botong, Zhu Jianchen, Feng Xiao, Chen Tao, Yang Tao, Yu Dong, Zhang Feng, Kang Zhanhui, and Shi Shuming. 2021. TexSmart: A system for enhanced natural language understanding. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP): System Demonstrations.Google ScholarGoogle ScholarCross RefCross Ref
  27. [27] Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692Google ScholarGoogle Scholar
  28. [28] Loshchilov Ilya and Hutter Frank. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7Google ScholarGoogle Scholar
  29. [29] Mikolov Tomás, Chen Kai, Corrado Greg, and Dean Jeffrey. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1301.3781Google ScholarGoogle Scholar
  30. [30] Nguyen Tuan-Anh Dang and Thanh Dat Nguyen. 2019. End-to-end information extraction by character-level embedding and multi-stage attentional U-Net. In Proceedings of the BMVC.Google ScholarGoogle Scholar
  31. [31] Palm Rasmus Berg, Hovy Dirk, Laws Florian, and Winther Ole. 2017. End-to-end information extraction without token-level supervision. In Proceedings of the Workshop on Speech-Centric Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 4852. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Qiang Jipeng, Chen Ping, Ding Wei, Wang Tong, Xie Fei, and Wu Xindong. 2019. Heterogeneous-length text topic modeling for reader-aware multi-document summarization. ACM Transactions on Knowledge Discovery from Data 13, 4 (2019), 21 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Ringland Nicola. 2015-09-30. Structured Named Entities. Ph. D. Dissertation. The University of Sydney. Retrieved from http://hdl.handle.net/2123/14558Google ScholarGoogle Scholar
  34. [34] Shen Yongliang, Ma Xinyin, Tan Zeqi, Zhang Shuai, Wang Wen, and Lu Weiming. 2021. Locate and label: A two-stage identifier for nested named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 27822794.Google ScholarGoogle ScholarCross RefCross Ref
  35. [35] Shen Yongliang, Wang Xiaobin, Tan Zeqi, Xu Guangwei, Xie Pengjun, Huang Fei, Lu Weiming, and Zhuang Yueting. 2022. Parallel instance query network for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 947961.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Sun Shuyan. 2011. Meta-analysis of Cohen’s kappa. Health Services and Outcomes Research Methodology 11, 3 (2011), 145163. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Wang Ningwei, Li Yaze, and Liu Hongzhe. 2021. Reinforced neighbour feature fusion object detection with deep learning. Symmetry 13, 9 (2021), 1623.Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Wang Yu, Tong Hanghang, Zhu Ziye, and Li Yun. 2022. Nested named entity recognition: A survey. ACM Transactions on Knowledge Discovery from Data 16, 6, Article 108 (2022), 29 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Wei Xiang, Cui Xingyu, Cheng Ning, Wang Xiaobin, Zhang Xin, Huang Shen, Xie Pengjun, Xu Jinan, Chen Yufeng, Zhang Meishan, Jiang Yong, and Han Wenjuan. 2023. Zero-Shot Information Extraction via Chatting with ChatGPT. arXiv:2302.10205. Retrieved from https://arxiv.org/abs/2302.10205Google ScholarGoogle Scholar
  40. [40] Xia Nan, Yu Hang, Wang Yin, Xuan Junyu, and Luo Xiangfeng. 2023. DAFS: A domain aware few shot generative model for event detection. Machine Learning 112, 3 (2023), 10111031. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. [41] Xiao Yuquan and Du Qinghe. 2023. Statistical Age-of-Information Optimization for Status Update over Multi-State Fading Channels. arXiv:2303.11153. Retrieved from https://arxiv.org/abs/2303.11153Google ScholarGoogle Scholar
  42. [42] Yan Hang, Deng Bocao, Li Xiaonan, and Qiu Xipeng. 2019. TENER: Adapting Transformer Encoder for Named Entity Recognition. arXiv:1911.04474. Retrieved from https://arxiv.org/abs/1911.04474Google ScholarGoogle Scholar
  43. [43] Yuquan Xiao, Qinghe Du, Wenchi Cheng, and Wei Zhang. 2023. Adaptive sampling and transmission for minimizing age of information in metaverse. IEEE Journal on Selected Areas in Communications, Early Access (2023).Google ScholarGoogle Scholar
  44. [44] Zhao Biao, Jin Weiqiang, Ser Javier Del, and Yang Guang. 2023. ChatAgri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification. Neurocomputing 557 (2023), 126708. Google ScholarGoogle ScholarCross RefCross Ref
  45. [45] Zhong Qihuang, Ding Liang, Liu Juhua, Du Bo, and Tao Dacheng. 2023. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT. arXiv:2302.10198. Retrieved from https://arxiv.org/abs/2302.10198Google ScholarGoogle Scholar
  46. [46] Zhou GuoDong, Zhang Jie, Su Jian, Shen Dan, and Tan ChewLim. 2004. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics 20, 7 (2004), 11781190. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Zhou Peng, Shi Wei, Tian Jun, Qi Zhenyu, Li Bingchen, Hao Hongwei, and Xu Bo. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, 207212. DOI:Google ScholarGoogle ScholarCross RefCross Ref


            • Published in

              ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 11
              November 2023
              255 pages
              ISSN: 2375-4699
              EISSN: 2375-4702
              DOI: 10.1145/3633309
              • Editor: Imed Zitouni


              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 22 November 2023
              • Online AM: 1 November 2023
              • Accepted: 11 October 2023
              • Revised: 13 June 2023
              • Received: 28 December 2022
              Published in TALLIP Volume 22, Issue 11
