Abstract
Expression-level information extraction is a challenging task in natural language processing (NLP) that aims to retrieve crucial semantic information from linguistic documents. However, up-to-date data resources for advancing expression-level information extraction are lacking, particularly in the Chinese financial high-technology field. To address this gap, we introduce Fintech Key-Phrase: a human-annotated key-phrase dataset for the Chinese financial high-technology domain. The dataset comprises over 12K paragraphs with annotated domain-specific key-phrases. We extract publicly released reports, namely Chinese management’s discussion and analysis (CMD&A), from the renowned Chinese research data services platform (CNRDS) and then filter for the reports related to high technology. The high-technology key-phrases are annotated following pre-defined annotation guidelines to ensure annotation quality. To better understand the limitations and challenges of the proposed dataset, we conduct comprehensive noise evaluation experiments on Fintech Key-Phrase, including an annotation consistency assessment and an absolute annotation quality evaluation. To demonstrate the usefulness of Fintech Key-Phrase for retrieving valuable information in the Chinese financial high-technology field, we evaluate several strong information retrieval systems as representative baselines and report the corresponding performance statistics. Additionally, we apply ChatGPT as a text augmentation approach for the Fintech Key-Phrase dataset. Extensive comparative experiments demonstrate that the augmented dataset significantly improves the coverage and accuracy of key-phrase extraction in the finance and high-tech domains. We believe that this dataset can facilitate scientific research and exploration in the Chinese financial high-technology field.
We have made the Fintech Key-Phrase dataset and the experimental code of the adopted baselines available on GitHub: https://github.com/albert-jin/Fintech-Key-Phrase. To encourage newcomers to participate in information retrieval research for the financial high-tech domain, we have developed a series of tools, including an open website and corresponding real-time information retrieval APIs.
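The annotation consistency assessment mentioned above can be illustrated with a chance-corrected agreement score between two annotators. The abstract does not state which statistic is used, so the sketch below assumes Cohen's kappa computed over character-level BIO tags. The function name `cohen_kappa` and the two tag sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: agreement is measured per tag (e.g. per character),
    not per extracted key-phrase span.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical tags.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators tagged independently
    # according to their own tag frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical BIO tags from two annotators for the same eight characters.
annotator_1 = ["B", "I", "I", "O", "O", "B", "I", "O"]
annotator_2 = ["B", "I", "O", "O", "O", "B", "I", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance. Span-level agreement (exact match of whole key-phrases) is a stricter alternative that this sketch does not cover.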