research-article

Fintech Key-Phrase: A New Chinese Financial High-Tech Dataset Accelerating Expression-Level Information Retrieval

Authors:
Weiqiang Jin, School of Information and Communications Engineering, Xi’an Jiaotong University, China (ORCID: 0000-0002-6656-6061)
Biao Zhao, School of Information and Communications Engineering, Xi’an Jiaotong University, China (ORCID: 0000-0002-3651-0702)
Yu Zhang, School of Information and Communications Engineering, Xi’an Jiaotong University, China (ORCID: 0009-0006-2971-7840)
Gege Sun, School of Information and Communications Engineering, Xi’an Jiaotong University, China (ORCID: 0009-0000-2921-7877)
Hang Yu, School of Computer Engineering and Science, Shanghai University, China (ORCID: 0000-0003-3444-9992)

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 11, Article No. 253, pp. 1–37. https://doi.org/10.1145/3627989

Published: 22 November 2023

Abstract

Expression-level information extraction is a challenging natural language processing (NLP) task that aims to retrieve crucial semantic information from documents. However, up-to-date data resources for this task are lacking, particularly in the Chinese financial high-technology field. To address this gap, we introduce Fintech Key-Phrase, a human-annotated key-phrase dataset for the Chinese financial high-technology domain. The dataset comprises over 12K paragraphs with annotated domain-specific key-phrases. We extract publicly released reports, Chinese management’s discussion and analysis (CMD&A), from the Chinese Research Data Services platform (CNRDS) and then filter for the reports related to high technology. The high-technology key-phrases are annotated following pre-defined guidelines to ensure annotation quality. To better understand the limitations and challenges of the proposed dataset, we conducted comprehensive noise evaluation experiments on Fintech Key-Phrase, including an annotation consistency assessment and an absolute annotation quality evaluation. To demonstrate the usefulness of Fintech Key-Phrase for retrieving valuable information in the Chinese financial high-technology field, we evaluate it using several strong information retrieval systems as representative baselines and report the corresponding performance statistics. We also applied ChatGPT-based text augmentation to the Fintech Key-Phrase dataset. Extensive comparative experiments demonstrate that the augmented dataset significantly improves the coverage and accuracy of key-phrase extraction in the finance and high-tech domains. We believe that this dataset can facilitate scientific research and exploration in the Chinese financial high-technology field.
We have made the Fintech Key-Phrase dataset and the experimental code of the adopted baselines available on GitHub: https://github.com/albert-jin/Fintech-Key-Phrase. To encourage newcomers to take part in information retrieval research for the financial high-tech domain, we have developed a series of tools, including an open website¹ and corresponding real-time information retrieval APIs.²
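
The abstract mentions an annotation consistency assessment but does not say which agreement statistic was used. As a minimal sketch only, assuming two annotators assign one tag per character and that agreement is measured with Cohen's kappa (an assumption, not a method confirmed by the abstract), the computation could look like this. The function name, the BIO-style tags, and the sample label sequences are hypothetical illustrations, not data from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences.

    Assumption: each position holds one categorical tag per annotator
    (here hypothetical BIO tags marking key-phrase boundaries).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(labels_a)
    # Observed agreement: share of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical tags from two annotators for the same eight characters.
annotator_1 = ["B", "I", "O", "O", "B", "I", "I", "O"]
annotator_2 = ["B", "I", "O", "O", "B", "I", "O", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.4f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for tag sequences dominated by a single "outside" label.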

Index Terms

1. Applied computing
  1. Electronic commerce
    1. Electronic data interchange
2. Information systems
  1. Information retrieval
  2. World Wide Web
    1. Web mining
      1. Data extraction and integration

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 11
November 2023
255 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3633309
Editor:
Imed Zitouni
Google, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 November 2023
- Online AM: 1 November 2023
- Accepted: 11 October 2023
- Revised: 13 June 2023
- Received: 28 December 2022

Author Tags
Information retrieval
expression-level information extraction
financial high technology field
Chinese management’s discussion and analysis
ChatGPT-based data augment
Qualifiers
- research-article