Abstract
Information retrieval systems that search for relevant knowledge in scientific publications have advanced considerably in recent years. Although these techniques are powerful, there is still room for improvement in searching full-text publication datasets for metadata relating to algorithms: for instance, efficiency-related metrics such as precision, recall, f-measure and accuracy, and other useful metadata such as the datasets deployed and the algorithmic run-time complexity. In this study, we propose a novel deep learning-based feature engineering approach that improves search capabilities by mining algorithm-specific metadata from full-text scientific publications. Traditional term frequency-inverse document frequency (TF-IDF)-based approaches function like a ‘bag of words’ model and thus fail to capture either the semantics of the text or the word sequence. In this work, we build a semantically enriched synopsis of each full-text document by adding algorithm-specific deep metadata text lines, which enhances the search mechanism of algorithm search systems. These text lines are classified by a deep learning-based bi-directional long short-term memory (LSTM) model. The bi-directional LSTM model outperformed the support vector machine by 9.46%, with a 0.81 F1-score on a dataset of 37,000 algorithm-specific deep metadata text lines that had been tagged by four human experts. Lastly, we present a case study on 21,940 full-text publications downloaded from ACL (https://aclweb.org/) to show the effectiveness of deep learning-based advanced feature engineering search compared to the conventional TF-IDF-based (Lucene) search.
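The remark that TF-IDF behaves like a ‘bag of words’ model can be illustrated with a minimal sketch. The weighting formula below (term frequency times a smoothed inverse document frequency) is one common variant chosen for illustration; it is an assumption and not necessarily the scoring used by Lucene or by the authors. The function name `tfidf_vectors` and the three example sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative TF-IDF variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Assumed weighting: relative term frequency times smoothed IDF.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sentences: the first two contain the same words in a
# different order and make opposite claims.
docs = [
    "the svm outperformed the lstm",
    "the lstm outperformed the svm",
    "precision and recall are reported",
]
vecs = tfidf_vectors(docs)

# Word order is discarded, so the first two representations are identical.
same_vector = vecs[0] == vecs[1]
print(same_vector)
```

Because the two opposite statements map to the same vector, a purely TF-IDF-based index cannot tell them apart, whereas a sequence model such as a bi-directional LSTM reads the tokens in order.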
Acknowledgements
This research work is supported by the NRPU Grant # 6857, funded by the Higher Education Commission of Pakistan.
Cite this article
Safder, I., Hassan, SU. Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications. Scientometrics 119, 257–277 (2019). https://doi.org/10.1007/s11192-019-03025-y
DOI: https://doi.org/10.1007/s11192-019-03025-y