Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications

Scientometrics

Abstract

Information retrieval systems designed to search for relevant knowledge in scientific publications have advanced considerably in recent years. Although these techniques are powerful, there is still room for improvement in searching full-text publication datasets for metadata relating to algorithms: for instance, efficiency-related metrics such as precision, recall, f-measure and accuracy, and other useful metadata such as the datasets deployed and the algorithmic run-time complexity. In this study, we propose a novel deep learning-based feature engineering approach that improves search capabilities by mining algorithm-specific metadata from full-text scientific publications. Traditional term frequency-inverse document frequency (TF-IDF)-based approaches function like a ‘bag of words’ model and thus fail to capture either the semantics of the text or the word sequence. In this work, we design a semantically enriched synopsis of each full-text document by adding algorithm-specific deep metadata text lines to enhance the search mechanism of algorithm search systems. These text lines are classified by our deep learning-based bidirectional long short-term memory (LSTM) model. The bidirectional LSTM model outperformed the support vector machine by 9.46%, with a 0.81 F1-score on a dataset of 37,000 algorithm-specific deep metadata text lines tagged by four human experts. Lastly, we present a case study on 21,940 full-text publications downloaded from ACL (https://aclweb.org/) to show the effectiveness of deep learning-based advanced feature engineering search compared with conventional TF-IDF-based (Lucene) search.
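
The ‘bag of words’ limitation noted in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and not Lucene's scoring formula: the function names (`tfidf_vectors`, `cosine`), the whitespace tokenizer, the smoothed IDF variant and the three example sentences are all assumptions chosen for illustration. It shows that two sentences with opposite meanings but the same words receive identical TF-IDF vectors.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative TF-IDF)."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: relative term frequency times a smoothed IDF.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical sentences: the first two reverse who outperforms whom.
docs = [
    "our algorithm outperforms the baseline",
    "the baseline outperforms our algorithm",
    "we evaluate run-time complexity on three datasets",
]
vecs = tfidf_vectors(docs)
same_words_similarity = cosine(vecs[0], vecs[1])
different_words_similarity = cosine(vecs[0], vecs[2])
```

Because the vectors only count terms, the first two sentences are indistinguishable to this representation even though they make opposite claims. A sequence model such as a bidirectional LSTM reads the tokens in order, which is the kind of information the abstract says TF-IDF fails to capture.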

Acknowledgements

This research work is supported by the NRPU Grant # 6857, funded by the Higher Education Commission of Pakistan.

Author information

Corresponding author

Correspondence to Iqra Safder.

About this article

Cite this article

Safder, I., Hassan, SU. Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications. Scientometrics 119, 257–277 (2019). https://doi.org/10.1007/s11192-019-03025-y

  • DOI: https://doi.org/10.1007/s11192-019-03025-y
