DOI: 10.1145/3558100.3563850
Short Paper
Open Access

Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC

Published: 18 November 2022

ABSTRACT

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets, containing more than 130 million scholarly paper records. A significant portion of S2ORC's metadata is automatically generated, and its quality can affect downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate of the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimate that the overall document linking quality is high, with 92.6% of documents correctly linked to six major databases, although the linking quality varies across subject domains. The document conflation rate is about 2.6%, meaning that roughly 97.4% of documents are unique. We further quantitatively compare three near-duplicate detection methods using ground truth created from S2ORC. The experiments indicate that locality-sensitive hashing is the best method in terms of both effectiveness and scalability, achieving high performance (F1 = 0.960) at a substantially reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
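
The abstract singles out locality-sensitive hashing (LSH) as the most effective and scalable of the three near-duplicate detection methods compared. As a rough illustration of the technique (a minimal sketch, not the authors' implementation, which is available in the repository linked above), the self-contained Python snippet below shingles each record's title/abstract text, computes MinHash signatures, and bands the signatures so that only records colliding in some band become candidate duplicate pairs. All record IDs, texts, and parameter values here are hypothetical.

import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_PERM = 128    # MinHash signature length (number of simulated permutations)
BANDS = 32        # LSH bands; rows per band = NUM_PERM // BANDS = 4
SHINGLE_SIZE = 3  # word n-gram size for shingling

def shingles(text: str, k: int = SHINGLE_SIZE) -> Set[str]:
    # Lower-cased word k-shingles of a record's title/abstract text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set: Set[str], num_perm: int = NUM_PERM) -> List[int]:
    # One MinHash value per "permutation", simulated by salting a fast hash.
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "big")
            for s in shingle_set
        ))
    return sig

def lsh_candidate_pairs(signatures: Dict[str, List[int]],
                        bands: int = BANDS) -> Set[Tuple[str, str]]:
    # Band each signature; records colliding in any band become candidate pairs.
    rows = NUM_PERM // bands
    candidates: Set[Tuple[str, str]] = set()
    for b in range(bands):
        buckets: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

# Hypothetical toy records standing in for S2ORC metadata entries.
records = {
    "s2-001": "scholarly big data quality assessment a case study of document linking",
    "s2-002": "scholarly big data quality assessment case study of document linking",
    "s2-003": "citation intent classification using word embedding",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in records.items()}
print(lsh_candidate_pairs(sigs))  # typically prints {('s2-001', 's2-002')}

With 32 bands of 4 rows over 128-value signatures, a pair with Jaccard similarity 0.5 is surfaced with probability about 1 - (1 - 0.5^4)^32 ≈ 0.87, rising steeply for more similar pairs; this banding is what lets LSH avoid a quadratic all-pairs comparison and accounts for the reduced runtime the abstract reports.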


Published in

DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering
September 2022, 118 pages
ISBN: 9781450395441
DOI: 10.1145/3558100

Copyright © 2022 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 178 of 537 submissions, 33%