ABSTRACT
Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets, with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata, and the quality of that metadata can affect downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies across subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much shorter runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
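To make the near-duplicate detection step more concrete, the following is a minimal sketch of locality-sensitive hashing for text, assuming the common MinHash-with-banding variant over word shingles. The abstract does not say which LSH variant, features, or parameters the authors used, so the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the settings (`k=3`, `num_perm=64`, `bands=16`) are illustrative assumptions, not the paper's implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed setting)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """Build a MinHash signature: the minimum of each of num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(docs, num_perm=64, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            # Documents sharing all rows of a band land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical records, invented for illustration only.
    records = {
        "r1": "near duplicate detection in an academic digital library",
        "r2": "near duplicate detection in an academic digital library",
        "r3": "statistical analysis of sample size and power estimations",
    }
    print(candidate_pairs(records))
```

Only documents that share a bucket are compared, which is where the runtime saving over all-pairs comparison comes from. In a real pipeline the candidate pairs would typically be verified with an exact similarity measure before two records are declared duplicates.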
Index Terms
- Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC