short-paper

Open Access

Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC

Authors:
Jian Wu (Old Dominion University), Ryan Hiltabrand (Old Dominion University), Dominik Soós (Old Dominion University), C. Lee Giles (Pennsylvania State University)

DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering, September 2022, Article No. 16, Pages 1–4. https://doi.org/10.1145/3558100.3563850

Published: 18 November 2022

ABSTRACT

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
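The abstract names locality-sensitive hashing as the best of the three near-duplicate detection methods but does not show how it works. The following is a minimal sketch of one common form of the idea, MinHash signatures with banded LSH, written with the Python standard library only. It is not the authors' implementation (their code is in the repository linked above). All function names, the character 5-gram shingling, the 64 hash functions, the 16 bands, and the 0.8 Jaccard threshold are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: the minimum hash value under each hash function."""
    return tuple(
        min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)
    )


def candidate_pairs(signatures, bands=16):
    """Pairs of document ids whose signatures agree on at least one band."""
    num_perm = len(next(iter(signatures.values())))
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, num_perm=64, bands=16):
    """Return id pairs whose shingle sets have Jaccard similarity >= threshold.

    LSH only proposes candidates; each candidate is verified exactly.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s, num_perm) for doc_id, s in sets.items()}
    return {
        (a, b)
        for a, b in candidate_pairs(sigs, bands)
        if jaccard(sets[a], sets[b]) >= threshold
    }


if __name__ == "__main__":
    # Hypothetical records: "a" and "b" differ only in case and spacing.
    docs = {
        "a": "Scholarly Big Data  Quality Assessment",
        "b": "scholarly big data quality assessment",
        "c": "Locality-sensitive hashing for image retrieval",
    }
    print(sorted(near_duplicates(docs)))
```

Banding is what makes the approach scale: only records that collide in at least one band are compared exactly, so the number of pairwise comparisons can be far below the quadratic cost of comparing every record with every other. Fewer rows per band raise recall at the cost of more candidates to verify.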

Index Terms

Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC
1. Information systems
  1. Data management systems
    1. Information integration
  2. Information systems applications
    1. Digital libraries and archives

Published in
DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering
September 2022
118 pages
ISBN: 9781450395441
DOI: 10.1145/3558100
General Chairs: Curtis Wigington (Adobe Systems Incorporated), Matthew Hardy (Adobe Systems Incorporated)
Program Chairs: Steven R. Bagley (University of Nottingham, United Kingdom), Steven Simske (Colorado State University, Fort Collins, CO)
Copyright © 2022 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data quality
deduplication
document conflation
document linking
scholarly big data
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%