FAIR Data Digest #13
supporting many use cases with FAIR publication metadata of more than 249 million publications
Hi everyone,
This last summer-break edition ties in with the previous one. Last week I wrote about research to make publication metadata from the free publishing service CEUR-WS available in a FAIR fashion, so that the data can be explored. Today I focus on a smaller initiative to make metadata of 50 papers available, as well as research that makes metadata about an astonishing 249 million publications available! Both are necessary.
🔍 Usually there are many players in a data ecosystem. Any one of those players can contribute towards FAIR data. Recently, the organizers of the upcoming SEMANTiCS conference (Q59916981) wrote a blog post about how they embrace Open Science best practices with a semi-automatic pipeline and manual curation. I briefly discuss their post and link it for you.
🧪 What’s the trend of publications around a certain topic for an institution? Obtaining such strategically important information is cumbersome. It must either be curated manually or retrieved via paid services from publishing houses. But those data silos might be difficult to access or may simply be discontinued, as happened to the Microsoft Academic Graph (Q62056662) (source). In today’s Science Spotlight I present a paper about an operational open data platform that makes metadata of 249 million publications available under the open license CC0 (so anyone can freely use the data). I will summarize their incredible work and show that nobody’s perfect.
🔍 Insights: Open Science best practices at SEMANTiCS conference
Let’s start small. There are many players in the academic world, ranging from conferences organized by researchers to publishers, academic institutions, and funding bodies. Any one of those players can contribute to a FAIR data ecosystem.
Recently the organizers of the SEMANTiCS conference announced how they eat their own dog food and embrace Open Science best practices. They created and published FAIR data about their 50 accepted papers, involving automatic processing and manual curation.
This allowed them to create insightful statistics. For example, anyone can see that 26 out of the 50 accepted papers are resource papers, meaning that they present a resource such as a dataset or a tool that the peer reviewers deemed relevant. The statistics also invite interpretation: according to the organizers, sharing persistent identifiers for code does not seem to be widespread practice.
Other Semantic Web-related conferences such as ESWC (Q17012957) also publish FAIR metadata (source). If more conferences did this, we would get high-quality metadata at the source, which would surely improve the data quality of downstream applications and data aggregation.
So far they use the data as found in the conference management system EasyChair (Q1278323). Unfortunately, it contains no author identifiers, which makes downstream tasks more difficult due to ambiguous names. The next item in this newsletter shows why this can become a problem. Curious? Check out their short blog post to learn about their semi-automatic workflow with manual curation.
🧪 Science Spotlight: Metadata of 249 million publications in SemOpenAlex
Staying on top of the research game is difficult because millions of articles are published every year. Whether you are looking for a single piece of information or for more analytical data, finding it can be relevant for strategic decisions. More than that, all the data-driven services we are used to nowadays rely on high-quality data. Believe it or not, you can now access metadata about 249 million publications under the open CC0 license.
Since the discontinuation of the Microsoft Academic Graph, the OpenAlex initiative (Q107507571) has been making large amounts of metadata about scientific publications freely available. However, as a reader of this newsletter you probably know that there is a huge difference between having data and having FAIR data! (Spoiler: the latter is better.)
A paper that will be presented at the upcoming International Semantic Web Conference 2023 (Q119153957), or ISWC for short, presents SemOpenAlex: a web platform and data dump, created monthly on top of the OpenAlex data. It makes that data truly FAIR by following the Linked Open Data principles.
A publicly available resource with many use cases
As outlined by the authors, the SemOpenAlex knowledge graph, which anyone can access and query, has several use cases:
Scholarly Big Data analytics
Large-Scale scientific impact quantification
Scholarly search and recommender systems
Semantic scientific publishing
Research project management and modeling
Groundwork for scientific publishing in the future
Knowledge-guided language models
Benchmarking
Human-in-the-loop seems necessary
I looked myself up in their database and found 3 entries (see image below). The first match is the most comprehensive record, covering almost all of my publications. The other two are duplicates, each linked to only one publication. Funnily enough, both duplicates are linked to the same publication, and that publication is itself duplicated. Even more ironically, the duplicated publication is about Linked Open Data validity!
I would love to flag the duplicates so that they can be merged and the data quality improved. Unfortunately, the platform in its current form does not allow this. I guess this is because they just use data from OpenAlex, so I would have to figure out how to correct my data there at the source. However, OpenAlex is itself a data aggregator, which makes this more difficult. Such problems can probably be avoided by using unique author identifiers at all times (an issue also apparent in the SEMANTiCS data outlined in the previous newsletter item).
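To make the point about author identifiers more concrete, here is a minimal Python sketch of why names alone are a weak key. Everything in it is an assumption for illustration: the records, the field names, and the identifiers are made up, and this is not how OpenAlex or SemOpenAlex actually disambiguate authors. It only shows that name matching can surface candidate duplicates, which a human (or a persistent identifier) then has to confirm.

```python
from collections import defaultdict

# Hypothetical author records: one comprehensive entry and two sparse
# duplicates, loosely mirroring the situation described above.
# All ids, names, and work ids are invented for this example.
records = [
    {"id": "A1", "name": "Jane Doe", "works": ["W1", "W2", "W3"]},
    {"id": "A2", "name": "Jane  Doe", "works": ["W3"]},
    {"id": "A3", "name": "DOE, Jane", "works": ["W3-copy"]},
    {"id": "A4", "name": "John Smith", "works": ["W9"]},
]


def normalize(name):
    """Lower-case a name, reorder 'Last, First', and collapse whitespace."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.lower().split())


def candidate_duplicates(records):
    """Group record ids by normalized name; keep groups with more than one id."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec["id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


duplicates = candidate_duplicates(records)
print(duplicates)
```

Note that this only yields candidates: two different researchers who share a name would end up in the same group, which is exactly why a human in the loop, or a unique author identifier attached at the source, is needed before merging anything.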
Curious to learn more about SemOpenAlex? Check out their openly available paper (DOI: 10.48550/arXiv.2308.03671) or dive into the data yourself using their web application!
That’s it for this week of the FAIR Data Digest. I hope you found the content interesting. Don’t forget to share or subscribe. See you next week!
Sven