1 Introduction

With the increasing number of scientific publications, staying up-to-date with current research presents a significant challenge. For instance, in 2022 alone, more than 8 million scientific publications were registered [1]. To explore related scholarly entities such as authors and institutions, researchers rely on a range of methods from search interfaces to recommendation systems [2, 3]. One effective way to model the underlying scholarly data is to represent it as an RDF knowledge graph (KG). Doing so facilitates standardization, visualization, and interlinking with Linked Data resources [4]. Consequently, scholarly KGs play a pivotal role in transforming document-centric scholarly data into interconnected and machine-actionable knowledge structures [2].

However, available scholarly KGs have one or several of the following limitations. Firstly, they rarely contain an exhaustive catalog of publications across all disciplines [5]. Secondly, they often cover only certain disciplines, such as computer science [6]. Thirdly, they are not regularly updated, rendering many analyses and business models obsolete [7]. Fourthly, they often contain usage restrictions [8]. Lastly, even if they fulfill these requirements, they are not available according to W3C standards such as RDF [1, 9]. These issues hinder the application of scientific KGs on a broad scale, such as in comprehensive search and recommender systems, or for scientific impact quantification. For instance, the Microsoft Academic Graph was discontinued in 2021 [10], which hinders further updates to its derivative in RDF, the Microsoft Academic Knowledge Graph (MAKG) [7]. This leaves a gap that the novel OpenAlex dataset aims to fill [1]. However, the data in OpenAlex is not available in RDF and does not comply with Linked Data Principles [11]. Consequently, OpenAlex cannot be considered a KG, which makes semantic queries, integration into existing applications, or linking to additional resources non-trivial. At first glance, integrating scholarly data about scientific papers into Wikidata and thus contributing to the WikiCite initiative may seem like an obvious solution. However, apart from the dedicated schema, the volume of the data is already so large that the Blazegraph triplestore which is used in the Wikidata Query Service reaches its capacity limit, preventing any integration [12] (see Sect. 2).

In this paper, we introduce SemOpenAlex, an extremely large RDF dataset of the academic landscape with its publications, authors, sources, institutions, concepts, and publishers. SemOpenAlex consists of more than 26 billion semantic triples and includes over 249 million publications from all academic disciplines. It is based on our rich ontology (see Sect. 3.1) and includes links to other LOD sources such as Wikidata, Wikipedia, and the MAKG. To ensure easy and efficient use of SemOpenAlex’s integration with the LOD cloud, we provide a public SPARQL endpoint. In addition, we provide a sophisticated semantic search interface that allows users to retrieve real-time information about contained entities and their semantic relationships (e.g., displaying co-authors or an author’s top concepts, information that is not directly contained in the database but is obtained through semantic reasoning). We also provide the full RDF data snapshots to enable big data analysis. Due to the large size of SemOpenAlex and the ever-increasing number of scientific publications being integrated into it, we have established a pipeline using AWS for regularly updating SemOpenAlex without any service interruptions. Additionally, to use SemOpenAlex in downstream applications, we trained state-of-the-art knowledge graph entity embeddings. By reusing existing ontologies whenever possible, we ensure system interoperability in accordance with FAIR principles [13] and pave the way for the integration of SemOpenAlex into the Linked Open Data Cloud. We fill the gap left by the discontinuation of MAKG by providing monthly updates that facilitate ongoing monitoring of an author’s scientific impact, tracking of award-winning research, and other use cases using our data [14, 15]. By making SemOpenAlex free and unrestricted, we empower research communities across all disciplines to use the data it contains and integrate it into their projects. Initial use cases and production systems that use SemOpenAlex already exist (see Sect. 5).

Overall, we make the following contributions:

  1. We create an ontology for SemOpenAlex reusing common vocabularies.

  2. We create the SemOpenAlex knowledge graph in RDF, covering 26 billion triples, and provide all SemOpenAlex data, code, and services for public access at https://semopenalex.org/:

    (a) We provide monthly updated RDF data snapshots free of charge on AWS S3 at s3://semopenalex (via browser: https://semopenalex.s3.amazonaws.com/browse.html), accepted as AWS Open Data project.Footnote 1

    (b) We make all URIs of SemOpenAlex resolvable, allowing SemOpenAlex to be part of the Linked Open Data cloud.Footnote 2

    (c) We index all data in a triple store and make it publicly available via a SPARQL endpoint (https://semopenalex.org/sparql).

    (d) We provide a semantic search interface including entity disambiguation to access, search, and visualize the knowledge graph and its statistical key figures in real time.

  3. We provide state-of-the-art knowledge graph embeddings for the entities represented in SemOpenAlex using high-performance computing.

In the following, we first discuss related work (see Sect. 2) and describe the SemOpenAlex ontology and RDF data (see Sect. 3), before presenting the SemOpenAlex entity embeddings (see Sect. 4). Subsequently, we outline existing and potential use cases (see Sect. 5), before we conclude the paper (see Sect. 6).

2 Related Work

A comparison of scholarly RDF datasets is presented in Table 1. The table shows that SemOpenAlex is the only RDF KG that (1) follows the Linked Data Principles, (2) is fully open, (3) contains a vast amount of bibliographic information from all scientific disciplines, and (4) is regularly updated, making it a valuable resource in various contexts (see Sect. 5).

Table 1. Statistical comparison of scholarly RDF datasets.

The OpenAIRE Research Graph provides open and free access to metadata of 145 million publications, datasets, and software via an API, a SPARQL endpoint, and database dumps [16]. However, not only is the number of publications significantly lower than in SemOpenAlex but on May 8, 2023, OpenAIRE stopped its LOD services and closed the SPARQL endpoint.Footnote 3

WikiCiteFootnote 4 has incorporated bibliographic metadata into Wikidata, but SemOpenAlex covers considerably more metadata (e.g., 249M papers vs. 42M), including additional properties such as papers’ abstracts. While using Wikidata as a central KG and regularly importing SemOpenAlex information seems logical, the scalability of the Blazegraph triplestore backend which hosts the Wikidata Query Service is limited, and Wikimedia has announced a plan to delete scholarly articles in case of bulk imports.Footnote 5

AceKG [17] is a database containing 62 million publications, along with academic details related to authors, fields of study, venues, and institutes. AceKG data is modeled in RDF. However, unlike our approach, it does not use existing vocabularies, lacks a publicly available triple store, and does not offer continuous updates. All data is sourced from a company’s database.

OpenCitations focuses on publications and their citation relationships [18]. Specifically, it covers metadata about publications and their citations, but not descriptions of affiliated organizations (institutions) or hosting conferences and journals (venues). OpenCitations includes several datasets, including the OpenCitations Index of Crossref Open DOI-to-DOI Citations (COCI) with 76 million items to date, and smaller datasets such as the OpenCitations Corpus (OCC) and OpenCitations in Context Corpus (CCC) [5].

The Microsoft Academic Knowledge Graph (MAKG) is based on the Microsoft Academic Graph (MAG), containing information on publications, authors, institutions, venues, and concepts [7, 19]. The MAKG has high coverage across scientific domains and has enabled novel use cases. However, it will no longer be updated due to lack of source data [10]. Several analyses have assessed the MAG and MAKG, revealing the need for improvements in areas such as citation accuracy, concept assignment, and disambiguation [20,21,22,23]. Compared to MAKG, SemOpenAlex provides a similar schema but offers fresh data, which is additionally cleaned through the author name disambiguation provided by OpenAlex, and a neater mapping of concepts to papers using the Simple Knowledge Organization System (SKOS) ontology [24, 25].

Further notable scholarly KGs are the DBLP KGFootnote 6 and the Open Research Knowledge Graph (ORKG) [26]. DBLP provides only high-quality metadata about computer science publications, resulting in a coverage of roughly 6 million publications [6]. ORKG is a project that aims to provide a KG infrastructure for semantically capturing and representing the content of research papers [2, 27]. ORKG contains a relatively small set of more than 25,000 publications, but with many RDF statements, indicating considerable semantic richness. Due to their different focuses, SemOpenAlex can complement ORKG as an LOD data source: while SemOpenAlex provides a broad basis of metadata about a massive amount of publications and related entities in RDF (with a focus on high coverage, see Table 2), ORKG focuses on modeling scientific contributions as well as methodology aspects, which are manually curated (with a focus on high data quality and key insights of papers).

3 SemOpenAlex

In the following, we describe the design of the SemOpenAlex ontology (Sect. 3.1) and the process of generating SemOpenAlex data (Sect. 3.2). We also explain how we publish and enable user interaction with the data (Sect. 3.3), and present key statistics of the KG (Sect. 3.4). Furthermore, we evaluate to what extent SemOpenAlex meets linked data set descriptions and rankings (Sect. 3.5).

Table 2. SemOpenAlex entity types and number of instances (as of March 2023).

3.1 Ontology of SemOpenAlex

We developed an ontology following the best practices of ontology engineering, reusing as much existing vocabulary as possible. An overview of the entity types, the object properties, and the data type properties is provided in Fig. 1. Overall, the ontology of SemOpenAlex covers 13 entity types, including the main entity types works, authors, institutions, sources, publishers, and concepts, as well as 87 relation types.

Table 3. Used ontologies, their corresponding prefixes and namespace.
Fig. 1. Ontology of SemOpenAlex.

We reused the vocabularies listed in Table 3. To describe publications, researchers, and institutions, we leveraged established Semantic Publishing and Referencing (SPAR) ontologies [28], such as FaBiO and CiTO. FaBiO is used to describe specific identifiers such as a work’s PubMedID, while CiTO represents citing relationships between works. For bibliographic metadata, such as a work’s publication date and abstract, we used the Dublin Core ontology (DCterms). To represent more generic features and relations, we relied on cross-domain ontologies such as DBpedia and the W3 Organization Ontology (W3 ORG). The works are classified using a concept hierarchy, which we represented in a SKOS vocabulary of 65k SKOS concepts and semantic relations (skos:broader and skos:related). The concepts are further linked with Wikidata entities, allowing for additional interoperability and providing multi-lingual labels.

3.2 Knowledge Graph Creation Process

The raw OpenAlex data was presumably designed for data processing (e.g., abstracts are provided as an inverted index rather than as a single string). To create an RDF KG based on the OpenAlex dump files, major changes in the data formatting and the data modeling are necessary. In the following, we outline the essential steps of this transformation process.

Transformation. We carry out a number of distinct steps for the transformation that can be reproduced via the code in our GitHub repository.Footnote 7

  1. Data Preprocessing: We download the OpenAlex snapshot in compressed .jsonl format from its AWS S3 bucket and use the Python multiprocessing package for efficient parallel processing of the large amount of data. To ensure valid triple generation according to the W3C RDF 1.1 Concepts and Abstract SyntaxFootnote 8 later, we remove problematic characters from literal values, such as non-escaped backslashes in URLs or newlines in publication titles. Additionally, we convert the abstracts, which are included in OpenAlex as an inverted index, to plain text to improve accessibility.

  2. RDF Generation: We transform the preprocessed data from JSON into RDF according to the ontology shown in Fig. 1. For the generation of the triples, we draw on the rdflib Python package,Footnote 9 which offers functionality to handle, process and validate RDF data. During triple serialization, we create a buffer subgraph that is written once a fixed number of statements is reached to reduce the number of I/O operations. In total, we generate 26,401,183,867 RDF triples given the data snapshot as of 2023-03-28.

  3. Compression and Deployment: The RDF data generated for SemOpenAlex takes up 1.7 TB in the TriG formatFootnote 10 when uncompressed. To make the data more manageable, we compress it into .gz archives, resulting in a reduction of over 80% in file size to 232 GB. These compressed files are then imported into the GraphDB triple store and made available for download as an open snapshot. Additionally, we provide a data sample on GitHub.
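The preprocessing and buffering ideas in the steps above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code from the project repository: all function and class names are hypothetical, the inverted index is assumed to map each word to the list of positions at which it occurs, the cleaning rules are only examples of the problematic characters mentioned above, and plain text lines stand in for statements serialized with rdflib.

```python
import io
import re
from typing import Dict, List, TextIO


def abstract_from_inverted_index(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild plain text from a word -> positions mapping (assumed layout)."""
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Join the words in ascending position order.
    return " ".join(word_at[i] for i in sorted(word_at))


def clean_literal(value: str) -> str:
    """Illustrative cleaning: drop backslashes, collapse newlines and whitespace runs."""
    value = value.replace("\\", "")
    return re.sub(r"\s+", " ", value).strip()


class BufferedTripleWriter:
    """Collect serialized statements and write them in batches (hypothetical helper)."""

    def __init__(self, sink: TextIO, buffer_size: int = 10_000) -> None:
        self.sink = sink
        self.buffer_size = buffer_size
        self.writes = 0  # number of write calls issued so far
        self._buffer: List[str] = []

    def add(self, statement: str) -> None:
        self._buffer.append(statement)
        # Write only once the buffer holds a fixed number of statements.
        if len(self._buffer) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.sink.write("".join(self._buffer))
            self.writes += 1
            self._buffer.clear()


# Toy usage with an in-memory sink; a real run would write to a file.
demo_sink = io.StringIO()
demo_writer = BufferedTripleWriter(demo_sink, buffer_size=2)
for line in ("<s> <p> <o1> .\n", "<s> <p> <o2> .\n", "<s> <p> <o3> .\n"):
    demo_writer.add(line)
demo_writer.flush()  # write the remaining statement
```

With a buffer size of two, the three toy statements are written in two batches instead of three separate writes, which is the effect the buffer is meant to achieve at a much larger scale.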

Update Mechanism. To ensure that SemOpenAlex remains up-to-date, we perform the transformation process described earlier on a monthly basis, which involves downloading the latest OpenAlex snapshot. This enables us to observe temporal dynamics in the data, and ensures that SemOpenAlex provides the most recent information available. The updated version of the data is available through all three access points (RDF dump, SPARQL endpoint, and visual interface). The update process is semi-automated and takes approximately five days to complete on an external server instance. We use one AWS instance to provide SemOpenAlex services and one instance to process the next SemOpenAlex release. Changes to SemOpenAlex data resulting from changes in the raw OpenAlex files are tracked using announcements via the OpenAlex mailing list. Several adaptations have been performed in this way in the past.

Fig. 2. Author overview page for A.M. Turing, accessible at https://semopenalex.org/author/A2430569270.

3.3 Data Publishing and User Interaction

Our KG is publicly accessible at https://semopenalex.org/. We utilize the metaphactory knowledge graph platform [29] on top of a GraphDB triple store to deploy the KG. metaphactory serves as a Linked Data publication platform and ensures that the URIs of SemOpenAlex are fully resolvable. The data is published in machine-readable RDF formats as well as human-readable HTML-based templates using content negotiation. Figure 2 displays the page for the URI https://semopenalex.org/author/A2430569270.
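Content negotiation means that a client states the representation it prefers in the HTTP Accept header when dereferencing a URI. The following Python sketch only constructs such requests and does not send them. The helper name is hypothetical, and the assumption that the server answers `text/turtle` with a Turtle serialization is ours; the supported media types should be checked against the service itself.

```python
from urllib.request import Request

# Resolvable entity URI taken from the text above.
AUTHOR_URI = "https://semopenalex.org/author/A2430569270"


def build_negotiated_request(uri: str, mime_type: str = "text/turtle") -> Request:
    """Prepare a GET request that asks for a specific representation (hypothetical helper)."""
    return Request(uri, headers={"Accept": mime_type})


# Ask for machine-readable RDF (the media type is an assumption) ...
rdf_request = build_negotiated_request(AUTHOR_URI)
# ... or for the human-readable HTML page.
html_request = build_negotiated_request(AUTHOR_URI, "text/html")

# Sending is left to the caller, e.g. urllib.request.urlopen(rdf_request),
# which requires network access.
```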

Among other features, the interface provided for SemOpenAlex enables users to: (1) access SemOpenAlex through a search interface with filtering options; (2) visualize arbitrarily large sub-graphs for objects and relations of interest; (3) formulate and execute SPARQL queries to retrieve objects from the graph using a provided SPARQL endpoint; (4) examine the ontology of SemOpenAlex; (5) obtain key statistics for each object in SemOpenAlex in a dashboard, as shown in the screenshot in Fig. 2; (6) assess the underlying multi-level concept hierarchy; and (7) interact with further linked entities such as co-authors or concepts and access external resources such as links to Wikidata.

3.4 Key Statistics of SemOpenAlex and Example SPARQL Queries

Fig. 3. Number of publications published in machine learning and natural language processing by researchers from Karlsruhe Institute of Technology.

Table 4. Number of institutions for the countries with the most institutions.

Fig. 4. Distribution of institution types.

In this subsection, we present several statistics that we generated based on queries using our SPARQL endpoint. We provide the queries on GitHub.

Figure 3 shows the number of papers published in the fields of machine learning and natural language processing by researchers from Karlsruhe Institute of Technology from 2000 to 2021. While the number of machine learning papers rose sharply from 2015 onward, the number of papers in the field of natural language processing increased at a rather constant rate. SemOpenAlex enables institutions to create such relevant key figures and trends in the context of strategic controlling in a simple and cost-free way.

SemOpenAlex covers the worldwide scientific landscape and contains publications from institutions around the globe. In total, institutions from 225 different countries are included. The 8 countries with the highest number of institutions are shown in Table 4.

By distinguishing between eight types of institutions, SemOpenAlex enables differentiated data analyses. Figure 4 shows the distribution of the 108,618 unique institutions across the different types. The majority of the organizations are companies, followed by educational and nonprofit institutions.

Listing 1 shows an example of how SemOpenAlex can be queried with SPARQL. This query retrieves the top 100 most cited papers in the field of semantic web, along with their citation counts and first authors. It is worth noting that this query cannot be executed on other scholarly KGs like the MAKG, as they do not cover information about the author’s position for a given paper.

Listing 1. Example SPARQL query against SemOpenAlex.
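Independently of the specific query, a request to the public endpoint can be prepared with the Python standard library. The sketch below assumes that the endpoint accepts the query as a `query` parameter of a GET request and returns JSON results for the `application/sparql-results+json` media type, as described in the SPARQL 1.1 Protocol. The helper name is hypothetical, and the query deliberately uses no SemOpenAlex vocabulary terms: it only lists properties of the author URI mentioned earlier.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

ENDPOINT = "https://semopenalex.org/sparql"

# Vocabulary-agnostic query: list up to 10 property/value pairs of one entity.
QUERY = """
SELECT ?property ?value
WHERE {
  <https://semopenalex.org/author/A2430569270> ?property ?value .
}
LIMIT 10
""".strip()


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Encode the query as a GET parameter and ask for JSON results (hypothetical helper)."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)

# Sanity check: the query text survives URL encoding unchanged.
round_trip = parse_qs(urlsplit(request.full_url).query)["query"][0]

# Sending is left to the caller, e.g. urllib.request.urlopen(request),
# which requires network access.
```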

3.5 Linked Data Set Descriptions and Ratings

Following the licensing model of the underlying OpenAlex data,Footnote 11 we provide all SemOpenAlex data under the CC0 license, which grants users the right to freely build upon, enhance, and reuse the works for any purpose without restriction, paving the way for other researchers and software engineers to build upon SemOpenAlex in any context. The RDF data files are available for unrestricted and free download as they are hosted with the AWS Open Data program.Footnote 12

We can categorize SemOpenAlex according to the two kinds of 5-star rating schemes in the Linked Data context:

  • Tim Berners-Lee’s 5-star deployment scheme for Open DataFootnote 13: Our SemOpenAlex RDF dataset is a 5-star data set according to this scheme, because we provide our data in RDF (leading to 4 stars) and link (1) our entity URIs to Wikidata, Wikipedia, and the MAKG and (2) our vocabulary URIs to other vocabularies (leading to 5 stars).

  • Linked Data vocabulary star rating [30]: This rating is intended to rate the use of vocabulary within Linked (Open) Data. By providing a Turtle file and by linking our vocabulary to other vocabularies (see the SPAR ontologies), we are able to provide the vocabulary with 4 stars.

Aside from the SemOpenAlex RDF documents, we provide the following linked data set descriptions (all available at https://semopenalex.org/):

  • Turtle: We provide our ontology as a Turtle file describing the used classes, object properties, and data type properties.

  • VoID: We provide a VoID file to describe our linked data set with an RDF schema vocabulary.

4 Graph Embeddings for SemOpenAlex

Apart from creating and providing the SemOpenAlex data set and services (e.g., the SPARQL endpoint), we computed embeddings for all SemOpenAlex entities. Entity embeddings have proven to be useful as implicit knowledge representations in a variety of scenarios, as we describe in Sect. 5. Based on the SemOpenAlex data in RDF, we trained entity embeddings based on several state-of-the-art embedding techniques and compared the performance of the respective results with regard to link prediction tasks. Specifically, we applied the following approaches: TransE [31], DistMult [32], ComplEx [33], a GraphSAGE neural network [34], and a graph attention network [35]. To address the nontrivial challenges associated with training on SemOpenAlex as a very large knowledge graph, we employed the Marius framework [36]. MariusFootnote 14 is designed to optimize resource utilization by pipelining hard disk, CPU, and GPU memory during training, thereby reducing idle times. In our evaluation, we opted for a configuration of 100 embedding dimensions, a batch size of 16,000, and trained for 3 epochs on a high-performance computing system (bwUniCluster 2.0) using Python 3.7, Marius 0.0.2, PyTorch 1.9.1, and CUDA 11.2.2. These parameters are in line with previous research on large-scale entity embeddings [24].
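To make the difference between the three shallow embedding models concrete, the following Python sketch writes out their textbook scoring functions for a single triple (head h, relation r, tail t). It is an illustration only and not the Marius implementation: the function names are hypothetical, the L2 norm for TransE is an assumption (the L1 norm is also common), and the vectors in practice have 100 dimensions rather than the toy sizes used here.

```python
import math
from typing import Sequence


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """TransE: negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """DistMult: trilinear product, i.e. the sum over h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h: Sequence[complex], r: Sequence[complex], t: Sequence[complex]) -> float:
    """ComplEx: real part of the trilinear product with the conjugated tail."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real
```

DistMult is symmetric in h and t, so it assigns the same score to a triple and its inverse, whereas TransE and ComplEx can distinguish the direction of a relation.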

The computational effort required for the different embedding techniques varied, with the GraphSAGE and the graph attention network approaches requiring the most memory. These methods used up to 716 GB of CPU RAM and took the longest time to train, with each epoch taking roughly 24 h. Despite the resource-intensive nature of the GraphSAGE and graph attention network approaches, DistMult yielded the highest mean reciprocal rank (MRR) score in our link prediction evaluation (see all evaluation results on GitHub). Therefore, we provide the DistMult-based embedding vectors for all entities online.Footnote 15
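For reference, the mean reciprocal rank averages 1/rank of the correct entity over all test triples, so a value of 1.0 would mean the correct entity is always ranked first. The Python sketch below shows the computation on toy values; the function names are hypothetical and the numbers are not taken from the evaluation.

```python
from typing import Sequence


def rank_of_true_candidate(scores: Sequence[float], true_index: int) -> int:
    """Rank of the correct candidate: 1 plus the number of strictly higher scores."""
    true_score = scores[true_index]
    return 1 + sum(1 for score in scores if score > true_score)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all evaluated triples (ranks start at 1)."""
    if not ranks:
        raise ValueError("at least one rank is required")
    return sum(1.0 / rank for rank in ranks) / len(ranks)
```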

5 Use Cases of SemOpenAlex

Scholarly KGs have proven to be a valuable data basis for various use cases and scenarios, such as analyzing research dynamics between academia and industry [37], scientific impact quantification [14, 38], and linking research data sets to publications [39]. This is also reflected in the high number of citations of the reference publications of the MAG [40] and MAKG [7].Footnote 16 In the following, we focus on existing and potential use cases of SemOpenAlex.

Scholarly Big Data Analytics and Large-Scale Scientific Impact Quantification. SemOpenAlex can serve for scientific impact quantification and innovation management. For instance, OpenAlex has been utilized as a comprehensive and reliable data source to rank researchers and institutions worldwide on research.com.Footnote 17 InnoGraph is a new project that leverages OpenAlex to represent innovation ecosystems as a KG for innovation management and forecasting [41]. By using SemOpenAlex as underlying database for such projects and efforts, the need to deal with cumbersome data integration issues can be reduced. Currently, universities such as KIT rely on paid scholarly services like those from Springer Nature for measuring their performance and ranking as a university [42]. However, in the future, these institutions can use SemOpenAlex as a free database to run analytics and evaluations on all relevant publications and associated entities.

Scholarly Search and Recommender Systems. Recommendation systems – both content-based and collaborative filtering-based – have become increasingly important in academia to help scientists navigate the overwhelming amount of available information resulting from the exponential increase in the number of publications. In this paper, we provide entity embeddings for nearly all existing entities in the scientific landscape, which can be used directly to build state-of-the-art recommender systems. These systems can recommend items such as papers to read and cite, as well as venues and collaborators [43]. SemOpenAlex can be utilized to make these recommendations explainable, as symbolic information from the KG can be shown to the user. Due to SemOpenAlex’s rich ontology, including various entity types, SemOpenAlex can serve as a realistic dataset for training and evaluating state-of-the-art graph neural networks designed for heterogeneous information networks and with a specific focus on scalability and semantics. Moreover, our rich KG can be utilized to provide recommendations in complex scenarios, such as finding the optimal consortium for large, possibly interdisciplinary research projects. In the context of semantic search, SemOpenAlex can be used for entity linking, annotating scientific texts [44] or tables [45] for enhanced search capabilities.
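One simple way to use entity embeddings for recommendation is a nearest-neighbour lookup. The Python sketch below ranks entities by cosine similarity to a query entity. It is a toy illustration under our own assumptions: the function names and the identifiers `W1` to `W3` are hypothetical, the two-dimensional vectors are made up, and cosine similarity is one possible choice of similarity measure rather than a method prescribed by the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query_id: str, embeddings: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k entities most similar to the query entity, excluding the query itself."""
    query_vector = embeddings[query_id]
    scored = [
        (entity_id, cosine_similarity(query_vector, vector))
        for entity_id, vector in embeddings.items()
        if entity_id != query_id
    ]
    # Sort by similarity, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors keyed by hypothetical identifiers.
toy_embeddings = {"W1": [1.0, 0.0], "W2": [0.9, 0.1], "W3": [0.0, 1.0]}
recommendations = top_k_similar("W1", toy_embeddings, k=2)
```

At the scale of SemOpenAlex, an exhaustive scan like this would be replaced by an approximate nearest-neighbour index, but the ranking principle stays the same.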

Semantic Scientific Publishing. SemOpenAlex is a part of the Linked Open Data Cloud and contains links to other data sources such as Wikidata, Wikipedia, and MAKG. As a result, it significantly contributes to the use of linked data in areas such as digital libraries and information retrieval [46]. SemOpenAlex has a unique selling point among available scientific knowledge graphs, with its coverage of publications worldwide and across all scientific disciplines, totaling around 250 million publications (see Table 2), and its regular updates. SemOpenAlex can serve as a central catalog for publications, researchers, and research artifacts, to which other data repositories and KGs can link. This creates an opportunity to use SemOpenAlex as a basis for modeling scientific artifacts, such as datasets, scientific methods, and AI models, and thus beyond SemOpenAlex’ current scope. This information may be modeled in separate, interlinked KGs or as part of SemOpenAlex in the future. For instance, the Data Set Knowledge Graph [39], which currently links 600,000 publications in which datasets are mentioned to the MAKG, can now link datasets to papers in SemOpenAlex. Similarly, semantic representations of datasets and scientific methods [47], as well as representations of scientific facts and claims mentioned in full-text articles [48], can be linked to publications and authors in SemOpenAlex to provide rich context information as explanations of academic recommender systems. Furthermore, links between SemOpenAlex and KGs modeling AI models and their energy consumption, such as the Green AI Knowledge Graph [49], can be used to combine previously isolated data for performing complex analytics. In this way, questions of strategic controlling, such as “How green are the AI models developed at my institution?” [49], can be automatically answered. 
Finally, it makes sense to link full-text paper collections to SemOpenAlex, for instance, to leverage its concept schema, since SemOpenAlex applies concept tags to all its papers published globally and across all scientific fields. An excellent example of an existing paper collection linked to SemOpenAlex is unarXive 2022 [50], sourced from two million arXiv papers.

Research Project Management and Modeling. KGs have become increasingly important in supporting research projects by providing a structured representation of various research entities and their relationships [51]. These project-specific KGs encapsulate a diverse range of research entities, such as topics, methods, tasks, materials, organizations, researchers, their skills, interests, and activities, as well as research outputs and project outcomes. To facilitate the development and support of KGs for research projects, SemOpenAlex serves as a knowledge hub by providing existing data on project participants and relevant research. Researchers can use tools and vocabularies provided by the Competency Management Ontology [52] to seamlessly describe their skills, current research interests, and activities in terms of the entities already contained in SemOpenAlex. Moreover, SemOpenAlex’s concept hierarchy allows for the construction of ontologies for specific research domains, streamlining research tasks such as performing a state-of-the-art analysis for a research area. Existing resources from SemOpenAlex can be integrated into KG-based project bibliographies, enhancing collaboration between researchers through resource sharing.

SemOpenAlex has already been used to provide a comprehensive and structured overview of research projects. In particular, personalized dashboards have been created by metaphacts that display recently added publications from SemOpenAlex that are relevant to the current research context. Newly created resources within a project, such as research papers and datasets, can also be described and linked to SemOpenAlex. Ultimately, published results become a valuable part of SemOpenAlex.

Groundwork for Scientific Publishing in the Future. One can envision that the working style of researchers will considerably change in the next few decades [53, 54]. For instance, publications might no longer be published in PDF format, but either in an annotated version of it (with information about the claims, the used methods, the data sets, the evaluation results, and so on) or in a flexible publication form in which authors can change the content and, in particular, citations, over time. SemOpenAlex can be easily combined with such new data sets due to its structure in RDF. Furthermore, ORKG is an ongoing effort that targets the semantic representation of papers and their scientific contributions. We argue that SemOpenAlex can be used as a data basis for ORKG in the sense that, with SemOpenAlex, users do not need to first create papers and authors in the ORKG, but can directly import or link the corresponding information from SemOpenAlex, which has its focus on being a comprehensive KG covering all scientific publications worldwide.

Knowledge-Guided Language Models. Large language models, including ChatGPT and GPT-4, have been criticized for their lack of explainability and their failure to provide reliable in-text citations to reference literature. Often, when citations are provided, they are incorrect and reflect “hallucinations”. In this context, SemOpenAlex represents a valuable repository for guiding language models in providing reliable references to scientific literature and as a basis for text-editing generative models. With metadata of 250 million scientific works, SemOpenAlex can serve as a valuable resource for source attribution and improving the accuracy and quality of scientific writing generated by these models.

Benchmarking. SemOpenAlex is a prime example of big data, fulfilling the “4 V’s” criteria: it is very large, with a wide variety of information types (including papers, authors, institutions, venues, and various data formats), contains uncertainties, and is updated periodically. This makes it suitable for benchmarking systems and approaches, particularly in the context of querying large, realistic KGs [55]. In fact, the MAKG has already been used for this purpose [56] and we expect SemOpenAlex to follow suit.

6 Conclusions

In this paper, we presented a comprehensive RDF dataset with over 26 billion triples covering scholarly data across all scientific disciplines. We outlined the creation process of this dataset and discussed its characteristics. Our dataset supports complex analyses through SPARQL querying. By making the SPARQL endpoint publicly available and the URIs resolvable, we enriched the Linked Open Data cloud with a valuable source of information in the field of academic publishing. We offer RDF dumps, linked dataset descriptions, a SPARQL endpoint, and trained entity embeddings online at https://semopenalex.org/. In the future, we plan to incorporate metadata about funding programs to enable in-depth and comprehensive evaluations of funding lines of governments and institutions [51, 57, 58].