Published January 24, 2024 | Version 2023nov
Dataset Open

Classification of research publications based on data from OpenAlex

  • 1. Centre for Science and Technology Studies, Leiden University

Description

This data set contains an algorithmic classification of research publications based on data from OpenAlex. The classification is based on the OpenAlex snapshot released on November 21, 2023.

To build the classification, we used the so-called extended direct citation approach in combination with the Leiden algorithm. The source code of our software is available here. The classification covers the 71 million journal articles, proceedings papers, preprints, and book chapters in OpenAlex that were published between 2000 and 2023 and that are connected to each other by citation links. Based on 1715 million citation links, we built a three-level hierarchical classification. Each publication was assigned to a cluster at each of the three levels of the classification. Clusters consist of publications that are relatively strongly connected by citation links and that can therefore be expected to be topically related. At each level of the classification, a publication was assigned to only one cluster, which means clusters do not overlap.

The classification consists of 4521 micro clusters at the lowest (most granular) level, 917 meso clusters at the middle level, and 20 macro clusters at the highest (least granular) level. We also algorithmically linked each cluster in the classification to one or more of the following five broad main fields: biomedical and health sciences, life and earth sciences, mathematics and computer science, physical sciences and engineering, and social sciences and humanities.
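As an illustration of the link between clusters and main fields, the sketch below picks the primary main field of each micro cluster. It uses the column names of micro_cluster_main_field.tsv as listed further down. The sample rows are invented, and the encoding of the is_primary_main_field flag (here "1"/"0") is an assumption that should be checked against the actual file.

```python
import csv
import io

# Invented rows in the column layout of micro_cluster_main_field.tsv.
SAMPLE = (
    "micro_cluster_id\tmain_field_seq\tmain_field_id\tweight\tis_primary_main_field\n"
    "1\t1\t2\t0.7\t1\n"
    "1\t2\t5\t0.3\t0\n"
    "2\t1\t4\t1.0\t1\n"
)

def primary_main_fields(handle):
    """Map each micro_cluster_id to the main_field_id flagged as primary."""
    primary = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: the flag is stored as 1/0 or true/false; adjust if needed.
        flag = row["is_primary_main_field"].strip().lower()
        if flag in {"1", "true", "t"}:
            primary[row["micro_cluster_id"]] = row["main_field_id"]
    return primary

print(primary_main_fields(io.StringIO(SAMPLE)))
```

The same function could be applied to the meso and macro files by swapping the ID column name. IDs are kept as strings here so that no assumption is made about their format.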

We used the Updated GPT 3.5 Turbo large language model, developed by OpenAI, to label the 4521 micro clusters at the lowest level in the classification. The source code of our software can be found here.

See this blog post for more information about the classification.

 

The classification, including the labels of the micro clusters, is available in the following tab-delimited files.

 

clustering.tsv

  • work_id
  • doi
  • macro_cluster_id
  • meso_cluster_id
  • micro_cluster_id

 

main_field.tsv

  • main_field_id
  • main_field

 

macro_cluster.tsv

  • macro_cluster_id
  • macro_cluster
  • n_works

 

macro_cluster_main_field.tsv

  • macro_cluster_id
  • main_field_seq
  • main_field_id
  • weight
  • is_primary_main_field

 

meso_cluster.tsv

  • meso_cluster_id
  • meso_cluster
  • parent_macro_cluster_id
  • n_works

 

meso_cluster_main_field.tsv

  • meso_cluster_id
  • main_field_seq
  • main_field_id
  • weight
  • is_primary_main_field

 

meso_cluster_source.tsv

  • meso_cluster_id
  • source_seq
  • source_id
  • n_works

 

micro_cluster.tsv

  • micro_cluster_id
  • micro_cluster
  • short_label
  • long_label
  • keywords
  • summary
  • wikipedia_url
  • parent_macro_cluster_id
  • parent_meso_cluster_id
  • n_works

 

micro_cluster_main_field.tsv

  • micro_cluster_id
  • main_field_seq
  • main_field_id
  • weight
  • is_primary_main_field

 

micro_cluster_keyword.tsv

  • micro_cluster_id
  • keyword_seq
  • keyword

 

micro_cluster_source.tsv

  • micro_cluster_id
  • source_seq
  • source_id
  • n_works
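The files listed above can be combined through their shared ID columns. The following is a minimal sketch, not the authors' own tooling, that attaches the short_label of each micro cluster to the works in clustering.tsv. The sample rows are invented, only a subset of the micro_cluster.tsv columns is shown, and the assumption that fields are unquoted should be verified against the real files.

```python
import csv
import io

# Invented rows in the column layout of clustering.tsv.
CLUSTERING = (
    "work_id\tdoi\tmacro_cluster_id\tmeso_cluster_id\tmicro_cluster_id\n"
    "100\t10.1000/a\t1\t10\t7\n"
    "101\t10.1000/b\t1\t10\t7\n"
    "102\t\t2\t20\t9\n"
)
# Invented rows with a subset of the micro_cluster.tsv columns.
MICRO = (
    "micro_cluster_id\tshort_label\n"
    "7\tExample label A\n"
    "9\tExample label B\n"
)

def tsv_rows(handle):
    # Assumption: fields are tab-separated and not quoted.
    return csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)

def load_labels(micro_handle):
    """Build a lookup from micro_cluster_id to short_label."""
    return {row["micro_cluster_id"]: row["short_label"]
            for row in tsv_rows(micro_handle)}

def label_works(clustering_handle, labels):
    """Yield (work_id, short_label) pairs, streaming one row at a time."""
    for row in tsv_rows(clustering_handle):
        yield row["work_id"], labels.get(row["micro_cluster_id"])

labels = load_labels(io.StringIO(MICRO))
for work_id, label in label_works(io.StringIO(CLUSTERING), labels):
    print(work_id, label)
```

With the real files, the in-memory samples would be replaced by handles such as open("clustering.tsv", newline="", encoding="utf-8"). Streaming the clustering rows keeps memory use low, which matters because the classification covers 71 million publications.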

Files

classification_openalex_2023nov.zip (1.0 GB)
md5:17dfc3d0f9d7f115049af0dae4a72099
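After downloading, the archive can be checked against the md5 checksum listed above. The sketch below reads the file in chunks so that the 1.0 GB archive does not need to fit in memory. The function name and chunk size are illustrative choices, and the local path is an assumption about where the file was saved.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "17dfc3d0f9d7f115049af0dae4a72099"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, md5_of_file("classification_openalex_2023nov.zip") == EXPECTED_MD5 should evaluate to True for an intact download, assuming the archive sits in the current working directory.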