Classification of research publications based on data from OpenAlex
Description
This data set contains an algorithmic classification of research publications based on data from OpenAlex. The classification is based on the OpenAlex snapshot released on November 21, 2023.
To build the classification, we used the so-called extended direct citation approach in combination with the Leiden algorithm. The source code of our software is available here. The classification covers the 71 million journal articles, proceedings papers, preprints, and book chapters in OpenAlex that were published between 2000 and 2023 and that are connected to each other by citation links. Based on 1715 million citation links, we built a three-level hierarchical classification. Each publication was assigned to a cluster at each of the three levels of the classification. Clusters consist of publications that are relatively strongly connected by citation links and that can therefore be expected to be topically related. At each level of the classification, a publication was assigned to only one cluster, which means clusters do not overlap.
The classification consists of 4521 micro clusters at the lowest (most granular) level, 917 meso clusters at the middle level, and 20 macro clusters at the highest (least granular) level. We also algorithmically linked each cluster in the classification to one or more of the following five broad main fields: biomedical and health sciences, life and earth sciences, mathematics and computer science, physical sciences and engineering, and social sciences and humanities.
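The nesting of the three levels can be checked with a few lines of code. The following is a minimal sketch, assuming that "hierarchical" means every micro cluster belongs to exactly one meso cluster and every meso cluster to exactly one macro cluster. The rows use the column names of clustering.tsv, but the IDs are invented for illustration and the helper `find_inconsistent_parents` is hypothetical, not part of the data set or its software.

```python
from collections import defaultdict


def find_inconsistent_parents(rows, child_col, parent_col):
    """Return child cluster IDs that appear under more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child_col]].add(row[parent_col])
    return sorted(child for child, seen in parents.items() if len(seen) > 1)


# Toy rows with the clustering.tsv column names; all IDs are invented.
toy_rows = [
    {"work_id": "W1", "macro_cluster_id": "1", "meso_cluster_id": "10", "micro_cluster_id": "100"},
    {"work_id": "W2", "macro_cluster_id": "1", "meso_cluster_id": "10", "micro_cluster_id": "101"},
    {"work_id": "W3", "macro_cluster_id": "2", "meso_cluster_id": "20", "micro_cluster_id": "200"},
]

# An empty list means every child cluster has a single parent.
print(find_inconsistent_parents(toy_rows, "micro_cluster_id", "meso_cluster_id"))
print(find_inconsistent_parents(toy_rows, "meso_cluster_id", "macro_cluster_id"))
```

Because each publication has exactly one cluster ID per level, a row-wise pass like this is enough; no overlap handling is needed.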
We used the Updated GPT 3.5 Turbo large language model, developed by OpenAI, to label the 4521 micro clusters at the lowest level in the classification. The source code of our software can be found here.
See this blog post for more information about the classification.
The classification, including the labels of the micro clusters, is available in the following tab-delimited files. The columns of each file are listed below its name.
clustering.tsv
- work_id
- doi
- macro_cluster_id
- meso_cluster_id
- micro_cluster_id
main_field.tsv
- main_field_id
- main_field
macro_cluster.tsv
- macro_cluster_id
- macro_cluster
- n_works
macro_cluster_main_field.tsv
- macro_cluster_id
- main_field_seq
- main_field_id
- weight
- is_primary_main_field
meso_cluster.tsv
- meso_cluster_id
- meso_cluster
- parent_macro_cluster_id
- n_works
meso_cluster_main_field.tsv
- meso_cluster_id
- main_field_seq
- main_field_id
- weight
- is_primary_main_field
meso_cluster_source.tsv
- meso_cluster_id
- source_seq
- source_id
- n_works
micro_cluster.tsv
- micro_cluster_id
- micro_cluster
- short_label
- long_label
- keywords
- summary
- wikipedia_url
- parent_macro_cluster_id
- parent_meso_cluster_id
- n_works
micro_cluster_main_field.tsv
- micro_cluster_id
- main_field_seq
- main_field_id
- weight
- is_primary_main_field
micro_cluster_keyword.tsv
- micro_cluster_id
- keyword_seq
- keyword
micro_cluster_source.tsv
- micro_cluster_id
- source_seq
- source_id
- n_works
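As an illustration of how the files fit together, the sketch below attaches the micro cluster label to each publication by joining clustering.tsv and micro_cluster.tsv on micro_cluster_id. It assumes that each file starts with a header row containing the column names listed above and that fields are not quoted; neither is stated in this description, so check against the actual files. The helpers `read_tsv` and `label_works` are hypothetical, and the sample rows (IDs, DOIs, labels) are invented stand-ins for the real data.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    # QUOTE_NONE treats quote characters as ordinary text (an assumption).
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def label_works(clustering_rows, micro_cluster_rows):
    """Map each work_id to the short_label of its micro cluster."""
    labels = {row["micro_cluster_id"]: row["short_label"] for row in micro_cluster_rows}
    return {row["work_id"]: labels.get(row["micro_cluster_id"]) for row in clustering_rows}


# In-memory stand-ins for clustering.tsv and micro_cluster.tsv (invented values,
# and only a subset of the micro_cluster.tsv columns).
clustering_sample = io.StringIO(
    "work_id\tdoi\tmacro_cluster_id\tmeso_cluster_id\tmicro_cluster_id\n"
    "W1\t10.0000/example.1\t1\t10\t100\n"
    "W2\t10.0000/example.2\t2\t20\t200\n"
)
micro_cluster_sample = io.StringIO(
    "micro_cluster_id\tshort_label\tparent_meso_cluster_id\tn_works\n"
    "100\tExample topic A\t10\t1\n"
    "200\tExample topic B\t20\t1\n"
)

work_labels = label_works(read_tsv(clustering_sample), read_tsv(micro_cluster_sample))
print(work_labels)
```

With the extracted files, the same functions can be called on handles opened with `open(path, newline="", encoding="utf-8")`, where the UTF-8 encoding is again an assumption. Since clustering.tsv covers about 71 million publications, iterating over the `csv.DictReader` row by row is likely a better fit than loading the whole file into a list.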
Files
classification_openalex_2023nov.zip (1.0 GB)
md5:17dfc3d0f9d7f115049af0dae4a72099