Published June 1, 2023 | Version 2.0
Dataset Open

Softcite Dataset Version 2

  • 1. University of Texas at Austin
  • 2. science-miner

Description

This is version 2.0 of the Softcite Dataset, a corpus of currently 4,971 scientific articles with software mention annotations. It is a gold-standard corpus, produced through multi-stage annotation by a team of annotators, followed by reconciliation phases in which curators resolved disagreements. The dataset is available under a CC-BY license.

For the first versions of this dataset, see here and here. This new version covers new annotation iterations on the dataset and, for simplicity, is kept independent of the previous repositories.

The working space for this dataset is at https://github.com/softcite/softcite_dataset_v2

Description of the dataset

The Softcite dataset consists of 4,971 full texts in English (available in Open Access under a CC-BY license), half in Life Sciences and half in Economics, for a total of around 46 million tokens.

Annotations from the previous version of the dataset (v1.0) and this new release (v2.0) are summarized below:

Softcite dataset version    v1.0 (2020)    v2.0 (2023)
------------------------    -----------    -----------
number of documents         4,971          4,971
------------------------    -----------    -----------
software name (total)       4,093          5,134
  - environment                            1,089
  - component                              88
  - implicit                               106
------------------------    -----------    -----------
version                     1,258          1,478
publisher                   1,111          1,311
URL                         172            231
programming language                       71

The additional software mentions were identified in the articles through automatic and manual screening. They were then validated through the usual double-annotation process, with reconciliation in case of disagreement.
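To make the change between the two releases concrete, the short sketch below computes the absolute and relative growth for the annotation fields that exist in both versions. The counts are copied from the table above; the `growth` helper is only an illustrative name and is not part of the dataset tooling.

```python
# Counts copied from the summary table: (v1.0, v2.0).
# Only fields annotated in both releases are listed.
counts = {
    "software name (total)": (4093, 5134),
    "version": (1258, 1478),
    "publisher": (1111, 1311),
    "URL": (172, 231),
}


def growth(v1, v2):
    """Return the number of added annotations and the relative increase in percent."""
    added = v2 - v1
    return added, 100.0 * added / v1


for field, (v1, v2) in counts.items():
    added, pct = growth(v1, v2)
    print(f"{field}: +{added} annotations ({pct:.1f}% more than v1.0)")
```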

The refinement of the type of mentioned software and the encoding of possible relationships between software mentions in the same paragraph were carried out with the same gold-standard annotation approach.

Guidelines

The attached XML annotation guidelines (annotation_guidelines_tei_xml.md) describe the mark-up elements and the annotation principles.

Files

Version 2.0 of the Softcite dataset contains the following resources:

1. The xml/ subdirectory contains the XML-annotated corpus:

  • softcite_corpus-full.tei.xml: the dataset as a corpus, with one TEI entry per document and the paragraphs containing at least one software mention.
  • softcite_corpus-holdout-full.tei.xml: for evaluation purposes, the so-called holdout set, corresponding to 20% of the full texts, with complete text content and software mentions. This holdout set reflects the real distribution of mentions in papers and is therefore appropriate for evaluation.
  • softcite_corpus-working.tei.xml: the subset of the corpus that excludes the 20% of full texts in the holdout set, for evaluation purposes.
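As a starting point for exploring the XML files, the sketch below streams a corpus file and counts documents and annotations by type. It assumes, without confirmation from this description, that each document is a `<TEI>` element and that annotations are encoded as `<rs type="...">` elements; check annotation_guidelines_tei_xml.md for the actual mark-up. The function name `count_annotations` is illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}rs' -> 'rs'."""
    return tag.rsplit("}", 1)[-1]


def count_annotations(source):
    """Count documents and annotations in a TEI corpus.

    `source` is a file path or a binary file-like object.
    Assumption: one <TEI> element per document, and annotations
    encoded as <rs type="..."> elements.
    """
    documents = 0
    types = Counter()
    # Stream the file so the whole corpus is never held in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "rs":
            types[elem.get("type", "unknown")] += 1
        elif name == "TEI":
            documents += 1
            elem.clear()  # release the finished document
    return documents, types


# Possible usage, with the path taken from the file list above:
# documents, types = count_annotations("xml/softcite_corpus-full.tei.xml")
# print(documents, types.most_common())
```

Matching on local names rather than a fixed namespace keeps the sketch usable whether or not the files declare the TEI namespace.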

2. The json/ subdirectory contains JSON conversions of the XML corpus, provided for users more comfortable with JSON than XML. The file names follow those of the XML files above, with a .json extension. The JSON identifies annotation spans by their character offsets within each paragraph, which makes it less readable than the inline XML mark-up.
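Offset-based annotations become readable again once the offsets are applied to the paragraph text. The sketch below shows the general idea only: the keys `type`, `start` and `end`, the end-exclusive convention, and the example sentence are all assumptions made for illustration, so inspect the actual JSON files for the real field names.

```python
def resolve_spans(paragraph_text, spans):
    """Return (type, surface text) pairs for offset-based annotations.

    Assumption: each span is a dict with hypothetical keys 'type',
    'start' and 'end', where offsets are character positions in the
    paragraph and 'end' is exclusive.
    """
    resolved = []
    for span in spans:
        start, end = span["start"], span["end"]
        if not 0 <= start < end <= len(paragraph_text):
            raise ValueError(f"offsets out of range: {start}-{end}")
        resolved.append((span["type"], paragraph_text[start:end]))
    return resolved


# Made-up paragraph and annotations, for illustration only.
paragraph = "Analyses were run in SPSS 22."
annotations = [
    {"type": "software", "start": 21, "end": 25},
    {"type": "version", "start": 26, "end": 28},
]

for label, text in resolve_spans(paragraph, annotations):
    print(f"{label}: {text}")
```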

Script

We include a Python script used to convert the master TEI XML corpus into JSON format under scripts/TEI2LossyJSON.py

Use like this:

> python3 scripts/TEI2LossyJSON.py --tei-file xml/softcite_corpus-full.tei.xml --output json/

Acknowledgement

We thank the Alfred P. Sloan Foundation and NextGenerationEU/France Relance for supporting this work. We also thank our collaborators and student annotators for making this dataset gold-standard and available.

Files

softcite_dataset_v2.zip (25.2 MB)
md5:9a0a83d4ea9de2a8a6cb35ab43c2bea6
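After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_archive` are illustrative, and the local path of the downloaded file is up to the user.

```python
import hashlib

# Checksum as published with softcite_dataset_v2.zip.
EXPECTED_MD5 = "9a0a83d4ea9de2a8a6cb35ab43c2bea6"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Possible usage, assuming the archive was saved in the current directory:
# print(verify_archive("softcite_dataset_v2.zip"))
```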