Nerwip Corpus

Version 17 2015-03-04, 13:43
Version 16 2015-03-04, 13:43
Version 15 2015-03-04, 13:42
Version 14 2015-03-04, 13:40
Version 13 2015-03-04, 13:39
Version 12 2015-03-04, 10:14
Version 11 2015-03-04, 10:13
Version 10 2015-03-04, 10:13
Version 9 2015-03-04, 10:13
Version 8 2015-03-04, 10:13
Version 7 2015-03-04, 10:13
Version 6 2015-03-04, 10:12
Version 5 2015-02-26, 19:49
Version 4 2015-02-26, 19:49
Version 3 2015-02-26, 19:46
Version 2 2015-02-26, 18:05
Version 1 2015-01-15, 13:01
dataset
posted on 2015-03-04, 13:43 authored by Vincent Labatut

This corpus contains 408 Wikipedia articles. They are biographies, manually annotated to highlight entities of the following types: Dates, Locations, Organizations and Persons. The corpus was designed for use with our tool Nerwip (https://github.com/CompNet/nerwip), in order to evaluate and compare existing named entity recognition (NER) tools on biographical data.

It was first assembled by Burcu Küpelioglu during her end-of-studies project, and then cleaned and corrected by Samet Atdag during his MSc, for a total of 250 articles (v3). Vincent Labatut then extended it further, to reach 408 articles (v4).

The dataset is shared under a Creative Commons Zero (CC0) license. If you use it, please cite the following article: S. Atdag & V. Labatut, A Comparison of Named Entity Recognition Tools Applied to Biographical Texts, 2013. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6632052&tag=1

The other files contain data related to the NER tools (models, dictionaries, etc.), which Nerwip needs in order to detect entities. If you want to use the tool, you need to unzip these files as explained in the README file associated with Nerwip on GitHub.
