There is a newer version of the record available.

Published June 6, 2022 | Version 0.0.0
Dataset Open

TweetNERD - End to End Entity Linking Benchmark for Tweets

Description

Paper - Video - NeurIPS Page

This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets, accepted to the Datasets and Benchmarks Track of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS).

Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets spanning 2010-2021, for benchmarking NERD systems on Tweets. It is the largest and most temporally diverse open-source benchmark dataset for NERD on Tweets and can be used to facilitate research in this area.

The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The license only applies to the data files present in this dataset. See Data usage policy below.

Check out more details at https://github.com/twitter-research/TweetNERD

Usage

We provide the dataset split across the following tab-separated files:

  • OOD.public.tsv: OOD split of the data in the paper.
  • Academic.public.tsv: Academic split of the data described in the paper.
  • part_*.public.tsv: Remaining data split into parts in no particular order.

Each file is tab separated and has the following format:

tweet_id phrase start end entityId score
22 twttr 20 25 Q918 3
21 twttr 20 25 Q918 3
1457198399032287235 Diwali 30 38 Q10244 3
1232456079247736833 NO_PHRASE -1 -1 NO_ENTITY -1

For Tweets that have no entity, the phrase, start, end, entityId, and score columns are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1, respectively.
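As an illustration of the format above, the following minimal sketch reads annotation rows with Python's standard `csv` module and maps the missing-value markers to `None`. It assumes each file starts with the header row shown in the example; the function name `read_annotations` and the in-memory `SAMPLE` string are hypothetical stand-ins for a real file.

```python
import csv
import io

# Rows copied from the format example above, joined with real tabs.
# SAMPLE stands in for an opened .tsv file (assumption: header row present).
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "1457198399032287235\tDiwali\t30\t38\tQ10244\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)


def read_annotations(handle):
    """Yield one dict per row; rows for Tweets without entities get None fields."""
    # QUOTE_NONE keeps quote characters inside phrases as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        has_entity = row["phrase"] != "NO_PHRASE"
        yield {
            "tweet_id": row["tweet_id"],  # kept as a string, as in the column table
            "phrase": row["phrase"] if has_entity else None,
            "start": int(row["start"]) if has_entity else None,
            "end": int(row["end"]) if has_entity else None,
            "entityId": row["entityId"] if has_entity else None,
            "score": int(row["score"]) if has_entity else None,
        }


rows = list(read_annotations(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

For a real file, the same function could be called on `open("OOD.public.tsv", encoding="utf-8", newline="")` in place of the `io.StringIO` object.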

Description of file columns is as follows:

| Column | Type | Missing Value | Description |
|---|---|---|---|
| tweet_id | string | | ID of the Tweet |
| phrase | string | NO_PHRASE | entity phrase |
| start | int | -1 | start offset of the phrase in text using UTF-16BE encoding |
| end | int | -1 | end offset of the phrase in the text using UTF-16BE encoding |
| entityId | string | NO_ENTITY | Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918 |
| score | int | -1 | Number of annotators who agreed on the phrase, start, end, entityId information |
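Because the start and end offsets refer to the UTF-16BE encoding, they can differ from Python string indices whenever the text contains characters outside the Basic Multilingual Plane, such as many emoji. The sketch below shows one way to recover a phrase from such offsets. It assumes the offsets count UTF-16 code units and that end is exclusive (consistent with the `twttr 20 25` example row, where the span length equals the phrase length); the helper `utf16_span` and the sample text are hypothetical and not taken from the dataset.

```python
def utf16_span(text: str, start: int, end: int) -> str:
    """Return the substring covering UTF-16 code units [start, end)."""
    encoded = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    return encoded[2 * start : 2 * end].decode("utf-16-be")


# Hypothetical Tweet text; the emoji occupies two UTF-16 code units.
text = "\U0001F600 I love Diwali"

print(utf16_span(text, 10, 16))  # Diwali
print(text[10:16])               # iwali (plain str slicing is off by one here)
```

Slicing the encoded bytes avoids building an explicit offset map, at the cost of re-encoding the text for every phrase.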

To use the dataset, take the IDs in the tweet_id column and retrieve the corresponding Tweet text through the Twitter API (see the Data usage policy section below).
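Retrieving Tweet text usually means looking up IDs in batches. The following sketch only builds the request URLs and sends nothing over the network. It assumes the Twitter API v2 Tweet lookup endpoint (`https://api.twitter.com/2/tweets` with an `ids` query parameter) and a limit of 100 IDs per request; both should be checked against the current API documentation. Authentication is not shown, and the function name `lookup_urls` is hypothetical.

```python
from urllib.parse import urlencode

# Assumptions, not stated in this README: endpoint and per-request ID limit.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
BATCH_SIZE = 100


def lookup_urls(tweet_ids, batch_size=BATCH_SIZE):
    """Yield one lookup URL per batch of unique Tweet IDs."""
    # A Tweet with several entities appears on several rows; dedupe, keep order.
    unique_ids = list(dict.fromkeys(tweet_ids))
    for i in range(0, len(unique_ids), batch_size):
        batch = unique_ids[i : i + batch_size]
        yield LOOKUP_URL + "?" + urlencode({"ids": ",".join(batch)}, safe=",")


# Hypothetical IDs for illustration: 250 unique values plus one duplicate.
ids = [str(n) for n in range(250)] + ["0"]
urls = list(lookup_urls(ids))
print(len(urls))      # 3
print(urls[2][:60])
```

Each URL would then be requested with credentials obtained under the Developer Terms, and the returned text joined back to the annotation rows on tweet_id.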

Data stats

| Split | Number of rows | Number of unique Tweets |
|---|---|---|
| OOD | 34102 | 25000 |
| Academic | 51685 | 30119 |
| part_0 | 11830 | 10000 |
| part_1 | 35681 | 25799 |
| part_2 | 34256 | 25000 |
| part_3 | 36478 | 25000 |
| part_4 | 37518 | 24999 |
| part_5 | 36626 | 25000 |
| part_6 | 34001 | 24984 |
| part_7 | 34125 | 24981 |
| part_8 | 32556 | 25000 |
| part_9 | 32657 | 25000 |
| part_10 | 32442 | 25000 |
| part_11 | 32033 | 24972 |
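As a quick consistency check, the per-split counts above can be summed and compared with the "340K+ Tweets" figure in the description. The numbers in this sketch are copied from the table. The total of unique Tweets assumes that no Tweet appears in more than one file, which this README does not state.

```python
# (rows, unique Tweets) per split, copied from the table above.
STATS = {
    "OOD": (34102, 25000),
    "Academic": (51685, 30119),
    "part_0": (11830, 10000),
    "part_1": (35681, 25799),
    "part_2": (34256, 25000),
    "part_3": (36478, 25000),
    "part_4": (37518, 24999),
    "part_5": (36626, 25000),
    "part_6": (34001, 24984),
    "part_7": (34125, 24981),
    "part_8": (32556, 25000),
    "part_9": (32657, 25000),
    "part_10": (32442, 25000),
    "part_11": (32033, 24972),
}

total_rows = sum(rows for rows, _ in STATS.values())
# Assumption: splits are disjoint, so per-file unique counts can be added.
total_unique = sum(unique for _, unique in STATS.values())

print(total_rows)    # 475990
print(total_unique)  # 340854
```

Under that assumption the files cover 340,854 Tweets, which matches the 340K+ figure.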

Data usage policy

Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.

Please cite the following if you use TweetNERD in your paper:

@dataset{TweetNERD_Zenodo_2022_6617192,
  author       = {Mishra, Shubhanshu and
                  Saini, Aman and
                  Makki, Raheleh and
                  Mehta, Sneha and
                  Haghighi, Aria and
                  Mollahosseini, Ali},
  title        = {{TweetNERD - End to End Entity Linking Benchmark 
                   for Tweets}},
  month        = jun,
  year         = 2022,
  note         = {{Data usage policy  Use of this dataset is subject 
                   to you obtaining lawful access to the [Twitter
                   API](https://developer.twitter.com/en/docs
                   /twitter-api), which requires you to agree to the
                   [Developer Terms Policies and
                   Agreements](https://developer.twitter.com/en
                   /developer-terms/).}},
  publisher    = {Zenodo},
  version      = {0.0.0},
  doi          = {10.5281/zenodo.6617192},
  url          = {https://doi.org/10.5281/zenodo.6617192}
}
@inproceedings{TweetNERDNeurips2022,
 author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
 volume = {2},
 year = {2022},
 eprint = {arXiv:2210.08129},
 doi = {10.48550/arXiv.2210.08129}
}

Notes

Data usage policy: Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).

Files (22.7 MB)

Additional details

Related works

Is cited by
Preprint: 10.48550/arXiv.2210.08129 (DOI)