Tagged datasets for named entity recognition tasks
1999 Information Extraction -- Entity Recognition Evaluation
Notes: This dataset is apparently in the public domain.
MUC-3 and MUC-4 datasets
Notes: These datasets are apparently in the public domain.
Language-Independent Named Entity Recognition at CoNLL-2003
Notes: This dataset is a manual annotation of a subset of RCV1 (Reuters Corpus Volume 1).
The annotation itself is available free of charge (subject to a licensing agreement)
from the CoNLL site. The raw text of the RCV1 documents must be requested from NIST
(also free of charge and also subject to a licensing agreement).
Message Understanding Conference (MUC) 6
Notes: Consult the LDC Web site for current pricing and usage agreement.
Message Understanding Conference (MUC) 6 Additional News Text
Notes: Consult the LDC Web site for current pricing and usage agreement.
Message Understanding Conference (MUC) 7
Notes: Consult the LDC Web site for current pricing and usage agreement.
ACE-2 Version 1.0
Notes: Consult the LDC Web site for current pricing and usage agreement.
TIDES Extraction (ACE) 2003 Multilingual Training Data
Notes: Consult the LDC Web site for current pricing and usage agreement.
ACE 2004 Multilingual Training Corpus
Notes: Consult the LDC Web site for current pricing and usage agreement.
Name-Annotated TDT Corpus Supplement for ACE
Notes: Consult the LDC Web site for current pricing and usage agreement.
Enron Email Dataset
Notes: Email messages in this corpus are tagged with person names, dates and times.
A variety of biomedical corpora
Notes: Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names.
Automatic Content Extraction (ACE)
Notes: Homepage of the ACE program.