Resources for Text, Speech and Language Processing

Tagged datasets for named entity recognition tasks

1999 Information Extraction -- Entity Recognition Evaluation
Notes: This dataset is apparently in the public domain.
MUC-3 and MUC-4 datasets
Notes: These datasets are apparently in the public domain.
Language-Independent Named Entity Recognition at CoNLL-2003
Notes: This dataset is a manual annotation of a subset of RCV1 (Reuters Corpus Volume 1). The annotations themselves are available free of charge from the CoNLL site, subject to a licensing agreement. The raw text of the RCV1 documents must be requested separately from NIST, also free of charge and subject to a licensing agreement.
Message Understanding Conference (MUC) 6
Notes: Consult the LDC Web site for current pricing and usage agreement.
Message Understanding Conference (MUC) 6 Additional News Text
Notes: Consult the LDC Web site for current pricing and usage agreement.
Message Understanding Conference (MUC) 7
Notes: Consult the LDC Web site for current pricing and usage agreement.
ACE-2 Version 1.0
Notes: Consult the LDC Web site for current pricing and usage agreement.
TIDES Extraction (ACE) 2003 Multilingual Training Data
Notes: Consult the LDC Web site for current pricing and usage agreement.
ACE 2004 Multilingual Training Corpus
Notes: Consult the LDC Web site for current pricing and usage agreement.
Name-Annotated TDT Corpus Supplement for ACE
Notes: Consult the LDC Web site for current pricing and usage agreement.
Enron Email Dataset
Notes: Email messages in this corpus are tagged with person names, dates and times.
A variety of biomedical corpora
Notes: Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names.
Automatic Content Extraction (ACE)
Notes: Homepage of the ACE program.
