Resources for Text, Speech and Language Processing

Tagged datasets for named entity recognition tasks

  1. 1999 Information Extraction -- Entity Recognition Evaluation
    Notes: This dataset is apparently in the public domain.
  2. MUC-3 and MUC-4 datasets
    Notes: These datasets are apparently in the public domain.
  3. Language-Independent Named Entity Recognition at CoNLL-2003
    Notes: This dataset is a manual annotation of a subset of RCV1 (Reuters Corpus Volume 1). The annotations themselves are available free of charge from the CoNLL site, subject to a licensing agreement. The raw text of the RCV1 documents must be requested separately from NIST, also free of charge and also subject to a licensing agreement.
  4. Message Understanding Conference (MUC) 6
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  5. Message Understanding Conference (MUC) 6 Additional News Text
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  6. Message Understanding Conference (MUC) 7
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  7. ACE-2 Version 1.0
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  8. TIDES Extraction (ACE) 2003 Multilingual Training Data
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  9. ACE 2004 Multilingual Training Corpus
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  10. Name-Annotated TDT Corpus Supplement for ACE
    Notes: Consult the LDC Web site for current pricing and usage agreement.
  11. Enron Email Dataset
    Notes: Email messages in this corpus are tagged with person names, dates and times.
  12. A variety of biomedical corpora
    Notes: Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names.
  13. Automatic Content Extraction (ACE)
    Notes: Homepage of the ACE program.