Comparison between tagged corpora for the named entity task

Authors:
Chikashi Nobata;Nigel Collier;Jun'ichi Tsujii
Affiliations:
Kansai Advanced Research Center, Iwaoka-cho, Nishi-ku, Kobe, Hyogo, Japan;University of Tokyo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Bunkyo-ku, Tokyo, Japan
Venue:
WCC '00 Proceedings of the workshop on Comparing corpora - Volume 9
Year:
2000

Abstract

We present two measures for comparing corpora, based on information-theoretic statistics such as gain ratio and on simple term-class frequency counts. We tested the predictions these measures make about corpus difficulty in two domains, news and molecular biology, using the results of two widely used paradigms for named entity (NE) recognition, decision trees and HMMs, and found that gain ratio was the more reliable predictor.
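
As a rough illustration of the gain ratio statistic named in the abstract, the sketch below computes it for a single categorical feature (here, the term itself) against NE class labels, using the standard C4.5-style definition: information gain divided by split information. The abstract does not give the authors' exact formulation, so the function names, the base-2 logarithm, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of categorical values."""
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gain_ratio(features, labels):
    """C4.5-style gain ratio of one categorical feature w.r.t. class labels.

    Assumed definition: (H(labels) - H(labels | feature)) / H(feature).
    Returns 0.0 when the feature takes a single value (split info is zero).
    """
    total = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for feature, label in zip(features, labels):
        groups.setdefault(feature, []).append(label)
    # Conditional entropy of the labels given the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(features)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy corpus: each token paired with an invented NE class.
toy_terms = ["kinase", "kinase", "Tokyo", "Tokyo", "cell", "cell"]
toy_classes = ["PROTEIN", "PROTEIN", "LOCATION", "LOCATION", "PROTEIN", "NONE"]

print(gain_ratio(toy_terms, toy_classes))
```

Under this definition, a higher value means that knowing the term tells more about its class relative to how finely the term splits the data, which is the sense in which such a statistic could be read as a proxy for how easy a tagged corpus is.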