Comparison between tagged corpora for the named entity task

Authors:
Chikashi Nobata;Nigel Collier;Jun'ichi Tsujii
Affiliations:
Kansai Advanced Research Center, Kobe, Hyogo, Japan;University of Tokyo, Bunkyo-ku, Tokyo, Japan;University of Tokyo, Bunkyo-ku, Tokyo, Japan
Venue:
CompareCorpora '00 Proceedings of the Workshop on Comparing Corpora
Year:
2000

Citing 8

C4.5: programs for machine learning

Information Retrieval

Constructing Biological Knowledge Bases by Extracting Information from Text Sources
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology

A statistical profile of the Named Entity task
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

Nymble: a high-performance learning name-finder
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

An empirical study of smoothing techniques for language modeling
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics

Extracting the names of genes and gene products with a hidden Markov model
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

MUC-5 evaluation metrics
MUC5 '93 Proceedings of the 5th conference on Message understanding

Cited 5

Named entity recognition in biomedical texts using an HMM model
JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications

Bio-medical entity extraction using support vector machines
Artificial Intelligence in Medicine

A novel approach to automatic gazetteer generation using Wikipedia
People's Web '09 Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources

Named entity recognition in Wikipedia
People's Web '09 Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources

Learning multilingual named entity recognition from Wikipedia
Artificial Intelligence

Abstract

We present two measures for comparing corpora: one based on information-theoretic statistics such as gain ratio, and one based on simple term-class frequency counts. We tested the predictions these measures make about corpus difficulty in two domains (news and molecular biology), using the results of two widely used paradigms for named entity (NE) recognition, decision trees and HMMs, and found that gain ratio was the more reliable predictor.
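
To make the gain ratio measure named in the abstract more concrete, the following is a minimal sketch of the standard gain ratio from decision-tree learning (information gain divided by split information), applied to term-class counts. The abstract does not give the authors' exact formulation, so this is an assumption, and the function names `entropy` and `gain_ratio` and the `(term, ne_class)` input format are hypothetical choices for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(pairs):
    """Gain ratio of the term feature with respect to the NE class.

    `pairs` is an iterable of (term, ne_class) tuples, one per tagged token.
    This input format is an assumption made for this sketch.
    """
    pairs = list(pairs)
    n = len(pairs)
    class_counts = Counter(c for _, c in pairs)
    term_counts = Counter(t for t, _ in pairs)
    joint_counts = Counter(pairs)

    # H(C): entropy of the class distribution over the whole corpus.
    h_class = entropy(class_counts.values())

    # H(C | T): class entropy within each term, weighted by term frequency.
    h_cond = 0.0
    for term, t_count in term_counts.items():
        per_class = [cnt for (t, _), cnt in joint_counts.items() if t == term]
        h_cond += (t_count / n) * entropy(per_class)

    # Split information: entropy of the term distribution itself.
    split_info = entropy(term_counts.values())
    if split_info == 0:
        return 0.0  # zero or one distinct term: the ratio is undefined, report 0
    return (h_class - h_cond) / split_info


if __name__ == "__main__":
    # Toy tagged tokens, invented purely for illustration.
    toy = [("kinase", "PROTEIN"), ("kinase", "PROTEIN"),
           ("Tokyo", "LOCATION"), ("Tokyo", "LOCATION")]
    print(gain_ratio(toy))
```

Under this definition, a corpus in which terms strongly determine their classes yields a higher gain ratio, which is one plausible reading of how such a measure could track corpus difficulty.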