Comparison between tagged corpora for the named entity task

  • Authors: Chikashi Nobata; Nigel Collier; Jun'ichi Tsujii
  • Affiliations: Kansai Advanced Research Center, Iwaoka-cho, Nishi-ku, Kobe, Hyogo, Japan; University of Tokyo, Bunkyo-ku, Tokyo, Japan; University of Tokyo, Bunkyo-ku, Tokyo, Japan
  • Venue: WCC '00 Proceedings of the workshop on Comparing corpora - Volume 9
  • Year: 2000

Abstract

We present two measures for comparing corpora, based on information-theoretic statistics such as gain ratio and on simple term-class frequency counts. We tested the predictions these measures make about corpus difficulty in two domains (news and molecular biology) using the results of two widely used paradigms for named entity (NE) recognition, decision trees and HMMs, and found that gain ratio was the more reliable predictor.
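
To make the gain ratio statistic concrete, the following is a minimal sketch of the textbook definition used in decision-tree learning: information gain of the class given a feature, divided by the entropy of the feature itself (the split information). The abstract does not give the exact formulation used in the paper, so the function names, the choice of the token string as the feature, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def gain_ratio(features, classes):
    """Information gain of `classes` given `features`, divided by split information.

    Returns 0.0 when the feature takes a single value (split information is zero).
    """
    total = len(classes)
    # Group the class labels by the feature value they co-occur with.
    by_feature = {}
    for f, c in zip(features, classes):
        by_feature.setdefault(f, []).append(c)
    # Conditional entropy H(C | F), weighted by how often each feature value occurs.
    conditional = sum(len(g) / total * entropy(g) for g in by_feature.values())
    info_gain = entropy(classes) - conditional
    split_info = entropy(features)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy corpus: each token is paired with an illustrative NE class.
tokens = ["Tokyo", "Kobe", "IL-2", "IL-2", "the", "the"]
labels = ["LOCATION", "LOCATION", "PROTEIN", "PROTEIN", "O", "O"]
print(gain_ratio(tokens, labels))
```

Under this definition, a feature that determines the class perfectly and has the same number of values as there are classes scores 1, while a feature that is statistically independent of the class scores 0. Intuitively, a corpus whose features score higher should be easier for a classifier, which is the kind of difficulty prediction the abstract describes.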