Towards a "Universal dictionary" for multi-language information retrieval applications

  • Authors:
  • J. Michael Schultz; Mark Y. Liberman

  • Affiliations:
  • University of Pennsylvania; University of Pennsylvania

  • Venue:
  • Topic detection and tracking
  • Year:
  • 2002

Abstract

Multilingual information retrieval tasks such as Topic Tracking have yielded high-quality results using simple word-by-word translation. However, constructing translation dictionaries for new languages is expensive and time-consuming. We show that an appropriate term-selection metric, applied to a monolingual English corpus, lets us define a fairly small list of about 10,000 inflected forms (about 7,500 lemmas) that works essentially as well, in a particular monolingual document classification evaluation, as an unlimited vocabulary of more than 300,000 word forms. We suggest that such a list can form the English axis of a sort of "universal dictionary" for document classification tasks, providing a much more efficient path to adding new languages.
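
The abstract does not say which term-selection metric the authors use, so the following is only a minimal sketch of the general idea: score every term in a monolingual corpus, keep the top k, and restrict documents to that list. The scoring function (document frequency times log inverse document frequency), the function names `select_vocabulary` and `restrict`, and the toy corpus are all assumptions made for illustration, not the method from the paper.

```python
import math
from collections import Counter


def select_vocabulary(docs, k):
    """Return the k highest-scoring terms from a list of tokenized documents.

    The score df * log(N / df) is a placeholder metric (an assumption),
    chosen because it gives zero weight to terms that occur in every
    document and favors terms that recur in some documents but not all.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {t: df * math.log(n_docs / df) for t, df in doc_freq.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def restrict(tokens, vocabulary):
    """Drop every token that is not in the selected vocabulary."""
    return [t for t in tokens if t in vocabulary]


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "the election results were announced".split(),
    "the election campaign ended".split(),
    "the earthquake damaged the city".split(),
    "the earthquake relief began".split(),
    "the market fell sharply".split(),
    "the market recovered".split(),
]

vocab = select_vocabulary(docs, k=3)
reduced = restrict(docs[2], vocab)
```

In this sketch only the terms in `vocab` would need entries in a translation dictionary for a new language, which is the saving the abstract argues for; whether classification quality holds up with the reduced list would still have to be measured, as the authors do in their evaluation.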