Towards a "Universal dictionary" for multi-language information retrieval applications

Authors:
J. Michael Schultz; Mark Y. Liberman
Affiliations:
University of Pennsylvania; University of Pennsylvania
Venue:
Topic detection and tracking
Year:
2002

Abstract

Multilingual information retrieval tasks such as Topic Tracking have yielded high-quality results using simple word-by-word translation. However, constructing translation dictionaries for new languages is expensive and time-consuming. We show that an appropriate term-selection metric, applied to a monolingual English corpus, allows us to define a fairly small list of about 10,000 inflected forms (about 7,500 lemmas) that works essentially as well, in a particular monolingual document classification evaluation, as an unlimited vocabulary of more than 300,000 word forms. We suggest that such a list can serve as the English axis of a sort of "universal dictionary" for document classification tasks, providing a much more efficient path to adding new languages.
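
The idea of ranking terms in a monolingual corpus and keeping only a short list can be illustrated with a minimal sketch. The abstract does not name the selection metric, so the score below (collection frequency times inverse document frequency) is an assumed stand-in, not the authors' metric. The function names `select_vocabulary` and `restrict` and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, k):
    """Return the k highest-scoring terms from a list of tokenized documents.

    The score is an assumed placeholder metric (collection frequency
    times inverse document frequency), not the metric used in the paper.
    """
    n_docs = len(docs)
    coll_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for tokens in docs:
        coll_freq.update(tokens)
        doc_freq.update(set(tokens))
    scores = {
        term: coll_freq[term] * math.log(n_docs / doc_freq[term])
        for term in coll_freq
    }
    # Highest score first; ties are broken alphabetically so the result is stable.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


def restrict(tokens, vocab):
    """Keep only the tokens that belong to the selected vocabulary."""
    allowed = set(vocab)
    return [t for t in tokens if t in allowed]


# Toy corpus, invented for illustration only.
docs = [
    "the election results were announced in the capital".split(),
    "the earthquake damaged the capital".split(),
    "the election campaign ended".split(),
]
vocab = select_vocabulary(docs, 10)
reduced = restrict(docs[1], vocab)
```

In this toy corpus the word "the" occurs in every document, so its score is zero and it is the one term dropped when 10 of the 11 distinct terms are kept. A downstream classifier would then see only the restricted token lists, which is the setting in which the abstract compares the short list against the unlimited vocabulary.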