Multilingual information retrieval tasks such as Topic Tracking have yielded high-quality results using simple word-by-word translation approaches. However, constructing translation dictionaries for new languages is expensive and time-consuming. We show that an appropriate term-selection metric, applied to a monolingual English corpus, allows us to define a fairly small list, containing about 10,000 inflected forms or about 7,500 lemmas, that works essentially as well (in a particular monolingual document classification evaluation) as an unlimited vocabulary of more than 300,000 word forms. We suggest that such a list can form the English axis of a kind of "universal dictionary" for document classification tasks, providing a much more efficient path to adding new languages.
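The idea of restricting documents to a small, metric-selected vocabulary can be illustrated with a minimal sketch. The abstract does not name the term-selection metric, so the score below (how far a term's per-class document frequency deviates from its overall document frequency) is only a hypothetical stand-in, and the function names `select_terms` and `restrict`, the toy documents, and the labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def select_terms(docs, labels, k):
    """Return the k highest-scoring terms under a placeholder metric.

    NOTE: the metric is an assumption, not the one used in the paper.
    """
    df = Counter()                      # document frequency per term
    df_by_class = defaultdict(Counter)  # document frequency per class and term
    class_sizes = Counter(labels)       # number of documents per class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # count each term once per document
            df[term] += 1
            df_by_class[label][term] += 1
    n_docs = len(docs)

    def score(term):
        # Largest gap between a class-specific and the overall document rate.
        overall = df[term] / n_docs
        return max(
            abs(df_by_class[c][term] / class_sizes[c] - overall)
            for c in class_sizes
        )

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(df, key=lambda t: (-score(t), t))[:k]


def restrict(tokens, vocabulary):
    """Keep only the tokens that belong to the selected vocabulary."""
    return [t for t in tokens if t in vocabulary]


# Toy corpus (hypothetical data).
docs = [
    ["election", "vote", "the", "results"],
    ["vote", "the", "candidate"],
    ["earthquake", "the", "rescue"],
    ["rescue", "the", "earthquake", "damage"],
]
labels = ["politics", "politics", "disaster", "disaster"]

vocab = set(select_terms(docs, labels, k=3))
reduced_docs = [restrict(tokens, vocab) for tokens in docs]
```

In this toy setting, a term such as "the", which appears in every document of every class, scores zero and is dropped, while class-specific terms are retained. Under the abstract's proposal, only the entries of such a reduced list would then need translations when a new language is added.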