Short and informal documents: a probabilistic model for description enrichment

  • Authors:
  • Yuval Merhav; Ophir Frieder

  • Affiliations:
  • Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois; Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois

  • Venue:
  • NGITS'09: Proceedings of the 7th International Conference on Next Generation Information Technologies and Systems
  • Year:
  • 2009

Abstract

While lexical statistics of formal text play a central role in many statistical Natural Language Processing (NLP) and Information Retrieval (IR) tasks, little is known about the lexical statistics of short, informal documents. To learn the unique characteristics of informal text, we conduct an N-gram study on peer-to-peer (P2P) data and present the insights, problems, and differences from formal text. Consequently, we apply a probabilistic model for detecting and correcting spelling problems (not necessarily errors) and propose an enrichment method that makes many P2P files more accessible to relevant user queries. Our enrichment results show an improvement in both recall and precision with only a slight increase in the collection size.