Short and informal documents: a probabilistic model for description enrichment

Authors:
Yuval Merhav;Ophir Frieder
Affiliations:
Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois;Information Retrieval Lab, Computer Science Department, Illinois Institute of Technology, Chicago, Illinois
Venue:
NGITS'09 Proceedings of the 7th international conference on Next generation information technologies and systems
Year:
2009


Abstract

While lexical statistics of formal text play a central role in many statistical Natural Language Processing (NLP) and Information Retrieval (IR) tasks, little is known about the lexical statistics of short, informal documents. To learn the unique characteristics of informal text, we conduct an N-gram study on peer-to-peer (P2P) data and present the insights, problems, and differences from formal text. Based on these findings, we apply a probabilistic model for detecting and correcting spelling problems (not necessarily errors) and propose an enrichment method that makes many P2P files more accessible to relevant user queries. Our enrichment results show an improvement in both recall and precision with only a slight increase in the collection size.
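
As a rough illustration of what an N-gram study over short descriptions involves, the sketch below counts unigrams and bigrams in a handful of file descriptions. It is a minimal sketch under stated assumptions, not the authors' procedure: the sample descriptions, the whitespace tokenization, the lowercasing, and the function names `ngrams` and `count_ngrams` are all hypothetical and do not come from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(descriptions, n):
    """Count n-grams across a collection of short text descriptions.

    Tokenization here is plain lowercasing plus whitespace splitting,
    which is an assumption made for illustration only.
    """
    counts = Counter()
    for text in descriptions:
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical file descriptions, invented for this example.
descriptions = [
    "beatles yesterday live",
    "beatles yesterday remix",
    "beetles yesterday live",
]

unigrams = count_ngrams(descriptions, 1)
bigrams = count_ngrams(descriptions, 2)

# List the most frequent bigrams first.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
```

In counts like these, a rare variant that appears in the same contexts as a frequent term (here "beetles" versus "beatles") is the sort of candidate a spelling-problem detector could examine; how the paper's probabilistic model actually scores such candidates is not specified in the abstract.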