Beyond the bag-of-words paradigm to enhance information retrieval applications

Authors: Paolo Ferragina
Affiliations: University of Pisa, Italy
Venue: Proceedings of the Fourth International Conference on SImilarity Search and APplications
Year: 2011

Citing 11
Cited 0

Wikify!: linking documents to encyclopedic knowledge
Proceedings of the sixteenth ACM conference on information and knowledge management

Introduction to Information Retrieval

Web-scale named entity recognition
Proceedings of the 17th ACM conference on Information and knowledge management

Learning to link with wikipedia
Proceedings of the 17th ACM conference on Information and knowledge management

Query by document
Proceedings of the Second ACM International Conference on Web Search and Data Mining

A survey of Web clustering engines
ACM Computing Surveys (CSUR)

Collective annotation of Wikipedia entities in web text
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Wikipedia-based semantic interpretation for natural language processing
Journal of Artificial Intelligence Research

Exploiting internal and external semantics for the clustering of short texts using world knowledge
Proceedings of the 18th ACM conference on Information and knowledge management

TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Improving quality of search results clustering with approximate matrix factorisations
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval

Abstract

The typical IR approach to indexing, clustering, classification and retrieval, to name just a few tasks, is based on the bag-of-words paradigm. It transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array as a point in a high-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions, whatever relations may hold between them. Co-occurrence detection and other processing steps have thus been proposed (see e.g. LSI, spectral analysis [7]) to identify the existence of those relations. Yet the limitations of this approach are well known, especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or advertisement messages.
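As an illustration of the limitation described in the abstract, the following is a minimal sketch of a bag-of-words representation with one common tf-idf variant (raw term frequency times log inverse document frequency). The function names, the whitespace tokenizer and the three sample snippets are assumptions made for this example and do not come from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each text to a sparse {term: weight} bag-of-words vector."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    bags = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(bags)
    # Document frequency: the number of texts in which each term occurs.
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical short texts: the first two are related in meaning but share
# no term; the first and third share only the term "cheap".
snippets = [
    "cheap car rental",
    "inexpensive automobile hire",
    "cheap flight tickets",
]
vectors = tfidf_vectors(snippets)
related_but_disjoint = cosine(vectors[0], vectors[1])
share_one_term = cosine(vectors[0], vectors[2])
```

In this sketch every distinct term is its own dimension, so the first two snippets are orthogonal even though they describe the same need, while the first and third get a positive similarity from a single shared word. Short texts make the effect worse because there are few terms available to overlap.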