The bag-of-repeats representation of documents

Authors:
Matthias Gallé
Affiliations:
Xerox Research Centre Europe, Meylan, France
Venue:
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Year:
2013

Citing 8
Cited 0

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Information Retrieval

Information Retrieval
Modeling annotated data

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Text classification using string kernels

The Journal of Machine Learning Research
Topic modeling: beyond bag-of-words

ICML '06 Proceedings of the 23rd international conference on Machine learning
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Probabilistic topic models

Communications of the ACM
Full and mini-batch clustering of news articles with Star-EM

ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

n-gram representations of documents may improve over a simple bag-of-word representation by relaxing the independence assumption of word and introducing context. However, this comes at a cost of adding features which are non-descriptive, and increasing the dimension of the vector space model exponentially. We present new representations that avoid both pitfalls. They are based on sound theoretical notions of stringology, and can be computed in optimal asymptotic time with algorithms using data structures from the suffix family. While maximal repeats have been used in the past for similar tasks, we show how another equivalence class of repeats -- largest-maximal repeats -- obtain similar or better results, with only a fraction of the features. This class acts as a minimal generative basis of all repeated substrings. We also report their use for topic modeling, showing easier to interpret models.