The Locally Weighted Bag of Words Framework for Document Representation

Authors:
Guy Lebanon; Yi Mao; Joshua Dillon
Venue:
The Journal of Machine Learning Research
Year:
2007

Cited by 8:

Movie segmentation into scenes and chapters using locally weighted bag of visual words
Proceedings of the ACM International Conference on Image and Video Retrieval

Language pyramid and multi-scale text analysis
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Local space-time smoothing for version controlled documents
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters

Local histograms of character N-grams for authorship attribution
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

Modeling coherence in ESOL learner texts
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Sentiment classification with supervised sequence embedding
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Multimodal late fusion bag of features applied to scene detection
Proceedings of the 19th Brazilian symposium on Multimedia and the web

Persistent homology: an introduction and a new text representation for natural language processing
IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence


Abstract

The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex, thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium- and long-range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks.
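
The idea of local smoothing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the function name `lowbow_curve`, and the parameters `num_points`, `sigma`, and `c` are all assumptions chosen for illustration, since the abstract does not specify them. The sketch maps word positions into [0, 1], computes a kernel-weighted word histogram at several locations, and returns one point of the multinomial simplex per location.

```python
import numpy as np


def lowbow_curve(word_ids, vocab_size, num_points=5, sigma=0.2, c=0.01):
    """Sketch of a locally weighted bag of words curve.

    Returns an array of shape (num_points, vocab_size). Each row is a
    locally weighted word histogram centred at one location in [0, 1].
    The kernel and all parameters are illustrative assumptions.
    """
    word_ids = np.asarray(word_ids)
    n = len(word_ids)

    # Map word positions to the midpoints of n equal cells in [0, 1].
    positions = (np.arange(n) + 0.5) / n

    # Smoothed one-hot rows: each position puts most of its mass on its own
    # word and a small mass c on every word, so all entries stay positive.
    one_hot = np.full((n, vocab_size), c)
    one_hot[np.arange(n), word_ids] += 1.0
    one_hot /= one_hot.sum(axis=1, keepdims=True)

    # Gaussian kernel weights between each location and each word position,
    # renormalised so that the weights for a given location sum to one.
    mus = np.linspace(0.0, 1.0, num_points)
    weights = np.exp(-0.5 * ((mus[:, None] - positions[None, :]) / sigma) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each row is a convex combination of simplex points, so it also lies
    # in the simplex.
    return weights @ one_hot


# Toy document: word 0 dominates the first half, word 1 the second half.
curve = lowbow_curve([0, 0, 0, 1, 1, 1], vocab_size=2, num_points=3, sigma=0.1)
print(curve)
```

In this sketch a very large `sigma` makes every row collapse to the same smoothed bag of words histogram, while a small `sigma` makes each row reflect only nearby words, which is one way to see how the bandwidth trades sequential detail against robustness.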