Classifying with co-stems: a new representation for information filtering

Authors:
Nedim Lipka;Benno Stein
Affiliations:
Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany
Venue:
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Year:
2011

Citing 13
Cited 0

Another stemmer

ACM SIGIR Forum
Viewing morphology as an inference process

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Information Filtering: Overview of Issues, Research and Systems

User Modeling and User-Adapted Interaction
Authorship verification as a one-class classification problem

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Thumbs up?: sentiment classification using machine learning techniques

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Creating, destroying, and restoring value in wikipedia

Proceedings of the 2007 international ACM conference on Supporting group work
A survey of modern authorship attribution methods

Journal of the American Society for Information Science and Technology
A survey of learning-based techniques of email spam filtering

Artificial Intelligence Review
Identifying featured articles in wikipedia: writing style matters

Proceedings of the 19th international conference on World wide web
A comparison of language identification approaches on short, query-style texts

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Besides the content the writing style is an important discriminator in information filtering tasks. Ideally, the solution of a filtering task employs a text representation that models both kinds of characteristics. In this respect word stems are clearly content capturing, whereas word suffixes qualify as writing style indicators. Though the latter feature type is used for part of speech tagging, it has not yet been employed for information filtering in general. We propose a text representation that combines both the output of a stemming algorithm (stems) and the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation for a broad range of information filtering tasks to provide new insights into the adequacy and task-specificity of text representation models. Altogether we observe that co-stem-based representations outperform the classical bag of words model for several filtering tasks.