Besides content, writing style is an important discriminator in information filtering tasks. Ideally, a solution to a filtering task employs a text representation that models both kinds of characteristics. In this respect, word stems clearly capture content, whereas word suffixes qualify as indicators of writing style. Although suffixes are used as features for part-of-speech tagging, they have not yet been employed for information filtering in general. We propose a text representation that combines the output of a stemming algorithm (stems) with the stem-reduced words (co-stems). A co-stem can be a prefix, an infix, a suffix, or a concatenation of prefixes, infixes, or suffixes. Using accepted standard corpora, we analyze the discriminative power of this representation across a broad range of information filtering tasks, providing new insights into the adequacy and task specificity of text representation models. Overall, we observe that co-stem-based representations outperform the classical bag-of-words model for several filtering tasks.
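To make the stem and co-stem split concrete, the following is a minimal sketch in Python and not the authors' implementation. The suffix list, the `toy_stem` and `co_stem` helpers, and the `stem=` / `co=` feature prefixes are all hypothetical names introduced for illustration. A real system would presumably use an established stemming algorithm, and the sketch assumes the stem occurs verbatim inside the word, which does not hold for every stemmer.

```python
from collections import Counter

# Hypothetical suffix list for a toy stemmer (an assumption for this
# sketch); longer suffixes come first so they are tried first.
TOY_SUFFIXES = ("ization", "ations", "ation", "ingly", "ness",
                "ing", "ed", "ly", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def co_stem(word: str, stem: str) -> str:
    """Return the stem-reduced word: whatever is left once the stem is removed.

    Assumes the stem appears verbatim in the word. The parts before and
    after the stem are concatenated, so the result may be a prefix, a
    suffix, or both joined together.
    """
    start = word.find(stem)
    if start < 0:
        return ""  # stem was altered by the stemmer; no co-stem recovered
    return word[:start] + word[start + len(stem):]


def represent(text: str) -> Counter:
    """Count stem features and co-stem features in one combined bag."""
    features = Counter()
    for token in text.lower().split():
        stem = toy_stem(token)
        features["stem=" + stem] += 1
        rest = co_stem(token, stem)
        if rest:
            features["co=" + rest] += 1
    return features


if __name__ == "__main__":
    print(represent("walking walked quickly"))
```

In this sketch the stem features carry the content signal and the co-stem features carry the inflectional residue, so a classifier could be trained on either subset or on both together.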