Taming wild phrases

Authors:
Cornelis H. A. Koster;Mark Seutter
Affiliations:
Dept. Comp. Sci., University of Nijmegen, The Netherlands;Dept. Comp. Sci., University of Nijmegen, The Netherlands
Venue:
ECIR'03 Proceedings of the 25th European conference on IR research
Year:
2003

Abstract

This paper compares the suitability of different document representations for automatic document classification, investigating a whole range of representations between bag-of-words and bag-of-phrases. We look at some of their statistical properties, and determine for each representation the optimal choice of classification parameters and the effect of term selection. Phrases are represented by an abstraction called Head/Modifier (HM) pairs. Rather than just throwing phrases and keywords together, we start with pure HM pairs and gradually add more keywords to the document representation. We use classification on keywords as the baseline, and compare it with the contribution of the pure HM pairs to classification accuracy and with the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term selection cannot overcome the differences in document frequency between phrases and words, and propose the use of term clustering to make phrases more cooperative.
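To make the range of representations concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the head/modifier pairs have already been extracted by some parser (not shown). The toy documents, the `[head,modifier]` string notation, and all function names are hypothetical, chosen only for illustration. As a simplification, the word-level document frequencies are computed only over words that occur in pairs, rather than over all words of a document.

```python
from collections import Counter

# Hypothetical toy input: each document is a list of (head, modifier) pairs,
# assumed to come from a parser that is outside the scope of this sketch.
docs = [
    [("phrase", "wild"), ("selection", "term")],
    [("selection", "term"), ("classification", "document")],
    [("phrase", "syntactic"), ("classification", "text")],
]

def hm_terms(pairs):
    """Pure HM-pair representation: one term per pair (notation is an assumption)."""
    return [f"[{head},{modifier}]" for head, modifier in pairs]

def with_heads(pairs):
    """HM pairs plus the heads added as separate keyword terms."""
    return hm_terms(pairs) + [head for head, _ in pairs]

def with_heads_and_modifiers(pairs):
    """HM pairs plus both heads and modifiers as keyword terms."""
    return with_heads(pairs) + [modifier for _, modifier in pairs]

def document_frequency(term_lists):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # set(): count each term once per document
    return df

def mean_df(df):
    """Average document frequency over all distinct terms."""
    return sum(df.values()) / len(df)

pair_df = document_frequency(hm_terms(d) for d in docs)
word_df = document_frequency([w for pair in d for w in pair] for d in docs)

print("mean DF of HM pairs:", mean_df(pair_df))
print("mean DF of words:   ", mean_df(word_df))
```

Even on this toy data the pairs are sparser than their constituent words, which is the document-frequency gap between phrases and words that the abstract refers to.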