This paper compares the suitability of different document representations for automatic document classification, investigating a whole range of representations between bag-of-words and bag-of-phrases. We examine some of their statistical properties and determine, for each representation, the optimal choice of classification parameters and the effect of term selection. Phrases are represented by an abstraction called Head/Modifier (HM) pairs. Rather than simply throwing phrases and keywords together, we start with pure HM pairs and gradually add more keywords to the document representation. Classification on keywords serves as the baseline, against which we compare the contribution of the pure HM pairs to classification accuracy and the incremental contributions from heads and modifiers. Finally, we measure the accuracy achieved with all words and all HM pairs combined, which turns out to be only marginally above the baseline. We conclude that even the most careful term selection cannot overcome the differences in document frequency between phrases and words, and we propose term clustering to make phrases more cooperative.
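The range of representations described above, and the gap in document frequency between words and HM pairs, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the `[head,modifier]` string notation, the mode names, and the helper functions `represent`, `document_frequency`, and `mean_df` are hypothetical and are not taken from the paper, which obtains its pairs from a parser and evaluates on real collections.

```python
from collections import Counter

# Hypothetical toy corpus. Each document carries its keywords and its
# (head, modifier) pairs; a real system would extract the pairs with a parser.
DOCS = [
    {"words": ["fast", "parser", "natural", "language"],
     "pairs": [("parser", "fast"), ("language", "natural")]},
    {"words": ["robust", "parser", "natural", "language"],
     "pairs": [("parser", "robust"), ("language", "natural")]},
    {"words": ["fast", "classifier", "text"],
     "pairs": [("classifier", "fast"), ("classifier", "text")]},
]


def represent(doc, mode):
    """Return the set of terms of one document under a given representation."""
    pair_terms = {f"[{head},{mod}]" for head, mod in doc["pairs"]}
    heads = {head for head, _ in doc["pairs"]}
    modifiers = {mod for _, mod in doc["pairs"]}
    if mode == "words":                   # bag-of-words baseline
        return set(doc["words"])
    if mode == "pairs":                   # pure HM pairs
        return pair_terms
    if mode == "pairs+heads":             # HM pairs plus their heads
        return pair_terms | heads
    if mode == "pairs+heads+modifiers":   # HM pairs plus heads and modifiers
        return pair_terms | heads | modifiers
    if mode == "words+pairs":             # all words and all HM pairs
        return set(doc["words"]) | pair_terms
    raise ValueError(f"unknown mode: {mode}")


def document_frequency(docs, mode):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(represent(doc, mode))
    return df


def mean_df(df):
    """Average document frequency over all distinct terms."""
    return sum(df.values()) / len(df)


word_df = document_frequency(DOCS, "words")
pair_df = document_frequency(DOCS, "pairs")
print("mean DF of words:", round(mean_df(word_df), 2))
print("mean DF of HM pairs:", round(mean_df(pair_df), 2))
```

Even on this tiny invented corpus the pairs occur in fewer documents on average than the words they are built from, which is the kind of sparsity that term selection alone does not remove. The numbers carry no empirical weight.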