AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

Authors:
Zhixu Li;Pei Li;Wei Wei;Hongyan Liu;Jun He;Tao Liu;Xiaoyong Du
Affiliations:
Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Department of Management Science and Engineering, Tsinghua University, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China
Venue:
APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Year:
2009

Abstract

Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods usually take single words as terms, since this performs well on several well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics within the same domain. With words alone, such texts cannot be categorized effectively, since they share nearly the same word set. We therefore hypothesize that the results can be improved by also using phrases as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a data set of paper abstracts from the field of databases. Accordingly, we also develop a system called AutoPCS, which helps experts choose relevant topics for newly arriving papers from a predefined topic list using only their abstracts.
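
The core observation, that texts on similar topics can share nearly the same word set while still differing in their phrases, can be illustrated with a minimal sketch. The abstract does not describe the paper's phrase extraction method, so contiguous two-word phrases are used here purely as an assumed stand-in; the function names and the two example texts are likewise hypothetical.

```python
from collections import Counter


def word_terms(text):
    """Split a text into lowercase word terms (whitespace tokenization)."""
    return text.lower().split()


def phrase_terms(text, n=2):
    """Return contiguous n-word phrases.

    Assumption: plain n-grams stand in for the paper's own phrase
    extraction method, which the abstract does not specify.
    """
    words = word_terms(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def term_vector(text, use_phrases=False):
    """Count the terms of a text: words only, or words plus phrases."""
    terms = word_terms(text)
    if use_phrases:
        terms = terms + phrase_terms(text)
    return Counter(terms)


# Two hypothetical texts from the same domain that use identical words.
doc_a = "query optimization for data streams"
doc_b = "data optimization for query streams"

words_a = term_vector(doc_a)
words_b = term_vector(doc_b)
both_a = term_vector(doc_a, use_phrases=True)
both_b = term_vector(doc_b, use_phrases=True)

# With words only, the two texts have identical term vectors.
print(words_a == words_b)

# With phrases added, the vectors differ; these terms occur only in doc_a.
print(both_a == both_b)
print(sorted(set(both_a) - set(both_b)))
```

In this toy case the word-only vectors are indistinguishable, so no classifier could separate the two texts, whereas the added phrase terms give each text features of its own. A real system would still need feature selection over the much larger phrase vocabulary, as the abstract indicates.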