AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

  • Authors:
  • Zhixu Li;Pei Li;Wei Wei;Hongyan Liu;Jun He;Tao Liu;Xiaoyong Du

  • Affiliations:
  • Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Department of Management Science and Engineering, Tsinghua University, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, Beijing, China

  • Venue:
  • APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
  • Year:
  • 2009

Abstract

Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods usually take single words as terms, since this performs well on several well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics in the same domain. With words alone, such texts cannot be categorized effectively, since they share nearly the same word set. We therefore hypothesize that the results can be improved by also using phrases as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a data set of paper abstracts in the field of databases. Accordingly, we also develop a system called AutoPCS, which helps experts choose relevant topics for newly submitted papers from a predefined topic list using only their abstracts.
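
The motivating observation, that texts on similar topics can share nearly the same word set while differing in their phrases, can be illustrated with a small sketch. The abstract does not describe the authors' phrase extraction method, so the code below substitutes plain contiguous word bigrams as a stand-in; the function names and the two example titles are hypothetical and chosen only for illustration.

```python
from collections import Counter

PUNCT = ".,;:()"

def word_terms(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]

def phrase_terms(text, n=2):
    """Contiguous word n-grams (an assumed stand-in for real phrase extraction)."""
    words = word_terms(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def term_counts(text, use_phrases=True):
    """Term frequencies over words, optionally extended with phrases."""
    terms = word_terms(text)
    if use_phrases:
        terms += phrase_terms(text)
    return Counter(terms)

# Two hypothetical titles built from exactly the same words.
a = "query optimization for data stream processing"
b = "stream query processing for data optimization"

# Word-only features cannot tell the two texts apart.
same_words = set(word_terms(a)) == set(word_terms(b))

# Phrase features overlap only where the word order happens to coincide.
shared_phrases = set(phrase_terms(a)) & set(phrase_terms(b))
```

With word-only terms the two texts have identical feature sets, whereas their bigram sets are almost disjoint, which is the kind of signal a phrase-based categorizer could exploit. A real system would still need a feature selection step and a classifier on top of these counts, as the abstract notes.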