A New Type of Feature --- Loose N-Gram Feature in Text Categorization

  • Authors:
  • Xian Zhang;Xiaoyan Zhu

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

  • Venue:
  • IbPRIA '07 Proceedings of the 3rd Iberian conference on Pattern Recognition and Image Analysis, Part I
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper introduces a new type of feature in text categorization. Based on an interesting linguistic observation, Loose N-gram feature, defined as co-occurring words within limited range, is quite different from traditional features, such as words, phrases or n-grams. Not only retaining useful context information, this kind of feature also has considerable classification ability. The features generated by our algorithm have acceptable statistical characteristics, thus can effectively avoid the sparseness problem. Experiment results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for the categorization tasks which rely on more semantic information. Our new type of feature could also be helpful in Information Retrieval research.