A New Type of Feature --- Loose N-Gram Feature in Text Categorization

Authors:
Xian Zhang;Xiaoyan Zhu
Affiliations:
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Venue:
IbPRIA '07 Proceedings of the 3rd Iberian conference on Pattern Recognition and Image Analysis, Part I
Year:
2007

Citing 9
Cited 1

Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Context-sensitive learning methods for text categorization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
The use of bigrams to enhance text categorization

Information Processing and Management: an International Journal
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Text classification using string kernels

The Journal of Machine Learning Research
Tagging sentence boundaries

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
From N-grams to collocations: an evaluation of Xtract

ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics

Building a topic hierarchy using the bag-of-related-words representation

Proceedings of the 11th ACM symposium on Document engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper introduces a new type of feature in text categorization. Based on an interesting linguistic observation, Loose N-gram feature, defined as co-occurring words within limited range, is quite different from traditional features, such as words, phrases or n-grams. Not only retaining useful context information, this kind of feature also has considerable classification ability. The features generated by our algorithm have acceptable statistical characteristics, thus can effectively avoid the sparseness problem. Experiment results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for the categorization tasks which rely on more semantic information. Our new type of feature could also be helpful in Information Retrieval research.