This paper introduces a new type of feature for text categorization. Motivated by a linguistic observation, the Loose N-gram feature is defined as a set of words that co-occur within a limited range, which distinguishes it from traditional features such as words, phrases, or n-grams. This kind of feature not only retains useful context information but also has considerable discriminative power for classification. The features generated by our algorithm have acceptable statistical characteristics and thus effectively avoid the sparseness problem. Experimental results show that the Loose N-gram feature is helpful and promising in statistical text categorization systems, especially for categorization tasks that rely more heavily on semantic information. This new type of feature could also be helpful in Information Retrieval research.
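To make the idea of "co-occurring words within limited range" concrete, here is a minimal sketch of one possible reading for the two-word case. It is not the paper's algorithm, which the abstract does not describe. The function name `loose_bigrams`, the `window` parameter, and the choice to treat pairs as unordered are all assumptions made for illustration.

```python
from collections import Counter


def loose_bigrams(tokens, window=3):
    """Count unordered pairs of distinct words that co-occur within
    a span of `window` consecutive tokens (assumed definition)."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                # Sort so ("bank", "interest") and ("interest", "bank")
                # are counted as the same feature.
                counts[tuple(sorted((left, right)))] += 1
    return counts


tokens = "the bank raised interest rates".split()
features = loose_bigrams(tokens, window=3)
```

With `window=3`, "bank" and "interest" form a feature even though they are not adjacent, while "bank" and "rates" are too far apart to be paired. A strict bigram model would miss the first pair, which is the kind of context a looser co-occurrence feature is meant to retain.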