The basic approach in text categorization is to represent documents by single words. Other features are often added to achieve better classification results. In this paper, we focus on two such features: bigrams and 2-itemsets. We compare the gain in classification accuracy obtained when each is used to extend the single-word document representation on two standard text corpora, Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. We also present algorithms for discovering bigrams and 2-itemsets. Our results show a statistically significant improvement when either bigrams or 2-itemsets are incorporated. For 2-itemsets, however, the improvement depends on using an appropriate feature selection method, whereas for bigrams the classification accuracy improves even when a simple feature selection approach is applied. We conclude that, in our setting, extending the document representation with 2-itemsets is not very effective, because bigrams achieve better results and are less resource-consuming to discover.
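To make the two feature types concrete, here is a minimal sketch under stated assumptions. It is not the discovery algorithms or any of the five feature selection approaches from the paper. It assumes that a bigram is an ordered pair of adjacent words, that a 2-itemset is an unordered pair of distinct words occurring anywhere in the same document, and that a plain document-frequency threshold (`min_df`) stands in for feature selection. All function names and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def bigrams(tokens):
    # Assumption: a bigram is an ordered pair of adjacent words.
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def two_itemsets(tokens):
    # Assumption: a 2-itemset is an unordered pair of distinct words
    # that co-occur anywhere in the same document.
    vocab = sorted(set(tokens))
    return [f"{a}&{b}" for a, b in combinations(vocab, 2)]

def frequent(features_per_doc, min_df):
    # Stand-in for feature selection: keep features that appear
    # in at least min_df documents.
    df = Counter()
    for feats in features_per_doc:
        df.update(set(feats))
    return {f for f, n in df.items() if n >= min_df}

def extend(tokens, kept_bigrams, kept_itemsets):
    # Single-word counts extended with the selected pair features.
    counts = Counter(tokens)
    counts.update(b for b in bigrams(tokens) if b in kept_bigrams)
    counts.update(s for s in two_itemsets(tokens) if s in kept_itemsets)
    return counts

docs = ["oil prices rise", "crude oil prices fall", "prices of oil rise"]
tokenized = [tokenize(d) for d in docs]
kept_bi = frequent([bigrams(t) for t in tokenized], min_df=2)
kept_is = frequent([two_itemsets(t) for t in tokenized], min_df=3)
third_doc = extend(tokenized[2], kept_bi, kept_is)
```

In this toy example the third document contains the 2-itemset `oil&prices` but not the bigram `oil_prices`, because the two words co-occur without being adjacent. The sketch also hints at the cost difference: a document with n distinct words yields up to n(n-1)/2 candidate 2-itemsets, while a document with m tokens yields at most m-1 bigrams.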