Extending the single words-based document model: a comparison of bigrams and 2-itemsets

Authors:
Roman Tesar;Vaclav Strnad;Karel Jezek;Massimo Poesio
Affiliations:
University of West Bohemia;University of West Bohemia;University of West Bohemia;University of Essex
Venue:
Proceedings of the 2006 ACM symposium on Document engineering
Year:
2006

Citing 18

Inductive learning algorithms and representations for text categorization
Proceedings of the seventh international conference on Information and knowledge management

Web document clustering: a feasibility demonstration
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Approximate statistical tests for comparing supervised classification learning algorithms
Neural Computation

Extending naïve Bayes classifiers using long itemsets
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining

Scalable association-based text classification
Proceedings of the ninth international conference on Information and knowledge management

Real world performance of association rule algorithms
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

High-performing feature selection for text classification
Proceedings of the eleventh international conference on Information and knowledge management

The use of bigrams to enhance text categorization
Information Processing and Management: an International Journal

A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning

Feature Selection for Unbalanced Class Distribution and Naive Bayes
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning

Text Document Categorization by Term Association
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining

Statistical Phrases in Automated Text Categorization

Using Association Features to Enhance the Performance of Naïve Bayes Text Classifier
ICCIMA '03 Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications

Distributional word clusters vs. words for text categorization
The Journal of Machine Learning Research

An adaptive k-nearest neighbor text categorization strategy
ACM Transactions on Asian Language Information Processing (TALIP)

Text classification with kernels on the multinomial manifold
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

A new feature selection score for multinomial naive Bayes text classification based on KL-divergence
ACLdemo '04 Proceedings of the ACL 2004 on Interactive poster and demonstration sessions

Feature weighting for co-occurrence-based classification of words
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Cited 4

Word co-occurrence features for text classification
Information Systems

Building a topic hierarchy using the bag-of-related-words representation
Proceedings of the 11th ACM symposium on Document engineering

The influence of collocation segmentation and top 10 items to keyword assignment performance
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing

Discovering relevant features for effective query formulation
IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval

Abstract

The basic approach in text categorization is to represent documents by single words. However, other features are often used to achieve better classification results. In this paper, we focus on two such features: bigrams and 2-itemsets. We compare the improvement in classification accuracy obtained when these features are used to extend the single-word document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for discovering bigrams and 2-itemsets are presented as well. Our results show a statistically significant improvement both when bigrams and when 2-itemsets are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method, whereas with bigrams classification accuracy improves even when a simple feature selection approach is applied to discover them. We conclude that, in our setting, extending the document representation with 2-itemsets is not very effective, because bigrams achieve better results and discovering them is less resource-consuming.
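
The abstract does not define the two feature types, so the following is only a minimal sketch under commonly used definitions, not the discovery algorithms presented in the paper. It assumes that a bigram is an ordered pair of adjacent words, that a 2-itemset is an unordered pair of distinct words occurring anywhere in the same document, and that a feature is kept when it appears in at least `min_support` documents. The tokenizer, the function names, the toy documents and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bigram_doc_freq(docs):
    """For each ordered pair of adjacent words, count the documents containing it."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # set() so that a pair is counted at most once per document
        counts.update(set(zip(tokens, tokens[1:])))
    return counts


def itemset2_doc_freq(docs):
    """For each unordered pair of distinct words, count the documents containing both."""
    counts = Counter()
    for doc in docs:
        # Sorting the vocabulary gives every pair one canonical order
        vocab = sorted(set(tokenize(doc)))
        counts.update(combinations(vocab, 2))
    return counts


def frequent(counts, min_support):
    """Keep the features that occur in at least min_support documents (assumed criterion)."""
    return {feature for feature, count in counts.items() if count >= min_support}


# Toy corpus, invented for illustration only
docs = [
    "interest rates rise",
    "rates of interest rise",
    "bank cuts interest rates",
]

frequent_bigrams = frequent(bigram_doc_freq(docs), min_support=2)
frequent_itemsets = frequent(itemset2_doc_freq(docs), min_support=2)

print(sorted(frequent_bigrams))
print(sorted(frequent_itemsets))
```

On this toy corpus the 2-itemset view yields more candidate pairs than the bigram view, because it ignores word order and adjacency. The number of candidate 2-itemsets grows quadratically with the number of distinct words in a document, whereas the number of bigrams grows only linearly with document length, which is one plausible reading of why the abstract describes bigram discovery as less resource-consuming.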