In this paper, we present an efficient text categorization algorithm that generates bigrams selectively, keeping only those that have an especially good chance of being useful. Candidate bigrams are chosen using the information gain metric combined with various frequency thresholds. The selected bigrams, along with unigrams, are then given as features to two different classifiers: Naïve Bayes and maximum entropy. The experimental results suggest that the bigrams can substantially raise the quality of feature sets, with increases in both the break-even points and the F1 measures. The McNemar test shows that in most categories the increases are highly significant. On close examination, we conclude that the algorithm's main gain is in correctly classifying more positive documents, although it may also cause more negative documents to be classified incorrectly.
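The selection step described above can be illustrated with a minimal sketch. The abstract does not give the actual thresholds or the exact procedure, so everything specific here is an assumption: the function names `information_gain` and `select_bigrams`, the single document-frequency cutoff `min_doc_freq`, the `top_k` limit, and the binary (one category versus the rest) setup are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative documents."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Information gain of a binary feature (bigram present / absent) for one category."""
    n = n_pos + n_neg
    n_with = pos_with + neg_with
    n_without = n - n_with
    h_before = entropy(n_pos, n_neg)
    h_after = (n_with / n) * entropy(pos_with, neg_with) \
        + (n_without / n) * entropy(n_pos - pos_with, n_neg - neg_with)
    return h_before - h_after


def select_bigrams(docs, labels, min_doc_freq=2, top_k=100):
    """Return up to `top_k` bigrams ranked by information gain.

    Assumption: a single document-frequency threshold stands in for the
    "various frequency thresholds" mentioned in the abstract.
    """
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        bigrams = set(zip(tokens, tokens[1:]))  # count each bigram once per document
        (pos_df if label else neg_df).update(bigrams)

    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos

    scored = []
    for bigram in set(pos_df) | set(neg_df):
        if pos_df[bigram] + neg_df[bigram] < min_doc_freq:
            continue  # too rare to be a reliable feature
        gain = information_gain(pos_df[bigram], neg_df[bigram], n_pos, n_neg)
        scored.append((gain, bigram))

    # Highest gain first; ties broken alphabetically for a deterministic order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [bigram for _, bigram in scored[:top_k]]


# Toy corpus: two documents in the category (label 1), two outside it (label 0).
docs = [
    ["interest", "rate", "rises"],
    ["interest", "rate", "falls"],
    ["wheat", "crop", "rises"],
    ["wheat", "crop", "falls"],
]
labels = [1, 1, 0, 0]
selected = select_bigrams(docs, labels, min_doc_freq=2, top_k=1)
```

In this toy corpus only the bigrams that occur in at least two documents survive the frequency cutoff, and those that separate the two classes perfectly receive the maximum gain. The selected bigrams would then be added to the unigram features before training a classifier such as Naïve Bayes or maximum entropy.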