The use of bigrams to enhance text categorization

  • Authors:
  • Chade-Meng Tan; Yuan-Fang Wang; Chan-Do Lee

  • Affiliations:
  • Google Inc., 2400 Bayshore Pkwy, Mountain View, CA; Department of Computer Science, University of California, Santa Barbara, CA; Department of Information and Communication Engineering, Taejon University, Taejon 300-716, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2002

Abstract

In this paper, we present an efficient text categorization algorithm that generates bigrams selectively, seeking out those with an especially good chance of being useful. The algorithm uses the information gain metric combined with various frequency thresholds. The selected bigrams, along with unigrams, are then given as features to two different classifiers: Naïve Bayes and maximum entropy. The experimental results suggest that the bigrams can substantially raise the quality of feature sets, showing increases in the break-even points and F1 measures. The McNemar test shows that in most categories the increases are statistically significant. Upon close examination, we concluded that the algorithm is most successful at correctly classifying more positive documents, but may cause more negative documents to be classified incorrectly.
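
The abstract describes selecting bigrams by combining an information gain criterion with frequency thresholds before adding them to the unigram feature set. Below is a minimal sketch of that general idea in Python; the function names, threshold values, and toy data are illustrative assumptions and do not reproduce the paper's actual procedure or parameters.

```python
import math
from collections import Counter

def entropy(pos, neg):
    """Binary entropy of a positive/negative document split."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(docs_with, docs_without):
    """Reduction in class entropy obtained by splitting on feature presence."""
    pos_w, neg_w = docs_with
    pos_wo, neg_wo = docs_without
    total = pos_w + neg_w + pos_wo + neg_wo
    base = entropy(pos_w + pos_wo, neg_w + neg_wo)
    with_frac = (pos_w + neg_w) / total
    return base - (with_frac * entropy(pos_w, neg_w)
                   + (1 - with_frac) * entropy(pos_wo, neg_wo))

def select_bigrams(docs, labels, min_doc_freq=3, min_gain=0.01):
    """Keep only bigrams that pass a frequency threshold and an IG threshold.

    docs: list of token lists; labels: 1 (positive class) or 0 (negative).
    min_doc_freq and min_gain are placeholder thresholds, not the paper's values.
    """
    n_docs = len(docs)
    df = Counter()       # document frequency of each bigram
    df_pos = Counter()   # document frequency within the positive class
    for tokens, label in zip(docs, labels):
        bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}
        for bg in bigrams:
            df[bg] += 1
            if label == 1:
                df_pos[bg] += 1
    n_pos = sum(labels)
    n_neg = n_docs - n_pos
    selected = []
    for bg, freq in df.items():
        if freq < min_doc_freq:              # frequency threshold
            continue
        pos_w = df_pos[bg]
        neg_w = freq - pos_w
        gain = information_gain((pos_w, neg_w), (n_pos - pos_w, n_neg - neg_w))
        if gain >= min_gain:                 # information-gain threshold
            selected.append(bg)
    return selected

# Toy usage on two tiny classes of documents.
docs = [
    "machine learning improves text categorization".split(),
    "text categorization with machine learning".split(),
    "the weather is sunny today".split(),
    "sunny weather is expected today".split(),
]
labels = [1, 1, 0, 0]
print(select_bigrams(docs, labels, min_doc_freq=2, min_gain=0.1))
```

In this sketch the surviving bigrams would simply be appended to the unigram vocabulary before training a classifier such as Naïve Bayes or maximum entropy, which is the role the paper assigns to them.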