Word co-occurrence features for text classification

Authors:
Fábio Figueiredo;Leonardo Rocha;Thierson Couto;Thiago Salles;Marcos André Gonçalves;Wagner Meira Jr.
Affiliations:
EconoInfo Research, Belo Horizonte, Brazil and Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;Universidade Federal de São João Del Rei, Computer Science Department, São João Del Rei, Brazil;Universidade Federal de Goiás, Institute of Informatics, Goiânia, Brazil;Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil
Venue:
Information Systems
Year:
2011

Abstract

In this article we propose a data treatment strategy that generates new discriminative features, called compound-features (or c-features), for text classification. A c-feature is composed of terms that co-occur in a document, with no restriction on their order or on the distance between them. The strategy is applied before the classification task, so that documents are enriched with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent to the bag-of-words representation are reduced. We restrict c-features to two terms, which keeps their use computationally feasible while still improving classifier effectiveness. We test this approach with several classification algorithms on single-label, multi-class text collections. Experimental results show gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-averaged F1 on the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-averaged F1 on the OHSUMED collection).
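
To make the idea of two-term c-features concrete, the following is a minimal sketch, not the authors' implementation. It builds unordered term pairs per document (ignoring order and distance, as the abstract describes) and appends the selected pairs to the single-term features. The selection rule is an assumption made for illustration: the abstract only says the c-features are discriminative, so the `min_support` and `min_dominance` thresholds, the function names, and the toy documents are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_features(tokens):
    """Return the unordered term pairs that co-occur in one document.

    Order and distance are ignored: the pairs are built from the set of
    distinct terms, sorted so that each pair has one canonical name.
    """
    terms = sorted(set(tokens))
    return {f"{a}+{b}" for a, b in combinations(terms, 2)}


def select_c_features(docs, labels, min_support=2, min_dominance=0.9):
    """Keep pairs that look discriminative on the training documents.

    Assumed criterion (not taken from the abstract): a pair must occur in
    at least `min_support` documents, and at least `min_dominance` of
    those documents must belong to a single class.
    """
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for pair in pair_features(tokens):
            class_counts[pair][label] += 1

    selected = set()
    for pair, counts in class_counts.items():
        total = sum(counts.values())
        if total >= min_support and max(counts.values()) / total >= min_dominance:
            selected.add(pair)
    return selected


def augment(tokens, selected):
    """Append the selected c-features to the document's single-term features."""
    return list(tokens) + sorted(pair_features(tokens) & selected)


# Toy training collection (hypothetical data).
docs = [
    ["apple", "pie", "recipe"],
    ["apple", "pie", "baking"],
    ["apple", "stock", "price"],
    ["apple", "stock", "market"],
]
labels = ["food", "food", "finance", "finance"]

selected = select_c_features(docs, labels)
augmented = augment(["apple", "stock", "news"], selected)
print(sorted(selected))
print(augmented)
```

In this toy collection the single term "apple" appears in both classes and is therefore ambiguous, while the pairs "apple+pie" and "apple+stock" each occur in only one class. This is the kind of disambiguation the abstract attributes to c-features. The augmented documents can then be passed to any classifier that accepts bag-of-words input.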