Beyond TFIDF weighting for text categorization in the vector space model

Authors:
Pascal Soucy; Guy W. Mineau
Affiliations:
Coveo, Quebec, Canada; Université Laval, Québec, Canada
Venue:
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Year:
2005

Abstract

KNN and SVM are two machine learning approaches to Text Categorization (TC) based on the Vector Space Model. In this model, borrowed from Information Retrieval (IR), each document is represented as a vector in which each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the IR TFIDF measure. While this weighting method seems well suited to IR, it is not clear that it is the best choice for TC problems. In particular, it does not use the information implicitly contained in the categorization task when representing documents. In this paper, we introduce a new weighting method based on a statistical estimate of the importance of a word for a specific categorization problem. This method also has the benefit of making feature selection implicit, since features that are useless for the categorization problem at hand receive a very small weight. Extensive experiments reported in the paper show that this new weighting method significantly improves classification accuracy, as measured on many categorization tasks.
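
The traditional representation described in the abstract can be illustrated with a minimal sketch. It builds one vector per document, with one component per vocabulary word, weighted by TFIDF. The exact TFIDF variant is an assumption, since the abstract does not specify one: raw term frequency multiplied by log(N / df). The function name `tfidf_vectors` and the toy documents are hypothetical. This sketch shows only the baseline weighting, not the new method the paper introduces.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log(N / df).
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors


# Hypothetical toy corpus, already tokenized.
docs = [
    ["wheat", "price", "rises"],
    ["wheat", "export", "falls"],
    ["stock", "price", "falls"],
]
vocab, vectors = tfidf_vectors(docs)
for word, weight in zip(vocab, vectors[0]):
    print(f"{word}: {weight:.3f}")
```

Note that the weights depend only on word counts across the collection. The category labels of the documents play no role, which is the limitation the abstract points out for TC problems.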