Classification Based on Specific Vocabulary

Authors:
Jacques Savoy;Olena Zubaryeva
Venue:
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2011

Abstract

Assuming that word occurrences follow a binomial distribution, we propose computing a standardized Z score to define the specific vocabulary of a subset relative to that of the entire corpus. This score is used to weight the terms characterizing a document (or a sample of texts). We then show how these Z score values can be used to derive an efficient categorization scheme. To evaluate this proposal, we categorize speeches given by B. Obama as either electoral or presidential. The results tend to show that the suggested classification scheme performs better than both a Support Vector Machine and a Naive Bayes classifier (10-fold cross-validation).
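
The abstract does not spell out the formula, so the following is only a minimal sketch of one way such a Z score could be computed. It assumes the usual standardization of a binomial count: a term's frequency in the subset is compared with its expected frequency n_sub * p, where p is estimated from the whole corpus, and divided by the binomial standard deviation. The function names `z_scores` and `classify`, the whitespace tokenization, and the decision rule (sum the Z scores of a document's terms per category and pick the largest sum) are illustrative assumptions, not necessarily the authors' exact scheme.

```python
import math
from collections import Counter


def z_scores(subset_tokens, corpus_tokens):
    """Return a Z score per term for a subset, relative to the whole corpus.

    corpus_tokens is assumed to contain the subset's tokens as well.
    Under a binomial model, a term with corpus probability p is expected
    n_sub * p times in a subset of n_sub tokens, with variance
    n_sub * p * (1 - p).
    """
    n_sub = len(subset_tokens)
    n_all = len(corpus_tokens)
    tf_sub = Counter(subset_tokens)
    tf_all = Counter(corpus_tokens)

    scores = {}
    for term, tf in tf_all.items():
        p = tf / n_all                       # estimated occurrence probability
        sd = math.sqrt(n_sub * p * (1 - p))  # binomial standard deviation
        if sd == 0:
            continue                         # degenerate case, no usable score
        scores[term] = (tf_sub[term] - n_sub * p) / sd
    return scores


def classify(doc_tokens, z_by_category):
    """Assumed decision rule: pick the category with the largest Z score sum."""
    totals = {
        category: sum(z.get(token, 0.0) for token in doc_tokens)
        for category, z in z_by_category.items()
    }
    return max(totals, key=totals.get)


# Toy data, invented purely for illustration.
electoral = "vote change hope vote campaign".split()
presidential = "congress budget policy congress law".split()
corpus = electoral + presidential

z_by_category = {
    "electoral": z_scores(electoral, corpus),
    "presidential": z_scores(presidential, corpus),
}
label = classify("vote hope".split(), z_by_category)
```

Terms with a large positive Z score are over-represented in the subset and can be read as its specific vocabulary; terms with a large negative score are under-represented. A real experiment would also need proper tokenization and the 10-fold cross-validation mentioned in the abstract, neither of which is shown here.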