Using scatterplots to understand and improve probabilistic models for text categorization and retrieval

Authors:
Giorgio Maria Di Nunzio
Affiliations:
Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131 Padua, Italy. Tel./fax: +39 049 8277613.
Venue:
International Journal of Approximate Reasoning
Year:
2009



Abstract

Representing documents as points in a two-dimensional Cartesian plane has proved to be a valid visualization tool for Automated Text Categorization (ATC): it helps users understand the relationships between categories of textual documents, visually audit the classifier, and identify suspicious training data. This paper analyzes a specific use of this visualization approach for the Naive Bayes (NB) model for text classification and the Binary Independence Model (BIM) for text retrieval. For text categorization, the classification decision rule is reformulated so that each coordinate of a document is the sum of two addends: a variable component, P(d|c_i), and a constant component, P(c_i), the prior of the category. When documents are plotted in the Cartesian plane according to this formulation, they appear shifted by a constant amount along the x-axis and the y-axis. This shift is more or less evident depending on which NB model, Bernoulli or multinomial, is chosen. For text retrieval, the same reformulation can be applied to the BIM. The visualization helps explain the decisions taken to rank the documents, in particular in the case of relevance feedback.
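
The decomposition of each coordinate into a variable and a constant addend can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions not stated in the abstract: the addends are taken in log space (so that likelihood and prior add), the x-axis is assigned to the category c_i and the y-axis to its complement, the model is a multinomial NB with Laplace smoothing, and the corpus, the function names `train_multinomial_nb` and `coordinates`, and the parameter `alpha` are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed log term probabilities.

    Classes are True (the category c_i) and False (its complement).
    alpha is an assumed Laplace smoothing constant.
    """
    vocab = sorted({term for doc in docs for term in doc})
    model = {}
    for cls in (True, False):
        cls_docs = [doc for doc, label in zip(docs, labels) if label == cls]
        counts = Counter(term for doc in cls_docs for term in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        model[cls] = {
            "log_prior": math.log(len(cls_docs) / len(docs)),
            "log_term": {t: math.log((counts[t] + alpha) / total) for t in vocab},
        }
    return model


def coordinates(doc, model):
    """Return (x, y); each coordinate is variable addend + constant addend."""
    point = []
    for cls in (True, False):
        log_term = model[cls]["log_term"]
        # Variable component: log P(d|c), depends on the document's terms.
        variable = sum(log_term[t] for t in doc if t in log_term)
        # Constant component: log P(c), the same for every document.
        constant = model[cls]["log_prior"]
        point.append(variable + constant)
    return tuple(point)


# Toy corpus (hypothetical): two documents in the category, three outside it.
docs = [["wheat", "price"], ["wheat", "export"], ["oil", "price"],
        ["oil", "barrel"], ["oil", "export"]]
labels = [True, True, False, False, False]

model = train_multinomial_nb(docs, labels)
x, y = coordinates(["wheat", "price"], model)

# Under these assumptions, a document is assigned to the category
# when its point lies below the line y = x.
predicted_in_category = x > y
```

Because the prior term is the same for every document, changing it moves all points by the same offset along the corresponding axis, which is one way to read the constant shift described above.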