Using scatterplots to understand and improve probabilistic models for text categorization and retrieval

  • Authors:
  • Giorgio Maria Di Nunzio

  • Affiliations:
  • Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131 Padua, Italy. Tel./fax: +39 049 8277613.

  • Venue:
  • International Journal of Approximate Reasoning
  • Year:
  • 2009

Abstract

The two-dimensional representation of documents, which places each document as a point in a Cartesian plane, has proved to be a valid visualization tool for Automated Text Categorization (ATC): it helps in understanding the relationships between categories of textual documents, and it helps users to visually audit the classifier and identify suspicious training data. This paper analyzes a specific use of this visualization approach in the case of the Naive Bayes (NB) model for text classification and the Binary Independence Model (BIM) for text retrieval. For text categorization, the equation for the classification decision has to be reformulated so that each coordinate of a document is the sum of two addends: a variable component P(d|c_i), and a constant component P(c_i), the prior of the category. When the documents are plotted in the Cartesian plane according to this formulation, they can be seen to be shifted by a constant amount along the x-axis and the y-axis. This shifting effect is more or less evident depending on which NB model, Bernoulli or multinomial, is chosen. For text retrieval, the same reformulation can be applied to the BIM. The visualization helps to understand the decisions that are taken to rank the documents, in particular in the case of relevance feedback.
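The "sum of two addends" described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the sum is taken in log space, assumes a multinomial NB model with add-one smoothing, and assumes the x-axis holds the score for category c_i and the y-axis the score for all other categories. The function names `train_multinomial` and `coordinates` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_multinomial(docs, labels, target, vocab, alpha=1.0):
    """Estimate log priors and smoothed log term probabilities
    for the target category versus all other categories.
    Assumes both groups contain at least one training document."""
    pos, neg = Counter(), Counter()
    n_pos = 0
    for tokens, label in zip(docs, labels):
        if label == target:
            pos.update(tokens)
            n_pos += 1
        else:
            neg.update(tokens)
    n_neg = len(docs) - n_pos

    def log_probs(counts):
        # Add-alpha smoothing over the vocabulary (an assumption of this sketch).
        total = sum(counts[t] for t in vocab) + alpha * len(vocab)
        return {t: math.log((counts[t] + alpha) / total) for t in vocab}

    return {
        "log_prior": (math.log(n_pos / len(docs)), math.log(n_neg / len(docs))),
        "log_term": (log_probs(pos), log_probs(neg)),
    }


def coordinates(tokens, model):
    """Return (x, y): each coordinate is a variable component, log P(d|c),
    plus a constant component, log P(c)."""
    lp_pos, lp_neg = model["log_term"]
    prior_pos, prior_neg = model["log_prior"]
    # Variable components: depend on the document's terms.
    var_x = sum(lp_pos[t] for t in tokens if t in lp_pos)
    var_y = sum(lp_neg[t] for t in tokens if t in lp_neg)
    # Constant components: the priors shift every document by the same amount.
    return (var_x + prior_pos, var_y + prior_neg)


# Hypothetical toy corpus, for illustration only.
docs = [["ball", "goal"], ["goal", "match"], ["vote", "law"]]
labels = ["sport", "sport", "politics"]
vocab = {"ball", "goal", "match", "vote", "law"}

model = train_multinomial(docs, labels, "sport", vocab)
sport_xy = coordinates(["goal", "ball"], model)
politics_xy = coordinates(["vote", "law"], model)
```

Under these assumptions a document is assigned to the target category when x > y, that is, when its point falls below the diagonal y = x. Because the log priors are added to every document, changing them shifts all points by the same amount along each axis.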