Document analysis and visualization with zero-inflated Poisson

Authors:
Dora Alvarez; Hugo Hidalgo
Affiliations:
Centro de Investigación y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico 22860; Centro de Investigación y de Educación Superior de Ensenada (CICESE), Ensenada, Mexico 22860
Venue:
Data Mining and Knowledge Discovery
Year:
2009

Abstract

Data visualization aims to produce a graphical representation of high-dimensional information. The data are projected onto a lower-dimensional space in which structure can be sought among the projections. Among the available projection-based methods, the Generative Topographic Mapping (GTM) has become an important probabilistic framework for modeling data. Applying it to document data requires replacing the original Gaussian model with one suited to binary or multinomial variables. Several modifications of GTM have been proposed for this kind of data, but the resulting latent projections are scattered across the visualization plane. This paper proposes a document visualization method based on a generative probabilistic model consisting of a mixture of zero-inflated Poisson distributions. Its performance is evaluated in terms of cluster formation in the latent projections, using an index based on Fisher's classifier, and its topology preservation capability is measured with Sammon's stress error. The method is compared with GTM implementations using Gaussian, multinomial, and Poisson distributions and with a Latent Dirichlet model, and it performs better than these alternatives. A graphical presentation of the projections is also provided, showing the advantage of the proposed method in terms of visualization and class separation. A detailed analysis of some documents projected onto the latent representation shows that most of the documents lying away from their corresponding cluster can be identified as outliers.
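Two of the quantities named in the abstract have standard textbook definitions, and a minimal sketch may help fix ideas. The code below is not the authors' implementation. It shows only the probability mass function of a single zero-inflated Poisson variable (not the full mixture or its latent structure) and the usual form of Sammon's stress. The function names `zip_pmf` and `sammon_stress`, the parameter names `lam` (Poisson rate) and `pi` (probability of an extra zero), and the choice to skip pairs at zero original distance are assumptions made for illustration.

```python
import math


def zip_pmf(k, lam, pi):
    """P(X = k) for a zero-inflated Poisson variable.

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam). Parameter names are illustrative assumptions.
    Requires lam > 0 and 0 <= pi <= 1.
    """
    if k < 0:
        return 0.0
    # Poisson pmf computed in log space to avoid overflow in k!
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def sammon_stress(X, Y):
    """Sammon's stress between points X and their projections Y.

    X and Y are equal-length sequences of coordinate tuples. Pairs whose
    original distance is zero are skipped (an assumption of this sketch),
    since the stress divides by that distance.
    """
    total = 0.0  # sum of original pairwise distances
    err = 0.0    # sum of squared distance errors, each scaled by 1/d_orig
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            d_orig = math.dist(X[i], X[j])
            if d_orig == 0.0:
                continue
            d_proj = math.dist(Y[i], Y[j])
            total += d_orig
            err += (d_orig - d_proj) ** 2 / d_orig
    return err / total
```

A stress of zero means every pairwise distance is preserved exactly by the projection; larger values indicate more distortion. For word counts in documents, the extra mass at zero in `zip_pmf` reflects the fact that most vocabulary terms are absent from any given document.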