Mixture model and MDSDCA for textual data

Authors:
Faryel Allouti;Mohamed Nadif;Le Thi Hoai An;Benoît Otjacques
Affiliations:
LIPADE, UFR MI, Paris Descartes University, Paris, France;LIPADE, UFR MI, Paris Descartes University, Paris, France;LITA, UFR MIM, Paul Verlaine University of Metz, Metz, France;Public Research Center-Gabriel Lippmann, Informatics, Systems and Collaboration Department, Belvaux, Luxembourg
Venue:
CDVE'09 Proceedings of the 6th international conference on Cooperative design, visualization, and engineering
Year:
2009

Abstract

E-mail has become an essential component of cooperation in business. Consequently, the large number of messages, whether manually produced or automatically generated, can rapidly cause information overload for users. Many research projects have examined this issue, but surprisingly few have tackled the problem of the files attached to e-mails, which in many cases carry a substantial part of the semantics of the message. This paper addresses this specific topic and focuses on the clustering and visualization of attached files. Relying on the multinomial mixture model, we use the Classification EM algorithm (CEM) to cluster the set of files, and MDSDCA to visualize the resulting classes of documents. Like the Multidimensional Scaling (MDS) method, the MDSDCA algorithm, which is based on Difference of Convex functions (DC) programming, aims to minimize the stress criterion. As MDSDCA is iterative, we propose an initialization approach that avoids starting from random values. Experiments are carried out on simulated and textual data.
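To make the two ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes documents are given as term-count vectors, runs a hard-assignment (classification) EM loop for a multinomial mixture, and evaluates an unweighted raw stress for a given layout. The function names `cem_multinomial` and `raw_stress`, the `init` parameter, and the `eps` smoothing constant are all hypothetical choices made for illustration; the exact stress variant and the initialization scheme used in the paper are not specified here.

```python
import math
import random


def cem_multinomial(counts, k, init=None, n_iter=50, seed=0, eps=1e-9):
    """Classification EM sketch for a multinomial mixture (illustrative only).

    counts: list of documents, each a list of term counts of equal length.
    init:   optional starting labels; if None, a random partition is used.
    Returns the list of cluster labels, one per document.
    """
    rng = random.Random(seed)
    n, d = len(counts), len(counts[0])
    labels = list(init) if init is not None else [rng.randrange(k) for _ in range(n)]
    for _ in range(n_iter):
        # M-step: estimate proportions and term probabilities from the partition.
        sizes = [0] * k
        totals = [[eps] * d for _ in range(k)]  # eps (assumption) avoids log(0)
        for x, z in zip(counts, labels):
            sizes[z] += 1
            for j, c in enumerate(x):
                totals[z][j] += c
        log_pi = [math.log((s + eps) / (n + k * eps)) for s in sizes]
        log_theta = []
        for row in totals:
            row_sum = sum(row)
            log_theta.append([math.log(v / row_sum) for v in row])
        # E-step and C-step: assign each document to its most probable component.
        new_labels = []
        for x in counts:
            scores = [
                log_pi[g] + sum(c * lt for c, lt in zip(x, log_theta[g]))
                for g in range(k)
            ]
            new_labels.append(max(range(k), key=scores.__getitem__))
        if new_labels == labels:  # partition is stable: stop
            break
        labels = new_labels
    return labels


def raw_stress(points, dissim):
    """Unweighted raw stress: sum over pairs i < j of (delta_ij - d_ij)^2."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (dissim[i][j] - math.dist(points[i], points[j])) ** 2
    return total


# Usage: six toy "documents" over four terms, started from an alternating partition.
docs = [[9, 1, 0, 0], [8, 2, 0, 0], [10, 0, 1, 0],
        [0, 0, 9, 1], [0, 1, 8, 2], [1, 0, 10, 0]]
labels = cem_multinomial(docs, k=2, init=[0, 1, 0, 1, 0, 1])
```

Like any CEM-style procedure, this sketch only reaches a local optimum, so the result depends on the starting partition; passing `init` explicitly keeps the run reproducible.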