Efficient visualization of document streams

Authors:
Miha Grčar;Vid Podpečan;Matjaž Juršič;Nada Lavrač
Affiliations:
Jožef Stefan Institute, Ljubljana, Slovenia;Jožef Stefan Institute, Ljubljana, Slovenia;Jožef Stefan Institute, Ljubljana, Slovenia;Jožef Stefan Institute, Ljubljana, Slovenia and University of Nova Gorica, Nova Gorica, Slovenia
Venue:
DS'10: Proceedings of the 13th International Conference on Discovery Science
Year:
2010




Abstract

In machine learning and data mining, multidimensional scaling (MDS) and MDS-like methods are used extensively for dimensionality reduction and for gaining insight into overwhelming amounts of data through visualization. With the growth of the Web and of the activities of Web users, the amount of data is not only growing exponentially but is also becoming available in the form of streams, in which new data instances constantly flow into the system and the algorithm must update its model in near-real time. This paper presents an algorithm for document stream visualization through an MDS-like distance-preserving projection onto a 2D canvas. The visualization algorithm is essentially a pipeline that employs several methods from machine learning. Experimental verification shows that each stage of the pipeline is able to process a batch of documents in constant time. In an experimental setting with a limited buffer capacity and a constant document batch size, the pipeline processes roughly 2.5 documents per second, which corresponds to approximately 25% of the rate of the entire blogosphere and should be sufficient for most real-life applications.
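
The general idea of a distance-preserving 2D layout over a bounded document buffer can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline described in the paper: the names `StreamLayout`, `bag_of_words`, `cosine_distance`, and `majorization_step` are hypothetical, documents are reduced to plain term counts, and the layout is refined with a textbook unit-weight stress majorization update, warm-started from the previous positions.

```python
import math
import random
from collections import Counter, deque


def bag_of_words(text):
    """Hypothetical helper: lower-cased term counts for one document."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Cosine distance between two term-count mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def stress(points, dist):
    """Sum of squared differences between 2D and target distances."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def majorization_step(points, dist):
    """One unit-weight stress majorization (Guttman transform) update."""
    n = len(points)
    updated = []
    for i in range(n):
        x = y = 0.0
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if d > 0.0:
                ratio = dist[i][j] / d
                x += ratio * (points[i][0] - points[j][0])
                y += ratio * (points[i][1] - points[j][1])
        updated.append((x / n, y / n))
    return updated


class StreamLayout:
    """Keeps at most `capacity` documents and their 2D positions."""

    def __init__(self, capacity, seed=0):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add_batch(self, docs, iterations=20):
        # New documents start at random positions; older ones keep theirs,
        # so each batch refines the previous layout instead of restarting.
        for doc in docs:
            pos = (self.rng.uniform(-1, 1), self.rng.uniform(-1, 1))
            self.buffer.append((doc, pos))
        vectors = [v for v, _ in self.buffer]
        points = [p for _, p in self.buffer]
        dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
        before = stress(points, dist)
        for _ in range(iterations):
            points = majorization_step(points, dist)
        self.buffer = deque(zip(vectors, points), maxlen=self.buffer.maxlen)
        return before, stress(points, dist)


layout = StreamLayout(capacity=4, seed=1)
first = [bag_of_words(t) for t in
         ["document stream visualization", "stream of documents", "graph drawing"]]
second = [bag_of_words(t) for t in
          ["least squares meshes", "visualization of text streams"]]
for batch in (first, second):
    before, after = layout.add_batch(batch)
    print(len(layout.buffer), after <= before + 1e-9)
```

Because the buffer capacity and the batch size are both fixed in this sketch, the work done per call to `add_batch` is bounded by a constant, which mirrors the experimental setting the abstract describes. The quadratic all-pairs distance computation is a simplification chosen for brevity.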