XStreamCluster: an efficient algorithm for streaming XML data clustering

Authors:
Odysseas Papapetrou; Ling Chen
Affiliations:
L3S Research Center, University of Hannover, Germany; QCIS, University of Technology Sydney, Australia
Venue:
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications - Volume Part I
Year:
2011

Abstract

XML clustering has many applications, ranging from storage to query processing. However, existing clustering algorithms focus on static XML collections, whereas modern information systems frequently deal with streaming XML data that must be processed online. Streaming XML clustering is challenging because online approaches must meet strict computational and space efficiency requirements. In this paper we propose XStreamCluster, which addresses these two challenges with a two-layered optimization. The bottom layer employs Bloom filters to encode the XML documents, providing a space-efficient solution to memory usage. The top layer is based on Locality Sensitive Hashing and contributes to computational efficiency. The theoretical analysis shows that, with high probability, the approximate solution of XStreamCluster generates clusters of similar quality to those of the exact solution. The experimental results demonstrate that, compared to its variants and a baseline approach, XStreamCluster improves both memory efficiency and computational time by at least an order of magnitude without affecting clustering quality.
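
The two-layer idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state which XML features are encoded or which LSH family is used, so the choices here are assumptions made for illustration: root-to-element tag paths as the encoded features, MinHash banding over the positions of the set bits, Jaccard similarity on the filters, the first document of a cluster as its representative, and all names and parameter values (M, K, BANDS, ROWS, threshold).

```python
import hashlib
import random
import xml.etree.ElementTree as ET

M = 256     # Bloom filter length in bits (assumed value)
K = 3       # hash functions per inserted path (assumed value)
BANDS = 8   # number of LSH hash tables (assumed value)
ROWS = 2    # min-hashes concatenated per table key (assumed value)


def root_paths(xml_text):
    """Return the set of root-to-element tag paths, e.g. 'a/b'."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def bloom_encode(items, m=M, k=K):
    """Bottom layer: encode a set of strings as an m-bit Bloom filter (an int)."""
    bits = 0
    for item in items:
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def jaccard(a, b):
    """Jaccard similarity of the set bits of two filters."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def make_permutations(m=M, bands=BANDS, rows=ROWS, seed=42):
    """One random permutation of the bit positions per (band, row)."""
    rng = random.Random(seed)
    return [[rng.sample(range(m), m) for _ in range(rows)] for _ in range(bands)]


def lsh_keys(bits, perms, m=M):
    """Top layer: one MinHash-based bucket key per band."""
    positions = [p for p in range(m) if (bits >> p) & 1]
    return [(i, tuple(min(perm[p] for p in positions) for perm in band))
            for i, band in enumerate(perms)]


class StreamClusterer:
    """Assigns each arriving document to a cluster in one pass."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.perms = make_permutations()
        self.buckets = {}          # LSH key -> set of cluster ids
        self.representatives = []  # cluster id -> filter of its first document

    def add(self, xml_text):
        bits = bloom_encode(root_paths(xml_text))
        keys = lsh_keys(bits, self.perms)
        # Only clusters sharing at least one bucket are compared exactly.
        candidates = set()
        for key in keys:
            candidates |= self.buckets.get(key, set())
        best = max(candidates,
                   key=lambda c: jaccard(bits, self.representatives[c]),
                   default=None)
        if best is not None and jaccard(bits, self.representatives[best]) >= self.threshold:
            return best
        # No sufficiently similar candidate: open a new cluster.
        self.representatives.append(bits)
        cid = len(self.representatives) - 1
        for key in keys:
            self.buckets.setdefault(key, set()).add(cid)
        return cid


clusterer = StreamClusterer()
stream = ["<a><b/><c/></a>", "<a><b>1</b><c>2</c></a>", "<x><y><z/></y></x>"]
labels = [clusterer.add(doc) for doc in stream]
print(labels)
```

In this sketch the first two documents have identical tag paths, so they produce the same filter and the same bucket keys and receive the same label. Each document occupies a fixed M bits regardless of its size, and each arrival is compared only with the clusters it collides with in some bucket rather than with every cluster.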