XML structural similarity search using MapReduce

Authors:
Peisen Yuan;Chaofeng Sha;Xiaoling Wang;Bin Yang;Aoying Zhou;Su Yang
Affiliations:
School of Computer Science, Fudan University, P.R.C and Shanghai Key Laboratory of Intelligent Information Processing, P.R.C;School of Computer Science, Fudan University, P.R.C and Shanghai Key Laboratory of Intelligent Information Processing, P.R.C;Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, P.R.C;School of Computer Science, Fudan University, P.R.C and Shanghai Key Laboratory of Intelligent Information Processing, P.R.C;Shanghai Key Laboratory of Intelligent Information Processing, P.R.C and Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, P.R.C;School of Computer Science, Fudan University, P.R.C and Shanghai Key Laboratory of Intelligent Information Processing, P.R.C
Venue:
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Year:
2010

Abstract

XML is a de facto standard for web data exchange and information representation. Managing large volumes of XML data efficiently poses challenges for conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted growing attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures of the XML structures are extracted in parallel from a large XML corpus stored on a large cluster. Then Min-Hashing and locality-sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search. An empirical study on the cluster with large real-world datasets demonstrates the effectiveness and efficiency of our approach.
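
The pipeline summarized in the abstract (extract sub-structures, Min-Hash them, group by locality-sensitive hashing) can be illustrated with a minimal, single-process sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: parent/child tag pairs stand in for the paper's sub-structures, the hash family is (a*x + b) mod p, the signature has 24 values split into 12 bands, and the names substructures, minhash, map_phase and reduce_phase are hypothetical. The map and reduce steps are plain Python functions rather than jobs on a real cluster.

```python
import random
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict

PRIME = 2_147_483_647   # Mersenne prime 2^31 - 1 (assumed modulus)
NUM_HASHES = 24         # signature length (assumed)
BANDS = 12              # 12 bands of 2 rows each (assumed)

_rng = random.Random(42)
HASH_PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]


def substructures(xml_text):
    """Return the set of parent/child tag pairs (a stand-in sub-structure)."""
    root = ET.fromstring(xml_text)
    pairs = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            pairs.add(f"{node.tag}/{child.tag}")
            stack.append(child)
    return pairs


def minhash(items):
    """Min-Hash signature: one minimum per hash function (a*x + b) mod PRIME."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min(((a * x + b) % PRIME for x in ids), default=PRIME)
            for a, b in HASH_PARAMS]


def map_phase(doc_id, xml_text):
    """Emit one (band key, doc id) pair per LSH band of the signature."""
    sig = minhash(substructures(xml_text))
    rows = NUM_HASHES // BANDS
    for band in range(BANDS):
        yield (band, tuple(sig[band * rows:(band + 1) * rows])), doc_id


def reduce_phase(pairs):
    """Group doc ids by band key; documents sharing a bucket are candidates."""
    buckets = defaultdict(set)
    for key, doc_id in pairs:
        buckets[key].add(doc_id)
    candidates = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                candidates.add((first, second))
    return candidates


# Toy corpus: "a" and "b" share a structure, "c" does not.
docs = {
    "a": "<lib><book><title>A</title><author>X</author></book></lib>",
    "b": "<lib><book><title>B</title><author>Y</author></book></lib>",
    "c": "<order><item><sku>1</sku><qty>2</qty></item></order>",
}
emitted = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
candidates = reduce_phase(emitted)
```

Documents with identical sub-structure sets always get identical signatures and therefore land in the same bucket in every band. With more rows per band, fewer dissimilar pairs collide; with more bands, fewer similar pairs are missed. Candidate pairs would normally be verified afterwards with an exact similarity measure.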