This paper proposes a scheme for similarity joins over XML data, based on serializing XML data and then performing similarity matching over subsequences of XML nodes. With the recent rapid spread of XML, large volumes of electronic data are now marked up in XML. As a consequence, a growing amount of XML data represents similar content with dissimilar structures. Similarity joins have been used to extract as much information as possible from such heterogeneous data. Our proposed similarity join for XML data can be summarized as follows: 1) we serialize XML data as XML node sequences; 2) we extract semantically and structurally coherent subsequences; 3) we filter out dissimilar subsequences using textual information; and 4) we extract pairs of subsequences as the final result by checking structural similarity. This process is costly to execute. To make it scale to large document sets, we use a Bloom filter to speed up the text similarity computation. We demonstrate the feasibility of the proposed scheme through experiments.
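The abstract does not spell out how the Bloom filter is applied, so the following is only a minimal sketch of one plausible reading of the text-filtering step: each subsequence's text is hashed into a fixed-size bit vector, and pairs are pruned by comparing bit vectors instead of token sets. The filter size `M`, the hash count `K`, the salted SHA-1 hashing, the Jaccard-style bit ratio, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
import hashlib

M = 256  # bits per filter (assumed value)
K = 3    # hash functions per token (assumed value)


def bit_positions(token, m=M, k=K):
    """Derive k bit positions for a token from salted SHA-1 digests."""
    positions = []
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{token}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def bloom(tokens, m=M, k=K):
    """Build a Bloom filter, stored as a Python int used as a bit vector."""
    bits = 0
    for token in set(tokens):
        for pos in bit_positions(token, m, k):
            bits |= 1 << pos
    return bits


def bloom_similarity(a, b):
    """Ratio of shared set bits to all set bits (a Jaccard-style estimate)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def candidate_pairs(left_texts, right_texts, threshold=0.5):
    """Return index pairs whose filters are similar enough to keep."""
    left = [bloom(text.lower().split()) for text in left_texts]
    right = [bloom(text.lower().split()) for text in right_texts]
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if bloom_similarity(a, b) >= threshold
    ]
```

In a pipeline like the one described, only the pairs that `candidate_pairs` returns would go on to the more expensive structural similarity check. Because distinct tokens can collide on the same bits, the bit-level ratio is an approximation of token overlap, so the threshold would need tuning against real data.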