Approximate matching of hierarchical data using pq-grams

Authors:
Nikolaus Augsten;Michael Böhlen;Johann Gamper
Affiliations:
Free University of Bozen-Bolzano, Bozen, Italy;Free University of Bozen-Bolzano, Bozen, Italy;Free University of Bozen-Bolzano, Bozen, Italy
Venue:
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Year:
2005

Abstract

When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not the only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.

We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real-world data confirm the analytic results and the scalability of our approach.
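
The abstract does not spell out the construction, so the following is only an illustrative sketch of how a pq-gram profile and a normalized pq-gram distance could be computed. It assumes the commonly cited formulation: each tree is padded with dummy nodes (`*`), a pq-gram is a stem of p ancestor labels plus a base of q consecutive child labels, and the distance is 1 - 2|P1 ∩ P2| / |P1 ⊎ P2| over bags of pq-grams. The tree encoding as `(label, children)` tuples, the function names, and the defaults p=2, q=3 are assumptions made for this example, not taken from the record above.

```python
from collections import Counter

DUMMY = "*"  # label of the padding nodes (assumed convention)


def pq_gram_profile(tree, p=2, q=3):
    """Return the bag (Counter) of pq-gram label tuples of `tree`.

    `tree` is a (label, [children]) tuple; children are ordered.
    Each pq-gram is a tuple of p stem labels followed by q base labels.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the node itself plus its p-1 nearest ancestors.
        stem = (ancestors + (label,))[-p:]
        base = (DUMMY,) * q
        if not children:
            # A leaf contributes one pq-gram with an all-dummy base.
            profile[stem + base] += 1
            return
        for child in children:
            # Slide a window of width q over the padded child list.
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):
            # Trailing windows that run into the right-hand padding.
            base = base[1:] + (DUMMY,)
            profile[stem + base] += 1

    visit(tree, (DUMMY,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Normalized pq-gram distance in [0, 1] (0 means identical profiles)."""
    p1 = pq_gram_profile(t1, p, q)
    p2 = pq_gram_profile(t2, p, q)
    shared = sum((p1 & p2).values())             # bag intersection
    total = sum(p1.values()) + sum(p2.values())  # bag union (sum of sizes)
    return 1 - 2 * shared / total


# Two hypothetical address records that differ only in the city spelling.
addr_a = ("address", [("street", [("Main St", [])]),
                      ("city", [("Bozen", [])])])
addr_b = ("address", [("street", [("Main St", [])]),
                      ("city", [("Bolzano", [])])])

print(pq_gram_distance(addr_a, addr_b))
```

Because only the pq-grams that touch the changed node differ, the two address trees end up with a distance strictly between 0 and 1, which is the kind of approximate match the abstract describes. Profiles are plain bags of label tuples, so they could also be stored and compared in a relational table.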