Algorithms for clustering data
Algorithms for clustering data
On the editing distance between unordered labeled trees
Information Processing Letters
Query evaluation: strategies and optimizations
Information Processing and Management: an International Journal
The Tree-to-Tree Correction Problem
Journal of the ACM (JACM)
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
XClust: clustering XML schemas for effective integration
Proceedings of the eleventh international conference on Information and knowledge management
Combining document representations for known-item search
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Searching XML documents via XML fragments
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
A bag of paths model for measuring structural similarity in Web documents
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Fast Detection of XML Structural Similarity
IEEE Transactions on Knowledge and Data Engineering
DogmatiX tracks down duplicates in XML
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximate matching of hierarchical data using pq-grams
VLDB '05 Proceedings of the 31st international conference on Very large data bases
A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Integrating XML data sources using approximate joins
ACM Transactions on Database Systems (TODS)
Record linkage: similarity measures and algorithms
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Articulating information needs in XML query languages
ACM Transactions on Information Systems (TOIS)
An efficient infrastructure for native transactional XML processing
Data & Knowledge Engineering
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Benchmarking declarative approximate selection predicates
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Xproj: a framework for projected structural clustering of xml documents
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Comparison of Complete and Elementless Native Storage of XML Documents
IDEAS '07 Proceedings of the 11th International Database Engineering and Applications Symposium
Measuring the structural similarity of semistructured documents using entropy
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Approximate Joins for Data-Centric XML
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Fast Indexes and Algorithms for Set Similarity Selection Queries
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
A Decade of XML Data Management: An Industrial Experience Report from Oracle
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Overview of the INEX 2008 XML Mining Track
Advances in Focused Retrieval
Ingredients for accurate, fast, and robust XML similarity joins
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part II
Leveraging the storage layer to support XML similarity joins in XDBMSs
ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Hi-index | 0.00 |
A natural consequence of the widespread adoption of XML as standard for information representation and exchange is the redundant storage of large amounts of persistent XML documents. Compared to relational data tables, data represented in XML format can potentially be even more sensitive to data quality issues because structure, besides textual information, may cause variations in XML documents representing the same information entity. Therefore, correlating XML documents, which are similar in content an structure, is a fundamental operation. In this paper, we present an effective, flexible, and high-performance XML-based similarity join framework. We exploit structural summaries and clustering concepts to produce compact and high-quality XML document representations: our approach outperforms previous work both in terms of performance and accuracy. In this context, we explore different ways to weigh and combine evidence from textual and structural XML representations. Furthermore, we address user interaction, when the similarity framework is configured for a specific domain, and updatability of clustering information, when new documents enter datasets under consideration. We present a thorough experimental evaluation to validate our techniques in the context of a native XML DBMS.