Recent office document formats are XML archives, so each document consists of multiple XML files. These files carry information about page structure and about styles such as font, color, and position. However, existing text-based search engines do not take the structure and style of documents into account. Using this information makes similarity search over office documents based on structure and style possible. We propose SOS, a similarity search method based on the structures and styles of office documents. Computing a similarity value between two office documents requires computing similarity values between multiple pairs of XML files in those documents. We also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf-node clustering algorithm. In our experiments, we use docx, xlsx, and pptx files and evaluate SOS and LAX+ by precision and recall.
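To make the two-level computation concrete, the following is a minimal sketch of how a document-level similarity might be assembled from pairwise similarities between XML files inside two archives. It is not SOS or LAX+: the per-file measure here is a plain Jaccard overlap of leaf-node (path, text) pairs, swapped in as a simple stand-in for leaf-node clustering. The function names (`leaf_signatures`, `part_similarity`, `document_similarity`, `make_archive`), the pairing of files by identical archive path, and the unweighted average are all assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def leaf_signatures(xml_bytes):
    """Return the set of (path, text) pairs for the leaf elements of one XML file."""
    root = ET.fromstring(xml_bytes)
    leaves = set()

    def walk(node, path):
        here = path + "/" + node.tag
        children = list(node)
        if not children:
            leaves.add((here, (node.text or "").strip()))
        for child in children:
            walk(child, here)

    walk(root, "")
    return leaves


def part_similarity(xml_a, xml_b):
    """Jaccard overlap of leaf signatures (an illustrative stand-in, not LAX+)."""
    a, b = leaf_signatures(xml_a), leaf_signatures(xml_b)
    return len(a & b) / len(a | b)


def xml_parts(archive_bytes):
    """Map archive path -> raw bytes for every .xml entry in a ZIP-based office file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {n: zf.read(n) for n in zf.namelist() if n.endswith(".xml")}


def document_similarity(doc_a, doc_b):
    """Average per-file similarity over all file names; unmatched files score 0.

    Assumption: files are paired by identical archive path and weighted equally.
    """
    parts_a, parts_b = xml_parts(doc_a), xml_parts(doc_b)
    names = set(parts_a) | set(parts_b)
    if not names:
        return 0.0
    total = 0.0
    for name in names:
        if name in parts_a and name in parts_b:
            total += part_similarity(parts_a[name], parts_b[name])
    return total / len(names)


def make_archive(parts):
    """Build a small in-memory ZIP archive from a {path: xml_text} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in parts.items():
            zf.writestr(name, content)
    return buf.getvalue()


# Two toy "documents" with the same content but one differing style value.
doc1 = make_archive({
    "word/document.xml": "<doc><p><r>hello</r></p></doc>",
    "word/styles.xml": "<styles><font>Arial</font><color>red</color></styles>",
})
doc2 = make_archive({
    "word/document.xml": "<doc><p><r>hello</r></p></doc>",
    "word/styles.xml": "<styles><font>Arial</font><color>blue</color></styles>",
})
score = document_similarity(doc1, doc2)
```

In this toy case the content files match exactly while the style files share one of three distinct leaf signatures, so the style difference lowers the overall score. A real docx, xlsx, or pptx file could be passed in as raw bytes in the same way, though namespaces and the much larger number of files would make a weighted or clustered measure more appropriate than this equal-weight average.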