Style-based similarity search for office XML documents

Authors:
Yousuke Watanabe; Hidetaka Kamigaito; Haruo Yokota
Affiliations:
Tokyo Institute of Technology, Tokyo, Japan (all authors)
Venue:
Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
Year:
2012

Abstract

Recent office documents are stored in an XML-based archive format, so each document consists of multiple XML files. These XML files include information about page structure and styles such as font, color, and position. However, existing text-based search engines do not take the structure and style of documents into account. By exploiting this information, we can perform similarity search over office documents based on their structures and styles. We propose SOS, a similarity search method based on the structures and styles of office documents. To compute a similarity value between two office documents, we have to compute similarity values between multiple pairs of XML files in the documents. We also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf-node clustering algorithm. In our experiments, we use docx, xlsx, and pptx files and evaluate SOS and LAX+ in terms of precision and recall.
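
To make the overall idea concrete, the following is a minimal Python sketch of comparing two office-style archives by structure and style rather than by text. It is not the SOS or LAX+ algorithm, whose details the abstract does not give. As a stand-in, it assumes that each XML part can be summarized by the set of its leaf elements (tag path plus attributes), that parts are paired by identical file name, that a pair is scored with plain Jaccard similarity, and that the document score is the unweighted mean over parts. All function names (`xml_parts`, `leaf_features`, `jaccard`, `document_similarity`, `make_archive`) and the sample XML are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(archive_bytes):
    """Return {part name: raw XML bytes} for every .xml member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in zf.namelist() if name.endswith(".xml")}


def leaf_features(xml_bytes):
    """Collect (tag path, sorted attributes) for each leaf element.

    Text content is ignored on purpose, so only structure and style
    attributes contribute to the comparison (an assumption of this sketch).
    """
    features = set()

    def walk(elem, path):
        path = path + "/" + elem.tag
        children = list(elem)
        if not children:
            features.add((path, tuple(sorted(elem.attrib.items()))))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_bytes), "")
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def document_similarity(doc_a, doc_b):
    """Average per-part similarity; a part missing on one side scores 0."""
    parts_a, parts_b = xml_parts(doc_a), xml_parts(doc_b)
    names = set(parts_a) | set(parts_b)
    if not names:
        return 0.0
    total = 0.0
    for name in names:
        if name in parts_a and name in parts_b:
            total += jaccard(leaf_features(parts_a[name]), leaf_features(parts_b[name]))
    return total / len(names)


def make_archive(parts):
    """Build an in-memory zip archive from {part name: XML string} (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


if __name__ == "__main__":
    styles = '<styles><style name="Normal" size="11"/></styles>'
    a = make_archive({"word/document.xml": '<doc><p><r font="Arial">hello</r></p></doc>',
                      "word/styles.xml": styles})
    b = make_archive({"word/document.xml": '<doc><p><r font="Arial">world</r></p></doc>',
                      "word/styles.xml": styles})
    c = make_archive({"word/document.xml": '<doc><p><r font="Times">hello</r></p></doc>',
                      "word/styles.xml": styles})
    print(document_similarity(a, b))  # same style, different text
    print(document_similarity(a, c))  # same text, different font
```

In this sketch, two archives that differ only in their text receive the maximum score, while a change to a style attribute lowers it. Real docx, xlsx, and pptx files use namespaced tags and many more parts, and a practical method would need a more tolerant pairing and scoring of XML files than exact set overlap.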