Style-based similarity search for office XML documents

  • Authors: Yousuke Watanabe; Hidetaka Kamigaito; Haruo Yokota

  • Affiliations: Tokyo Institute of Technology, Tokyo, Japan (all authors)

  • Venue: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
  • Year: 2012

Abstract

Recent office documents use an XML archive format, so each document consists of multiple XML files. These XML files include information about page structures and styles such as font, color, and position. However, existing text-based search engines do not focus on the structure and style of documents. By exploiting this information, we can achieve similarity search for office documents based on their structures and styles. We propose SOS, a similarity search method based on the structures and styles of office documents. To compute a similarity value between two office documents, we have to compute similarity values between multiple pairs of XML files in the documents. We also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf node clustering algorithm. In our experiments, we use docx, xlsx, and pptx files and evaluate SOS and LAX+ in terms of precision and recall.
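
As a rough illustration of the idea that an office document is an archive of XML files compared pair by pair, the following Python sketch may help. It is not the authors' SOS or LAX+: the abstract does not specify either, so plain Jaccard overlap of leaf-node features is swapped in as a stand-in for the per-file measure, and a simple average over part names is assumed for the document-level score. All function names (`xml_parts`, `leaf_features`, `jaccard`, `document_similarity`, `make_archive`) are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(archive_bytes):
    """Return {part name: parsed root element} for each *.xml part in a ZIP archive."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            # Assumption: only parts ending in ".xml" are compared.
            if name.endswith(".xml"):
                parts[name] = ET.fromstring(zf.read(name))
    return parts


def leaf_features(root):
    """Collect (tag path, sorted attributes, text) for every leaf element."""
    features = set()

    def walk(elem, path):
        path = path + "/" + elem.tag
        children = list(elem)
        if not children:
            attrs = tuple(sorted(elem.attrib.items()))
            features.add((path, attrs, (elem.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(root, "")
    return features


def jaccard(a, b):
    """Jaccard overlap of two sets (stand-in measure, not LAX+)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def document_similarity(doc_a, doc_b):
    """Average per-part similarity; a part present in only one document scores 0."""
    parts_a, parts_b = xml_parts(doc_a), xml_parts(doc_b)
    names = set(parts_a) | set(parts_b)
    if not names:
        return 0.0
    total = 0.0
    for name in names:
        if name in parts_a and name in parts_b:
            total += jaccard(leaf_features(parts_a[name]),
                             leaf_features(parts_b[name]))
    return total / len(names)


def make_archive(files):
    """Build a small in-memory ZIP archive from {name: XML string} for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()
```

Because styles live in attributes and in dedicated parts of the archive, two documents with identical text but different fonts or colors would score below 1 under this sketch, which is the kind of distinction a purely text-based search would miss.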