SEDA: a system for search, exploration, discovery, and analysis of XML Data

  • Authors:
  • Andrey Balmin;Latha Colby;Emiran Curtmola;Quanzhong Li;Fatma Özcan;Sharath Srinivas;Zografoula Vagena

  • Affiliations:
  • IBM Almaden Research Center;IBM Almaden Research Center;UC San Diego;IBM Almaden Research Center;IBM Almaden Research Center;University of Maryland, College Park;Microsoft Research

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2008

Abstract

Keyword search in XML repositories is a powerful tool for interactive data exploration. Much recent work has made XML search aware of the relationship information embedded in XML document structure, but no single approach is a clear winner across all data and query scenarios. Furthermore, because keyword search is inherently imprecise, its results cannot easily be analyzed and summarized to gain further insight into the data. We address these shortcomings with SEDA: a system for Search, Exploration, Discovery, and Analysis of XML Data. SEDA is based on a paradigm of search and user interaction that lets users start with simple keyword-style querying and then perform rich analysis of XML data by leveraging both the content and the structure of the data. SEDA is an interactive system that allows the user to refine her query iteratively to explore the XML data and discover interesting relationships. SEDA first employs a top-k algorithm to quickly compute the k most relevant answers, and returns tuples of nodes ranked by relevance. SEDA provides several novel data structures and techniques for efficient top-k computation over graph-structured XML data. SEDA also computes all the contexts in which the query terms are found and all the connection paths that connect the query terms in the XML data. These two summaries enable the user to refine her query by disambiguating the contexts and connections relevant to it. With this user feedback, the system has enough information to compute all query results, not just the top-k. From the complete results, SEDA automatically deduces a star schema, which is then instantiated with the query results and augmented with the additional values required for a well-defined data cube. The tables computed at this step are input into an OLAP engine for further analysis.
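
To make the idea of a context summary more concrete, the following is a minimal sketch of how the contexts of a query term might be collected. It is not SEDA's actual algorithm or data structure: it assumes that a "context" is simply the root-to-element tag path of an element whose text contains the keyword, and the function name `keyword_contexts` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def keyword_contexts(xml_text, keywords):
    """Map each keyword to the tag paths whose element text contains it.

    Assumption: a context is the root-to-element tag path. The value for
    each path is the number of matching elements found under that path.
    """
    root = ET.fromstring(xml_text)
    contexts = {kw: defaultdict(int) for kw in keywords}

    def walk(elem, path):
        current = f"{path}/{elem.tag}"
        text = (elem.text or "").lower()
        for kw in keywords:
            # Case-insensitive substring match on the element's own text.
            if kw.lower() in text:
                contexts[kw][current] += 1
        for child in elem:
            walk(child, current)

    walk(root, "")
    return {kw: dict(paths) for kw, paths in contexts.items()}

# Hypothetical sample data: "Paris" appears as a title word, an author
# surname, and a city, so the keyword alone is ambiguous.
SAMPLE = """
<library>
  <book><title>Paris in Spring</title><author>Ann Paris</author></book>
  <book><title>Databases</title><author>Bob Lee</author></book>
  <trip><city>Paris</city></trip>
</library>
"""

if __name__ == "__main__":
    for path, count in keyword_contexts(SAMPLE, ["paris"])["paris"].items():
        print(path, count)
```

A summary like this shows the user that the same term occurs under several distinct paths, which is the kind of ambiguity the abstract says the user resolves by selecting the contexts relevant to her query.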