Lightweight integration of IR and DB for scalable hybrid search with integrated ranking support

  • Authors:
  • Haofen Wang; Thanh Tran; Chang Liu; Linyun Fu

  • Affiliations:
  • Shanghai Jiao Tong University, Shanghai 200240, China; Institute AIFB, Universität Karlsruhe, D-76128 Karlsruhe, Germany; Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Jiao Tong University, Shanghai 200240, China

  • Venue:
  • Web Semantics: Science, Services and Agents on the World Wide Web
  • Year:
  • 2011

Abstract

The Web contains a large number of documents and an increasing quantity of structured data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries are the principal means of retrieving structured data, keyword queries are typically used for document retrieval. A form of hybrid search that seamlessly integrates these two formalisms to query both textual and structured data can therefore address more complex information needs. However, hybrid search in the large-scale Web environment faces several challenges. First, there is a need for repositories that can store and index large amounts of semantic data as well as textual data in documents, and manage them in an integrated way. Second, methods for hybrid query answering are needed to exploit the data in such an integrated repository. These methods should be fast and scalable and, in particular, should support flexible ranking schemes that return only the most relevant results rather than all of them. In this paper, we present CE^2, an integrated solution that leverages mature information retrieval and database technologies to support large-scale hybrid search. For scalable and integrated management of data, CE^2 integrates off-the-shelf database solutions with inverted indexes. Efficient hybrid query processing is supported through novel data structures and algorithms that allow advanced ranking schemes to be tightly integrated. Furthermore, a concrete ranking scheme is proposed that takes features from both textual and structured data into account. Experiments conducted on DBpedia and Wikipedia show that CE^2 provides good performance in terms of both effectiveness and efficiency.
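
To make the idea of a hybrid query concrete, the following is a minimal toy sketch in Python and not the CE^2 implementation. It assumes a tiny in-memory inverted index and a list of triples annotating the same documents. All names and data (DOCS, TRIPLES, hybrid_search, and so on) are hypothetical. The ranking is a plain term-count score, so it does not reproduce the paper's ranking scheme, which also draws on features from the structured data.

```python
from collections import defaultdict

# Hypothetical toy data: documents plus RDF-style triples annotating them.
DOCS = {
    "d1": "shanghai is a large city in china",
    "d2": "karlsruhe is a city in germany",
    "d3": "the yangtze river flows through china",
}
TRIPLES = [
    ("d1", "type", "City"), ("d1", "country", "China"),
    ("d2", "type", "City"), ("d2", "country", "Germany"),
    ("d3", "type", "River"), ("d3", "country", "China"),
]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def keyword_scores(index, keywords):
    """Sum term frequencies of the query keywords per document."""
    scores = defaultdict(int)
    for kw in keywords:
        for doc_id, tf in index.get(kw, {}).items():
            scores[doc_id] += tf
    return scores


def structured_matches(triples, patterns):
    """Return subjects that satisfy every (predicate, object) pattern."""
    by_subject = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add((p, o))
    return {s for s, po in by_subject.items() if set(patterns) <= po}


def hybrid_search(docs, triples, keywords, patterns, k=10):
    """Keep documents matching both query parts, ranked by text score."""
    text = keyword_scores(build_inverted_index(docs), keywords)
    matched = structured_matches(triples, patterns)
    results = [(d, text[d]) for d in matched if text.get(d, 0) > 0]
    # Highest score first; ties are broken by document id for determinism.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results[:k]


if __name__ == "__main__":
    print(hybrid_search(DOCS, TRIPLES, ["city", "china"], [("type", "City")]))
```

In this sketch the structured part acts as a filter and the keyword part supplies the score. A scheme closer to the one the abstract describes would instead fold structural features into the score and avoid materializing all matches before ranking.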