Optimization of sub-query processing in distributed data integration systems

  • Authors:
  • Gang Chen;Yongwei Wu;Jia Liu;Guangwen Yang;Weimin Zheng

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China

  • Venue:
  • Journal of Network and Computer Applications
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Data integration system (DIS) is becoming paramount when Cloud/Grid applications need to integrate and analyze data from geographically distributed data sources. DIS gathers data from multiple remote sources, integrates and analyzes the data to obtain a query result. As Clouds/Grids are distributed over wide-area networks, communication cost usually dominates overall query response time. Therefore we can expect that query performance can be improved by minimizing communication cost. In our method, DIS uses a data flow style query execution model. Each query plan is mapped to a group of @mEngines, each of which is a program corresponding to a particular operator. Thus, multiple sub-queries from concurrent queries are able to share @mEngines. We reconstruct these sub-queries to exploit overlapping data among them. As a result, all the sub-queries can obtain their results, and overall communication overhead can be reduced. Experimental results show that, when DIS runs a group of parameterized queries, our reconstructing algorithm can reduce the average query completion time by 32-48%; when DIS runs a group of non-parameterized queries, the average query completion time of queries can be reduced by 25-35%.