Approximate entity extraction in temporal databases

  • Authors:
  • Wei Lu;Gabriel Pui Fung;Xiaoyong Du;Xiaofang Zhou;Lijiang Chen;Ke Deng

  • Affiliations:
  • School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;Data Mining and Machine Learning Group, Arizona State University, Tempe, USA;School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China and School of ITEE, The U ...;Department of Computer Science, Peking University, Beijing, China 100872;School of ITEE, The University of Queensland, Brisbane, Australia

  • Venue:
  • World Wide Web
  • Year:
  • 2011

Abstract

We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and a variety of methods exist to define the similarity between a record and the search query. In temporal databases, however, each entity is represented as a sequence of historical records, and how to properly define the similarity of an entity remains an open problem. The main challenge is that, when a user issues a search query for an entity, he or she is prone to mix up information about the same entity from different time points. As a result, the record-granularity methods used in relational databases no longer apply. Instead, we regard each entity as a set of "virtual records", where the attribute values of a "virtual record" can come from different records of the same entity. In this paper, we propose a novel evaluation model that quantifies the similarity between each "virtual record" and the query, and we take the maximum similarity over an entity's "virtual records" as the similarity of that entity. Because the number of "virtual records" of an entity is exponentially large, calculating the similarity of the entity is challenging. We therefore propose a Dominating Tree Algorithm (DTA), based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The results show that our model for defining the similarity between each entity and the search query is effective, and that the proposed DTA outperforms the naive approach by at least two orders of magnitude.
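
To make the "virtual record" idea concrete, the following is a minimal Python sketch of the naive baseline only: it enumerates every virtual record of an entity, scores each against the query, and takes the maximum. It is not the Dominating Tree Algorithm, and the abstract does not specify the paper's evaluation model, so `toy_similarity` (the fraction of query attributes matched exactly) is a hypothetical stand-in. All function names and the sample data are likewise assumptions made for illustration.

```python
from itertools import product
import heapq


def virtual_records(history):
    """Yield every virtual record of one entity.

    `history` is a list of records (dicts sharing the same attribute keys).
    A virtual record takes each attribute value from any record of the entity.
    """
    attrs = list(history[0])
    # Distinct values per attribute, in order of first appearance.
    choices = [list(dict.fromkeys(rec[a] for rec in history)) for a in attrs]
    for combo in product(*choices):
        yield dict(zip(attrs, combo))


def toy_similarity(record, query):
    """Hypothetical stand-in score: fraction of query attributes matched exactly."""
    hits = sum(1 for attr, value in query.items() if record.get(attr) == value)
    return hits / len(query)


def entity_similarity(history, query):
    """Entity similarity = maximum similarity over all of its virtual records."""
    return max(toy_similarity(vr, query) for vr in virtual_records(history))


def top_k_naive(database, query, k):
    """Naive top-K: score every entity by exhaustive enumeration."""
    scored = ((entity_similarity(hist, query), name) for name, hist in database.items())
    return heapq.nlargest(k, scored)


# Illustrative data (invented): the query mixes values from two points in time.
database = {
    "alice": [
        {"city": "Beijing", "employer": "Acme", "title": "Engineer"},
        {"city": "Brisbane", "employer": "Globex", "title": "Manager"},
    ],
    "bob": [
        {"city": "Beijing", "employer": "Globex", "title": "Analyst"},
    ],
}
query = {"city": "Beijing", "employer": "Globex", "title": "Manager"}

print(top_k_naive(database, query, k=1))  # [(1.0, 'alice')]
```

No single record of "alice" matches the whole query, but one of her virtual records does, which is why record-level matching would miss her. The enumeration grows as the product of the distinct values per attribute, which illustrates why exhaustive scoring becomes impractical for entities with long histories.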