Approximate entity extraction in temporal databases

Authors:
Wei Lu;Gabriel Pui Fung;Xiaoyong Du;Xiaofang Zhou;Lijiang Chen;Ke Deng
Affiliations:
School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;Data Mining and Machine Learning Group, Arizona State University, Tempe, USA;School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China;School of Information, Renmin University of China, Beijing, China 100872 and Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China and School of ITEE, The U ...;Department of Computer Science, Peking University, Beijing, China 100872;School of ITEE, The University of Queensland, Brisbane, Australia
Venue:
World Wide Web
Year:
2011

Abstract

We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and a variety of methods exist for defining the similarity between a record and the search query. In temporal databases, however, each entity is represented as a sequence of historical records, and how to properly define the similarity of an entity to a query remains an open problem. The main challenge is that a user who issues a search query for an entity is prone to mix up information about the same entity from different time points. As a result, methods designed for relational databases, which operate at record granularity, no longer apply. Instead, we regard each entity as a set of "virtual records", where the attribute values of a "virtual record" can come from different records of the same entity. In this paper, we propose a novel evaluation model that effectively quantifies the similarity between each "virtual record" and the query, and we take the maximum similarity over an entity's "virtual records" as the similarity of that entity. Because the number of "virtual records" of an entity is exponentially large, calculating the similarity of the entity is challenging. We therefore further propose a Dominating Tree Algorithm (DTA), based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The results show that our model for defining the similarity between an entity and the search query is effective, and that DTA outperforms the naive approach by at least two orders of magnitude.
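
To make the "virtual record" idea concrete, the following is a minimal Python sketch of the naive approach only: it enumerates every virtual record of an entity and keeps the maximum score. It does not implement the paper's evaluation model or DTA, neither of which is specified in the abstract. The helper names (attr_sim, virtual_records, entity_similarity, top_k), the use of difflib's SequenceMatcher as the per-attribute measure, the averaging of attribute scores, and the sample data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from heapq import nlargest
from itertools import product


def attr_sim(a: str, b: str) -> float:
    """Hypothetical per-attribute similarity in [0, 1] (not the paper's model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def virtual_records(history):
    """Return every combination of attribute values drawn from an entity's records.

    `history` is a list of equal-length tuples, one per historical record.
    """
    columns = [sorted(set(col)) for col in zip(*history)]
    return product(*columns)


def entity_similarity(history, query):
    """Naive score: the best mean attribute similarity over all virtual records."""
    return max(
        sum(attr_sim(v, q) for v, q in zip(vr, query)) / len(query)
        for vr in virtual_records(history)
    )


def top_k(entities, query, k):
    """Return the ids of the k entities most similar to the query."""
    return nlargest(k, entities, key=lambda eid: entity_similarity(entities[eid], query))


# Made-up data: each entity has two historical (name, city) records.
entities = {
    "e1": [("Alice Smith", "Brisbane"), ("Alice Jones", "Beijing")],
    "e2": [("Bob Lee", "Tempe"), ("Bob Lee", "Beijing")],
}
# The query mixes e1's old name with its newer city.
query = ("Alice Smith", "Beijing")
print(top_k(entities, query, 1))
```

In this toy data no single stored record of e1 matches the query exactly, but the virtual record ("Alice Smith", "Beijing") does, which is the situation the abstract describes. The enumeration grows as the product of the number of distinct values per attribute, which is why a pruning strategy is needed at scale.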