Effective early termination techniques for text similarity join operator

Authors:
Selma Ayşe Özalp;Özgür Ulusoy
Affiliations:
Department of Industrial Engineering, Uludag University, Gorukle Bursa, Turkey;Department of Computer Engineering, Bilkent University, Bilkent Ankara, Turkey
Venue:
ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
Year:
2005

Abstract

The text similarity join operator joins two relations if their join attributes are textually similar to each other. It has a variety of application domains, including the integration and querying of data from heterogeneous resources, data cleansing, and data mining. Although the text similarity join operator is widely used, its processing is expensive due to the huge number of similarity computations performed. In this paper, we incorporate some shortcut evaluation techniques from the Information Retrieval domain, namely the Harman, quit, continue, and maximal similarity filter heuristics, into the previously proposed text similarity join algorithms to reduce the number of similarity computations needed during the join operation. We experimentally evaluate the original and the heuristic-based similarity join algorithms using real data obtained from the DBLP Bibliography database, and observe performance improvements with the continue and maximal similarity filter heuristics.
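The abstract does not define the join algorithms or the heuristics, so the following is only an illustrative sketch, not the authors' method. It assumes a top-k join semantics, TF-IDF weighted cosine similarity, and the accumulator-limiting reading of "quit" and "continue" that is common in the Information Retrieval literature: once a cap on candidate accumulators is reached, "quit" stops processing further terms, while "continue" keeps updating existing candidates but admits no new ones. The names `tokenize`, `build_index`, `similarity_join`, `max_accumulators`, and `mode` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index with TF-IDF weights over a list of strings."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter(t for tf in tfs for t in tf)
    n = len(docs)
    # Assumption: a simple smoothed idf; the paper may use another weighting.
    idf = {t: math.log(1 + n / df[t]) for t in df}
    index = defaultdict(list)
    norms = []
    for doc_id, tf in enumerate(tfs):
        weights = {t: f * idf[t] for t, f in tf.items()}
        norms.append(math.sqrt(sum(w * w for w in weights.values())))
        for t, w in weights.items():
            index[t].append((doc_id, w))
    return index, idf, norms


def similarity_join(left, right, k=1, max_accumulators=None, mode="continue"):
    """For each left tuple, return up to k most similar right tuples.

    Returns a list of (left_id, right_id, score) triples.
    """
    index, idf, norms = build_index(right)
    result = []
    for l_id, text in enumerate(left):
        tf = Counter(tokenize(text))
        # Terms unseen in the right relation cannot contribute and are dropped.
        q = {t: f * idf[t] for t, f in tf.items() if t in idf}
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        if q_norm == 0:
            continue
        acc = {}       # right_id -> partial dot product
        full = False   # True once the accumulator cap has been reached
        # Process rare (high-idf) terms first, since they discriminate best.
        for t in sorted(q, key=lambda term: idf[term], reverse=True):
            if full and mode == "quit":
                break  # "quit": ignore all remaining terms
            for r_id, w in index[t]:
                if r_id in acc:
                    acc[r_id] += q[t] * w
                elif not full:
                    acc[r_id] = q[t] * w  # "continue" admits no new candidates
            if max_accumulators is not None and len(acc) >= max_accumulators:
                full = True
        scored = [(l_id, r_id, dot / (q_norm * norms[r_id]))
                  for r_id, dot in acc.items()]
        scored.sort(key=lambda triple: triple[2], reverse=True)
        result.extend(scored[:k])
    return result
```

Calling `similarity_join(left, right, k=1)` with no cap computes every partial score, whereas passing `max_accumulators` trades completeness for fewer similarity computations. In this sketch "continue" still scores the admitted candidates over all query terms, while "quit" leaves their scores partial.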