Similarity joins over text are important in text retrieval and querying. Because information is often represented in incomplete formats, such as abbreviations and short words, similarity joins must handle an asymmetric feature: an incomplete format may contain only part of the information in its original representation. Current approaches, including cosine similarity over q-grams, can hardly handle this asymmetry between words and their incomplete formats. To find incomplete-format information with asymmetric features, we develop a new similarity join algorithm, IJoin. We propose a novel matching scheme to identify the overlap between two entities with incomplete formats. Instead of using q-grams, we reconnect the sequence of words in a string to preserve the abbreviated information. Based on the asymmetric features of similar entities with incomplete formats, we adopt a new similarity function. Furthermore, we implement an efficient algorithm using the SQL join operation, which reduces the number of tuple pairs compared for similarity. The experimental evaluation demonstrates the effectiveness and efficiency of our approach.
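To make the idea of an asymmetric score computed through a SQL join concrete, here is a minimal sketch. It is not the IJoin algorithm: the abstract does not specify the matching scheme or the similarity function, so the sketch substitutes a simple containment score (the fraction of the short string's distinct q-grams that also occur in the full string). The function `containment_join`, the table names, and the default threshold are all hypothetical, and q-grams are used only as a stand-in for the paper's word-reconnection scheme.

```python
import sqlite3


def qgrams(s, q=3):
    """Return the set of distinct character q-grams of s (lowercased)."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def containment_join(shorts, fulls, threshold=0.7, q=3):
    """Pair each short string with full strings that contain most of its q-grams.

    The score is asymmetric: it is normalized by the short string's gram count
    only, so a long full form is not penalized for its extra grams.
    """
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per (string id, gram).
    con.execute("CREATE TABLE short_grams (sid INTEGER, gram TEXT)")
    con.execute("CREATE TABLE full_grams (fid INTEGER, gram TEXT)")
    con.execute("CREATE TABLE short_size (sid INTEGER PRIMARY KEY, n INTEGER)")

    for sid, s in enumerate(shorts):
        grams = qgrams(s, q)
        con.executemany(
            "INSERT INTO short_grams VALUES (?, ?)", [(sid, g) for g in grams]
        )
        con.execute("INSERT INTO short_size VALUES (?, ?)", (sid, len(grams)))
    for fid, f in enumerate(fulls):
        con.executemany(
            "INSERT INTO full_grams VALUES (?, ?)",
            [(fid, g) for g in qgrams(f, q)],
        )

    # The equi-join on gram only produces pairs that share at least one gram,
    # so pairs with no overlap are never compared at all.
    rows = con.execute(
        """
        SELECT s.sid, f.fid, COUNT(*) * 1.0 / z.n AS score
        FROM short_grams AS s
        JOIN full_grams AS f ON s.gram = f.gram
        JOIN short_size AS z ON z.sid = s.sid
        GROUP BY s.sid, f.fid, z.n
        HAVING COUNT(*) * 1.0 / z.n >= ?
        ORDER BY s.sid, f.fid
        """,
        (threshold,),
    ).fetchall()
    con.close()
    return [(shorts[sid], fulls[fid], score) for sid, fid, score in rows]


if __name__ == "__main__":
    matches = containment_join(
        ["Conf Data Eng"],
        ["Conference on Data Engineering", "World Wide Web"],
    )
    for short, full, score in matches:
        print(f"{short!r} -> {full!r}: {score:.2f}")
```

Normalizing by the short string alone is what makes the score asymmetric; a symmetric measure such as cosine similarity would also divide by the size of the full string and so would rate the same pair lower.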