Batch text similarity search with MapReduce

Authors:
Rui Li; Li Ju; Zhuo Peng; Zhiwei Yu; Chaokun Wang
Affiliations:
School of Software, Tsinghua University and Tsinghua National Laboratory for Information Science and Technology and Key Laboratory for Information System Security, Ministry of Education, Beijing, ...; Department of Information Engineering, Henan College of Finance and Taxation, Zhengzhou, China; School of Software, Tsinghua University, Beijing, China; Department of Computer Science and Technology, Tsinghua University; School of Software, Tsinghua University and Tsinghua National Laboratory for Information Science and Technology and Key Laboratory for Information System Security, Ministry of Education, Beijing, ...
Venue:
APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Year:
2011

References (12)

Fast parallel similarity search in multimedia databases
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data

Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases

Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering

Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Text similarity: an alternative way to search MEDLINE
Bioinformatics

Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web

MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Similarity search for web services
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web

Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Cited by (2)

Apriori-based frequent itemset mining algorithms on MapReduce
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

Multimedia Applications and Security in MapReduce: Opportunities and Challenges
Concurrency and Computation: Practice & Experience

Abstract

Batch text similarity search finds, for each query in a user-submitted batch of text queries, the texts that are similar to it. It is widely used in real-world applications such as plagiarism detection, and it is attracting growing attention as the volume of text on the web increases. Existing approaches, such as FuzzyJoin, support neither varying similarity thresholds nor online batch text similarity search. This paper proposes a two-stage algorithm, based on inverted index structures, that effectively solves the batch text similarity search problem. Experimental results on real datasets demonstrate the efficiency and scalability of the method.
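
The abstract does not spell out the two stages, so the following is only a minimal single-machine sketch of the general idea of batch similarity search over an inverted index, not the authors' algorithm. It assumes whitespace tokenization, Jaccard similarity, and an offline index-building stage followed by an online probing stage; the function names (tokenize, build_index, batch_search) are hypothetical. Because the threshold is only applied when the index is probed, it can change from one batch to the next without rebuilding the index.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization into a set of distinct tokens."""
    return set(text.lower().split())


def build_index(corpus):
    """Assumed stage 1 (offline): emit (token, doc_id) pairs per document,
    then group them by token, as a map step and a reduce step would."""
    index = defaultdict(set)
    sizes = {}
    for doc_id, text in corpus.items():
        tokens = tokenize(text)
        sizes[doc_id] = len(tokens)
        for token in tokens:          # "map": emit (token, doc_id)
            index[token].add(doc_id)  # "reduce": collect postings per token
    return index, sizes


def batch_search(index, sizes, queries, threshold):
    """Assumed stage 2 (online): probe the index for every query in the
    batch and keep documents whose Jaccard similarity reaches threshold."""
    results = {}
    for query_id, text in queries.items():
        tokens = tokenize(text)
        overlap = defaultdict(int)    # doc_id -> number of shared tokens
        for token in tokens:
            for doc_id in index.get(token, ()):
                overlap[doc_id] += 1
        matches = []
        for doc_id, shared in overlap.items():
            union = len(tokens) + sizes[doc_id] - shared
            score = shared / union
            if score >= threshold:
                matches.append((doc_id, score))
        # Highest score first; ties broken by document id.
        results[query_id] = sorted(matches, key=lambda m: (-m[1], m[0]))
    return results


# Illustrative data only; not taken from the paper's experiments.
corpus = {
    "d1": "map reduce for large clusters",
    "d2": "similarity search with map reduce",
    "d3": "cooking pasta at home",
}
queries = {"q1": "similarity search map reduce"}

index, sizes = build_index(corpus)
strict = batch_search(index, sizes, queries, threshold=0.5)
loose = batch_search(index, sizes, queries, threshold=0.2)
```

Here the same index serves both calls: the stricter threshold keeps only d2 for q1, while the looser one also admits d1. In an actual MapReduce deployment the two loops marked "map" and "reduce" would be distributed across workers; this sketch keeps them in one process for readability.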