In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
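To make the idea of an end-to-end set-similarity self-join concrete, the following is a minimal single-process sketch. The abstract does not describe what the three stages do, so the split used here (global token ordering, candidate generation with prefix filtering plus verification, and a final join back to the full records) is an assumption made for illustration, as are the Jaccard measure and all function names. The "map" and "reduce" steps are simulated with plain dictionaries rather than Hadoop.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def token_order(records):
    """Assumed stage 1: rank tokens globally, rarest first (ties by token)."""
    counts = Counter(tok for tokens in records.values() for tok in set(tokens))
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return {tok: rank for rank, (tok, _) in enumerate(ranked)}


def prefix(tokens, order, threshold):
    """Prefix of a record under the global order.

    For a Jaccard threshold t and a set of n tokens, two sets that satisfy
    the threshold must share a token within the first n - ceil(t * n) + 1
    tokens of each. The small epsilon guards against float rounding.
    """
    ordered = sorted(set(tokens), key=order.__getitem__)
    n = len(ordered)
    return ordered[: n - ceil(threshold * n - 1e-9) + 1]


def self_join(records, threshold):
    """Assumed stages 2 and 3: find similar record-id pairs, then attach records."""
    order = token_order(records)

    # "Map": route each record id to one group per prefix token.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for tok in prefix(tokens, order, threshold):
            groups[tok].append(rid)

    # "Reduce": verify candidate pairs inside each group, skipping repeats.
    pairs = {}
    for rids in groups.values():
        for r, s in combinations(sorted(rids), 2):
            if (r, s) not in pairs:
                sim = jaccard(records[r], records[s])
                if sim >= threshold:
                    pairs[(r, s)] = sim

    # Final step: join the id pairs back to the full records.
    return [((r, records[r]), (s, records[s]), sim)
            for (r, s), sim in sorted(pairs.items())]


if __name__ == "__main__":
    demo = {
        1: ["a", "b", "c", "d"],
        2: ["a", "b", "c", "e"],
        3: ["x", "y", "z"],
    }
    for left, right, sim in self_join(demo, 0.5):
        print(left, right, round(sim, 3))
```

Grouping by prefix token is what lets the work be spread across nodes: each group can be verified independently, at the cost of a record being replicated once per prefix token. Ordering tokens rarest-first keeps those groups small, which is one plausible way to balance load, though the abstract does not confirm this is the partitioning the authors use.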