This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages, in a way that is well matched to efficient disk access patterns across several machines. On a collection of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space with respect to the number of documents.
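The decomposition described above can be illustrated with a small in-memory sketch. This is not the paper's implementation: it only simulates the idea on a single machine, assuming each document is already a dictionary of term weights. The function names (`build_postings`, `map_products`, `reduce_sums`) and the toy data are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: map each term to a list of (doc_id, weight) postings."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: for each term, emit one partial product
    per pair of documents that both contain the term."""
    for plist in postings.values():
        # Sorting gives every pair a canonical (smaller_id, larger_id) key.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partial_products):
    """Summation stage: add up partial products that share a document pair."""
    sims = defaultdict(float)
    for pair, value in partial_products:
        sims[pair] += value
    return dict(sims)


def pairwise_similarity(docs):
    """Inner product for every document pair that shares at least one term."""
    return reduce_sums(map_products(build_postings(docs)))


# Toy collection (assumed term weights, purely illustrative).
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"a": 3.0, "c": 1.0},
    "d3": {"a": 2.0, "b": 1.0, "c": 4.0},
}
sims = pairwise_similarity(docs)
```

In this sketch, a pair of documents receives a contribution only from terms they share, so pairs with no common terms are never generated. In a real MapReduce job the grouping done here by `defaultdict` would be handled by the framework's shuffle between the map and reduce phases.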