Given a large collection of sparse vectors in a high-dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine similarity) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show that the approach efficiently handles a variety of datasets across a wide range of similarity thresholds, with large speedups over previous state-of-the-art approaches.
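To make the problem statement concrete, the following is a minimal sketch of an exact all-pairs search over sparse vectors using a plain inverted index. It illustrates the problem only and is not the algorithm proposed here: it applies none of the indexing or optimization strategies mentioned above. The names `normalize` and `all_pairs`, the dict-based vector representation, and the reading of "above a given threshold" as "greater than or equal to" are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm > 0 else {}


def all_pairs(vectors, threshold):
    """Return [(i, j, score)] with i < j and cosine similarity >= threshold.

    Exact (no approximation): each vector is scored against an inverted
    index of the vectors seen before it, then added to that index.
    Hypothetical baseline for illustration, not the paper's method.
    """
    index = defaultdict(list)  # dimension -> [(vector id, weight)]
    results = []
    for j, raw in enumerate(vectors):
        vec = normalize(raw)
        scores = defaultdict(float)  # earlier vector id -> dot product so far
        for dim, w in vec.items():
            for i, w_i in index[dim]:
                scores[i] += w * w_i
        # With unit-length vectors the dot product is the cosine similarity.
        results.extend(
            (i, j, s) for i, s in sorted(scores.items()) if s >= threshold
        )
        # Index the current vector only after scoring, so pairs are unique.
        for dim, w in vec.items():
            index[dim].append((j, w))
    return results


docs = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"c": 1.0, "d": 1.0},
]
pairs = all_pairs(docs, threshold=0.8)
```

With these three toy vectors and a threshold of 0.8, the only qualifying pair is (0, 1), whose cosine similarity is about 0.816. The sketch scores every pair of vectors that shares at least one dimension, which is the kind of cost that more careful indexing aims to reduce.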