Searching for documents that are similar to a query document is an important component of modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge to improve the hash codes. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for similarity search over high-dimensional data. The basic idea of S3H is to learn the optimal feature weights from prior knowledge and use them to relocate the data so that similar data have similar hash codes. We evaluate our method against several state-of-the-art methods on two large datasets. In all experiments, our method achieves the best performance among the compared methods.
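To make the role of feature weights concrete, the following is a minimal sketch of a weighted SimHash based on random hyperplanes. It is not the S3H learning procedure: the weights here are hand-picked for illustration, whereas S3H learns them from prior knowledge. The function names (`make_hyperplanes`, `weighted_simhash`, `hamming`), the toy vectors, and the 32-bit code length are all assumptions introduced for this example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per hash bit (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def weighted_simhash(x, weights, hyperplanes):
    """Scale each feature by its weight, then keep the sign of each projection."""
    scaled = [w * v for w, v in zip(weights, x)]
    bits = []
    for plane in hyperplanes:
        projection = sum(p * s for p, s in zip(plane, scaled))
        bits.append(1 if projection >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions where two hash codes differ."""
    return sum(u != v for u, v in zip(a, b))


planes = make_hyperplanes(num_bits=32, dim=4, seed=0)

# Hand-picked weights (an assumption): the last two features are treated as
# noise and suppressed. In S3H such weights would be learned, not chosen.
weights = [1.0, 1.0, 0.0, 0.0]

# Two toy documents that agree on the informative features only.
doc_a = [1.0, 2.0, 5.0, -3.0]
doc_b = [1.0, 2.0, -4.0, 7.0]

code_a = weighted_simhash(doc_a, weights, planes)
code_b = weighted_simhash(doc_b, weights, planes)
print(hamming(code_a, code_b))
```

Because the weights zero out the features on which the two toy documents disagree, both documents are relocated to the same point before hashing and receive identical codes. With uniform weights the same pair would generally differ in some bits, which is the effect that learned weights are meant to control.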