Semi-supervised SimHash for efficient document similarity search

Authors:
Qixia Jiang;Maosong Sun
Affiliations:
Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

Searching for documents that are similar to a query document is an important component of modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge to improve the hash functions. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method against several state-of-the-art methods on two large datasets. All the results show that our method achieves the best performance.
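
To make the role of feature weights concrete, here is a minimal sketch of a weighted sign-random-projection (SimHash-style) hash. It is not the S3H learning procedure, which the abstract does not spell out. The weights are set by hand purely for illustration, and every name here (make_hyperplanes, weighted_simhash, hamming) is hypothetical. It only shows how rescaling features before hashing changes which documents receive similar codes.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def weighted_simhash(x, weights, hyperplanes):
    """Scale each feature by its weight, then keep the sign of each projection."""
    scaled = [w * v for w, v in zip(weights, x)]
    bits = []
    for plane in hyperplanes:
        projection = sum(p * s for p, s in zip(plane, scaled))
        bits.append(1 if projection >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions at which two hash codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Two toy documents that differ only in the last feature (assumed to be noise).
doc_a = [1.0, 0.0, 2.0, 5.0]
doc_b = [1.0, 0.0, 2.0, 9.0]

planes = make_hyperplanes(num_bits=16, dim=4, seed=42)

# Hand-set weights (an assumption): a weight of zero removes the noisy feature.
uniform = [1.0, 1.0, 1.0, 1.0]
reweighted = [1.0, 1.0, 1.0, 0.0]

code_a = weighted_simhash(doc_a, reweighted, planes)
code_b = weighted_simhash(doc_b, reweighted, planes)
print(hamming(code_a, code_b))
print(hamming(weighted_simhash(doc_a, uniform, planes),
              weighted_simhash(doc_b, uniform, planes)))
```

With the noisy feature weighted to zero, the two documents map to the same point before hashing and therefore get identical codes. In a learned setting the weights would instead be fitted from labeled similar and dissimilar pairs.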