Semantic hashing using tags and topic modeling

Authors:
Qifan Wang; Dan Zhang; Luo Si
Affiliations:
Purdue University, West Lafayette, IN, USA; Facebook Incorporation, Menlo Park, CA, USA; Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Year:
2013

Abstract

Designing efficient and effective solutions for large-scale similarity search is an important research problem. One popular strategy is to represent data examples as compact binary codes through semantic hashing, which has produced promising results with fast search speed and low storage cost. Many existing semantic hashing methods generate binary codes for documents by modeling document relationships based on similarity in a keyword feature space. These methods have two major limitations: (1) tag information is associated with documents in many real-world applications but has not been fully exploited; (2) similarity in the keyword feature space does not fully reflect semantic relationships that go beyond keyword matching. This paper proposes a novel hashing approach, Semantic Hashing using Tags and Topic Modeling (SHTTM), that incorporates both tag information and similarity information from probabilistic topic modeling. In particular, a unified framework is designed that keeps the hashing codes consistent with tag information through a formal latent factor model while preserving document topic/semantic similarity that goes beyond keyword matching. An iterative coordinate descent procedure is proposed for learning the optimal hashing codes. An extensive set of empirical studies on four different datasets demonstrates the advantages of the proposed SHTTM approach over several other state-of-the-art semantic hashing techniques. Furthermore, experimental results indicate that modeling tag information and utilizing topic modeling each improve the effectiveness of hashing on their own, while combining the two techniques in the unified framework obtains even better results.
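
The search step that makes compact binary codes attractive can be illustrated with a minimal sketch. It is not the SHTTM method and does not show how codes are learned; it only shows, under assumed inputs, how documents already encoded as binary codes can be ranked by Hamming distance to a query code. The function names, the 8-bit code length, the document identifiers, and the code values are all hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def search(query: int, codes: dict, radius: int) -> list:
    """Return (doc_id, distance) pairs within `radius` bits of the query,
    nearest first, with ties broken by doc_id."""
    hits = [(doc_id, hamming_distance(query, code)) for doc_id, code in codes.items()]
    within = [hit for hit in hits if hit[1] <= radius]
    return sorted(within, key=lambda hit: (hit[1], hit[0]))


# Hypothetical 8-bit codes; a real system would learn these from data.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110110,
    "doc_c": 0b01001101,
}

results = search(0b10110011, codes, radius=2)
print(results)
```

Each comparison is one XOR followed by a bit count, and each document is stored as a single small integer, which is the source of the speed and storage benefits mentioned in the abstract.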