Query by document via a decomposition-based two-level retrieval approach

Authors:
Linkai Weng;Zhiwei Li;Rui Cai;Yaoxue Zhang;Yuezhi Zhou;Laurence T. Yang;Lei Zhang
Affiliations:
Tsinghua University, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;St. Francis Xavier University, Antigonish, Canada;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Citing 30
Cited 2

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Probabilistic latent semantic indexing

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Item-based collaborative filtering recommendation algorithms

Proceedings of the 10th international conference on World Wide Web
Model-based feedback in the language modeling approach to information retrieval

Proceedings of the tenth international conference on Information and knowledge management
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Modern Information Retrieval

Modern Information Retrieval
Novelty and redundancy detection in adaptive filtering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Learning Algorithms for Keyphrase Extraction

Information Retrieval
Web Search for a Planet: The Google Cluster Architecture

IEEE Micro
Models and Algorithms for Duplicate Document Detection

ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Latent dirichlet allocation

The Journal of Machine Learning Research
Efficient query evaluation using a two-level retrieval process

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Incorporating context information for the extraction of terms

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Cluster-based retrieval using language models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A language model approach to keyphrase extraction

MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Thesaurus based automatic keyphrase indexing

Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Approximately detecting duplicates for streaming data using stable bloom filters

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
LDA-based document models for ad-hoc retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
New Algorithms for Efficient High-Dimensional Nonparametric Classification

The Journal of Machine Learning Research
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Proceedings of the 17th international conference on World Wide Web
Query by document

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Nonnegative matrix factorization and probabilistic latent semantic indexing: equivalence, chi-square statistic, and a hybrid method

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Decomposing background topics from keywords by principal component pursuit

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Improving bag-of-visual-words model with spatial-temporal correlation for video retrieval

Proceedings of the 21st ACM international conference on Information and knowledge management
Entity-centric document filtering: boosting feature mapping through meta-features

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Retrieving similar documents from a large-scale text corpus according to a given document is a fundamental technique for many applications. However, most of existing indexing techniques have difficulties to address this problem due to special properties of a document query, e.g. high dimensionality, sparse representation and semantic concern. Towards addressing this problem, we propose a two-level retrieval solution based on a document decomposition idea. A document is decomposed to a compact vector and a few document specific keywords by a dimension reduction approach. The compact vector embodies the major semantics of a document, and the document specific keywords complement the discriminative power lost in dimension reduction process. We adopt locality sensitive hashing (LSH) to index the compact vectors, which guarantees to quickly find a set of related documents according to the vector of a query document. Then we re-rank documents in this set by their document specific keywords. In experiments, we obtained promising results on various datasets in terms of both accuracy and performance. We demonstrated that this solution is able to index large-scale corpus for efficient similarity-based document retrieval.