Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Foundations of statistical natural language processing
Foundations of statistical natural language processing
Probabilistic latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Item-based collaborative filtering recommendation algorithms
Proceedings of the 10th international conference on World Wide Web
Model-based feedback in the language modeling approach to information retrieval
Proceedings of the tenth international conference on Information and knowledge management
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Modern Information Retrieval
Novelty and redundancy detection in adaptive filtering
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Learning Algorithms for Keyphrase Extraction
Information Retrieval
Models and Algorithms for Duplicate Document Detection
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
The Journal of Machine Learning Research
Efficient query evaluation using a two-level retrieval process
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Incorporating context information for the extraction of terms
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Cluster-based retrieval using language models
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A language model approach to keyphrase extraction
MWE '03 Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment - Volume 18
Inverted files for text search engines
ACM Computing Surveys (CSUR)
Thesaurus based automatic keyphrase indexing
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Approximately detecting duplicates for streaming data using stable bloom filters
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
LDA-based document models for ad-hoc retrieval
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
New Algorithms for Efficient High-Dimensional Nonparametric Classification
The Journal of Machine Learning Research
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Proceedings of the 17th international conference on World Wide Web
Proceedings of the Second ACM International Conference on Web Search and Data Mining
AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Decomposing background topics from keywords by principal component pursuit
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Improving bag-of-visual-words model with spatial-temporal correlation for video retrieval
Proceedings of the 21st ACM international conference on Information and knowledge management
Entity-centric document filtering: boosting feature mapping through meta-features
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Hi-index | 0.00 |
Retrieving similar documents from a large-scale text corpus according to a given document is a fundamental technique for many applications. However, most of existing indexing techniques have difficulties to address this problem due to special properties of a document query, e.g. high dimensionality, sparse representation and semantic concern. Towards addressing this problem, we propose a two-level retrieval solution based on a document decomposition idea. A document is decomposed to a compact vector and a few document specific keywords by a dimension reduction approach. The compact vector embodies the major semantics of a document, and the document specific keywords complement the discriminative power lost in dimension reduction process. We adopt locality sensitive hashing (LSH) to index the compact vectors, which guarantees to quickly find a set of related documents according to the vector of a query document. Then we re-rank documents in this set by their document specific keywords. In experiments, we obtained promising results on various datasets in terms of both accuracy and performance. We demonstrated that this solution is able to index large-scale corpus for efficient similarity-based document retrieval.