Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
The use of MMR, diversity-based reranking for reordering documents and producing summaries
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
IR evaluation methods for retrieving highly relevant documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Novelty and redundancy detection in adaptive filtering
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Identifying and Filtering Near-Duplicate Documents
COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
Beyond independent relevance: methods and evaluation metrics for subtopic retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Verbs semantics and lexical selection
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Less is more: probabilistic models for retrieving fewer relevant documents
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Do not crawl in the dust: different urls with similar text
Proceedings of the 16th international conference on World Wide Web
Multiple-signal duplicate detection for search evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Novelty and diversity in information retrieval evaluation
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval
Introduction to Information Retrieval
Word sense disambiguation: A survey
ACM Computing Surveys (CSUR)
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Portfolio theory of information retrieval
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Automatic evaluation of text coherence: models and representations
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Redundancy, diversity and interdependent document relevance
ACM SIGIR Forum
Detecting duplicate web documents using clickthrough data
Proceedings of the fourth ACM international conference on Web search and data mining
Efficient diversity-aware search
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
IEEE Transactions on Information Theory
Hi-index | 0.00 |
It is well-known that the web contains many duplicate and near-duplicate documents. Despite the efforts that have been put towards equipping search engines with duplicate detection algorithms, still there are cases where the documents retrieved in response to web queries contain redundant information. In this paper, we are concerned with effectively identifying and reducing redundant information in search results. In particular, we describe how we automatically detect content that is lexically and/or semantically duplicated across search results and we introduce a novel algorithm that upon the detection of significant (i.e., above a given threshold) content duplication, it filters out redundant information. Information filtering takes place in two-steps depending on whether we are dealing with documents of (nearly) identical lexical content or with documents of lexically distinct but semantically equivalent content. In the first case, our algorithm retains in the result list the document that is the most relevant to the query intention and removes duplicates. In the second case, our algorithm merges into a single text, which we call SuperText, the documents of redundant information in a way that every document contributes diverse semantic content to the generated SuperText. Additionally, the algorithm re-ranks the remaining documents based on their contextual relevance to the query intention. The experimental evaluation of our approach demonstrates that it is very effective in identifying lexical and semantic information redundancy across search results. In addition, we have found that our algorithm manages to filter out successfully content duplication from the results list and the SuperTexts it generates for reducing information redundancy are syntactically and semantically coherent texts.