Reducing information redundancy in search results

Authors:
Yannis Plegas;Sofia Stamou
Affiliations:
University of Patras, Greece;Ionian University, Patras University, Greece
Venue:
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Year:
2013

Abstract

It is well known that the web contains many duplicate and near-duplicate documents. Despite the effort put into equipping search engines with duplicate detection algorithms, there are still cases where the documents retrieved in response to web queries contain redundant information. In this paper, we are concerned with effectively identifying and reducing redundant information in search results. In particular, we describe how we automatically detect content that is lexically and/or semantically duplicated across search results, and we introduce a novel algorithm that filters out redundant information once it detects significant (i.e., above a given threshold) content duplication. Information filtering takes place in two steps, depending on whether we are dealing with documents of (nearly) identical lexical content or with documents of lexically distinct but semantically equivalent content. In the first case, our algorithm retains in the result list the document that is most relevant to the query intention and removes its duplicates. In the second case, our algorithm merges the documents that carry redundant information into a single text, which we call SuperText, in such a way that every document contributes diverse semantic content to the generated SuperText. Additionally, the algorithm re-ranks the remaining documents based on their contextual relevance to the query intention. The experimental evaluation of our approach demonstrates that it is very effective in identifying lexical and semantic information redundancy across search results. In addition, we have found that our algorithm successfully filters duplicated content out of the result list and that the SuperTexts it generates to reduce information redundancy are syntactically and semantically coherent.
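To make the first filtering case concrete, the following is a minimal sketch of threshold-based removal of lexical near-duplicates from a result list. It is not the authors' algorithm: the abstract does not say how lexical overlap is measured, so this sketch substitutes word shingles compared with Jaccard similarity, a common choice for near-duplicate detection. The function names, the shingle size `k=3`, the threshold `0.8`, and the assumption that each result already carries a numeric relevance score are all hypothetical. The semantic case (building a SuperText) and the re-ranking step are not covered.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_lexical_duplicates(
    results: List[Tuple[str, float]],
    threshold: float = 0.8,  # assumed value; the paper's threshold is not given here
) -> List[Tuple[str, float]]:
    """Keep, for each group of near-duplicates, only the most relevant document.

    `results` is a list of (document_text, relevance_score) pairs. Documents
    are visited in decreasing relevance, and a document is dropped when its
    shingle overlap with an already kept document reaches the threshold.
    """
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    kept: List[Tuple[str, float, Set[Shingle]]] = []
    for text, score in ranked:
        sig = shingles(text)
        if all(jaccard(sig, kept_sig) < threshold for _, _, kept_sig in kept):
            kept.append((text, score, sig))
    return [(text, score) for text, score, _ in kept]


if __name__ == "__main__":
    demo = [
        ("the quick brown fox jumps over the lazy dog", 0.7),
        ("the quick brown fox jumps over the lazy dog", 0.9),
        ("search engines rank documents by relevance to the query", 0.5),
    ]
    for text, score in filter_lexical_duplicates(demo):
        print(score, text)
```

Visiting documents in decreasing relevance means the survivor of each duplicate group is automatically the most relevant one, which mirrors the behavior the abstract describes for lexically (nearly) identical documents. Pairwise comparison is quadratic in the number of results, which is acceptable for a single result page but would need hashing-based signatures at collection scale.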