Improving web information indexing and retrieval based on center block duplication detection

Authors:
Tyrone Cadenhead;Jinlin Chen;Terry Cook
Affiliations:
Department of Computer Science, University of Dallas, Texas, USA.;Department of Computer Science, Queens College, City University of New York, Flushing, NY 11367, USA.;Department of Computer Science, Graduate Centre, City University of New York, New York 10016, USA
Venue:
International Journal of Innovative Computing and Applications
Year:
2008

Abstract

Duplicated information on today's Web has a serious negative impact on Web search engines: it increases the size of the index and lowers the efficiency of Web information retrieval. Importantly, a large amount of Web content duplication occurs at the block level, in addition to the site and page levels, for a variety of reasons. Moreover, when searching the Web, the desired information is in most cases located in the center block of a relevant page. Based on these two observations, we propose an efficient block-level duplication detection algorithm based on resemblance transitivity, and we index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies can effectively reduce index size and index construction time without sacrificing the effectiveness of Web information retrieval.
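
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how block-level duplicate detection could exploit resemblance transitivity. It assumes that resemblance is measured as the Jaccard overlap of word shingles and that blocks have already been extracted from pages as plain text. The names `shingles`, `resemblance`, and `cluster_duplicates`, the shingle width `w`, and the `threshold` value are all hypothetical choices for illustration, not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, w: int = 3) -> Set[Shingle]:
    """Return the set of w-word shingles of a block (assumed representation)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short blocks become a single shingle; empty blocks have none.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard overlap of two shingle sets, in the range [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_duplicates(blocks: List[str],
                       threshold: float = 0.8,
                       w: int = 3) -> List[int]:
    """Map each block to the index of its cluster representative.

    Transitivity assumption: if a block resembles a representative, it is
    treated as a duplicate of every other block in that cluster, so each
    new block is compared only against representatives, not all blocks.
    """
    rep_shingles: Dict[int, Set[Shingle]] = {}
    assignment: List[int] = []
    for i, text in enumerate(blocks):
        s = shingles(text, w)
        for rep, rep_s in rep_shingles.items():
            if resemblance(s, rep_s) >= threshold:
                assignment.append(rep)
                break
        else:
            # No representative is similar enough: start a new cluster.
            rep_shingles[i] = s
            assignment.append(i)
    return assignment


blocks = [
    "breaking news the city council approved the new budget today",
    "breaking news the city council approved the new budget today",
    "copyright 2008 example corp all rights reserved",
    "Breaking news the city council approved the new budget today",
]
assignment = cluster_duplicates(blocks)
to_index = sorted(set(assignment))  # only representatives would be indexed
```

In this sketch only the blocks listed in `to_index` would be passed on to the indexer, which is where the reduction in index size would come from. Comparing against representatives alone keeps the number of comparisons proportional to the number of clusters rather than the number of blocks, at the cost of relying on transitivity holding approximately at the chosen threshold.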