Duplicated information on today's Web has a serious negative impact on Web search engines: it increases the size of the index and lowers the efficiency of Web information retrieval. One important fact is that, for various reasons, a large amount of Web content duplication occurs at the block level in addition to the site and page levels. Moreover, in most Web searches the desired information is located in the center block of a relevant page. Based on these two observations, we propose an efficient block-level duplication detection algorithm based on resemblance transitivity, and we index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies can effectively reduce index size and index construction time without sacrificing the effectiveness of Web information retrieval.
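The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way block-level duplicate detection with resemblance transitivity might look. It assumes that resemblance is the Jaccard overlap of word shingles and that transitivity means "if A resembles B and B resembles C, treat A and C as duplicates without comparing them". The names `shingles`, `resemblance`, `cluster_blocks`, and the parameters `k` and `threshold` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a block (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_blocks(blocks, threshold=0.5, k=3):
    """Group blocks into duplicate clusters; returns one cluster label per block.

    Transitivity is exploited through a union-find structure: two blocks that
    are already in the same cluster are never compared directly.
    """
    parent = list(range(len(blocks)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(b, k) for b in blocks]
    for i, j in combinations(range(len(blocks)), 2):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already linked transitively, skip the comparison
        if resemblance(sets[i], sets[j]) >= threshold:
            parent[rj] = ri  # merge the two clusters
    return [find(i) for i in range(len(blocks))]


if __name__ == "__main__":
    example_blocks = [
        "copyright 2005 example corp all rights reserved",
        "copyright 2005 example corp all rights reserved",
        "the quick brown fox jumps over the lazy dog",
    ]
    print(cluster_blocks(example_blocks))
```

Blocks that share a label would be stored and indexed once. The pairwise loop is quadratic and is kept only for clarity; a production system would likely use fingerprints or sampling to limit the candidate pairs.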