Undue influence: eliminating the impact of link plagiarism on web search rankings

Authors:
Baoning Wu;Brian D. Davison
Affiliations:
Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA
Venue:
Proceedings of the 2006 ACM symposium on Applied computing
Year:
2006

Citing 19
Cited 5

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Improved algorithms for topic distillation in a hyperlinked environment

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Trawling the Web for emerging cyber-communities

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
The stochastic approach for link-structure analysis (SALSA) and the TKC effect

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Efficient identification of Web communities

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
A comparison of techniques to find mirrored hosts on the WWW

Journal of the American Society for Information Science
Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction

Proceedings of the 10th international conference on World Wide Web
Improvement of HITS-based algorithms on web documents

Proceedings of the 11th international conference on World Wide Web
Learning to Probabilistically Identify Authoritative Documents

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Challenges in web search engines

ACM SIGIR Forum
The connectivity sonar: detecting site functionality by structural patterns

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Identifying link farm spam pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Combating web spam with trustrank

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Implementation and evaluation of a quality-based search engine

Proceedings of the seventeenth conference on Hypertext and hypermedia
Web site visibility evaluation

Journal of the American Society for Information Science and Technology
Combating link spam by noisy link analysis

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Adversarial Web Search

Foundations and Trends in Information Retrieval
Survey on web spam detection: principles and algorithms

ACM SIGKDD Explorations Newsletter

Quantified Score

Hi-index	0.00

Visualization

Abstract

Link farm spam and replicated pages can greatly deteriorate link-based ranking algorithms such as HITS. In order to identify and neutralize link farm spam and replicated pages, we look for sufficient material copied from one page to another. In particular, we focus on the use of "complete hyperlinks" to distinguish link targets by the anchor text used. We build and analyze the bipartite graph of documents and their complete hyperlinks to find pages that share anchor text and link targets. Link farms and replicated pages are identified in this process, permitting the influence of problematic links to be reduced in a weighted adjacency matrix. Experiments and user evaluations show significant improvement in the quality of results produced using HITS-like methods.