Cleaning search results using term distance features

Authors:
Josh Attenberg;Torsten Suel
Affiliations:
Polytechnic University, Brooklyn, NY;Polytechnic University, Brooklyn, NY
Venue:
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Year:
2008

Citing 13
Cited 6

The connectivity sonar: detecting site functionality by structural patterns

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
TextTiling: segmenting text into multi-paragraph subtopic passages

Computational Linguistics
Multi-paragraph segmentation of expository text

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Identifying link farm spam pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Link spam alliances

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
A picture of search

InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Link spam detection based on mass estimation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Web projections: learning from contextual subgraphs of the web

Proceedings of the 16th international conference on World Wide Web
Improving web spam classifiers using link structure

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Web spam detection via commercial intent analysis

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Combating web spam with trustrank

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Thwarting the nigritude ultramarine: learning to identify link spam

ECML'05 Proceedings of the 16th European conference on Machine Learning

Looking into the past to better classify web spam

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Social spam detection

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam classification: a few features worth more

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search

Foundations and Trends in Information Retrieval
Evaluating Arabic spam classifiers using link analysis

Proceedings of the 3rd International Conference on Information and Communication Systems
Fighting against web spam: a novel propagation method based on click-through data

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated to the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many web pages generated by utilizing techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.