Tackling content spamming with a term weighting scheme

Authors:
Saptaditya Maiti;Deba P. Mandal;Pabitra Mitra
Affiliations:
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India;Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India;Indian Institute of Technology, Kharagpur, India
Venue:
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Year:
2011

Abstract

This paper describes a term weighting scheme designed to counter the effect of web spam and content stuffing, such as keyword stuffing, hidden unrelated text, and meta tag stuffing. The scheme has three components: term frequency, inverse document frequency, and document weight. The first two are the conventional components of tf-idf schemes, but their functional forms differ from existing ones. The document weight includes a normalized form of Shannon's entropy of the document's term frequency distribution, which provides an estimate of the document's information content. Mainly because of the document weight, the scheme can reduce, to an extent, the relevance score of a maliciously manipulated document. The scheme is evaluated on artificially generated spam versions of the TIPSTER Text Research Collections and is found to be effective against keyword stuffing based content spamming.
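
The abstract does not give the functional form of the document weight, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes the normalization divides the Shannon entropy of the term frequency distribution by the logarithm of the number of distinct terms, which keeps the value in [0, 1]. The function name `normalized_entropy` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(tokens):
    """Shannon entropy of a document's term-frequency distribution,
    divided by log(number of distinct terms) so the result lies in [0, 1].

    Assumption: this particular normalization is illustrative only; the
    abstract does not state which normalization the paper uses.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    distinct = len(counts)
    if distinct < 2:
        # Empty or single-term documents carry no distributional information.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(distinct)


# Hypothetical documents: an ordinary one and a keyword-stuffed copy of it.
honest = "cheap flights to paris and rome with hotel deals".split()
stuffed = honest + ["cheap"] * 50

honest_weight = normalized_entropy(honest)
stuffed_weight = normalized_entropy(stuffed)
print(f"honest:  {honest_weight:.3f}")
print(f"stuffed: {stuffed_weight:.3f}")
```

Repeating one term skews the frequency distribution, so the stuffed copy receives a lower weight than the original. Whether the final relevance score also drops depends on the term frequency and inverse document frequency forms, which the abstract does not specify.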