Tackling content spamming with a term weighting scheme

  • Authors:
  • Saptaditya Maiti;Deba P. Mandal;Pabitra Mitra

  • Affiliations:
  • Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India;Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India;Indian Institute of Technology, Kharagpur, India

  • Venue:
  • Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
  • Year:
  • 2011

Abstract

This paper describes a term weighting scheme that counters the effect of web spam and content stuffing, such as keyword stuffing, hidden unrelated text, and meta tag stuffing. The scheme has three components: term frequency, inverse document frequency, and document weight. The first two are the conventional components of the tf-idf scheme, but their functional forms differ from existing ones. The document weight includes a normalized form of Shannon's entropy of the term frequency distribution, which provides an estimate of the information content of a document. Mainly because of the document weight, the scheme can reduce the relevance score of a maliciously manipulated document to some extent. The scheme is evaluated on artificially generated spam versions of the TIPSTER Text Research Collections and is found to be effective against keyword-stuffing-based content spamming.
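
The abstract does not give the functional form of the document weight, so the following is only a rough sketch of the general idea of a normalized Shannon entropy over a document's term frequencies. The function name `normalized_entropy`, the whitespace-free token lists, and the choice to normalize by the logarithm of the number of distinct terms are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(tokens):
    """Shannon entropy of the term-frequency distribution, scaled to [0, 1].

    Hypothetical helper: normalizing by log(number of distinct terms) is one
    common choice and may differ from the form used in the paper.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    distinct = len(counts)
    if distinct < 2:
        # An empty or single-term document has no spread to measure.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(distinct)


# A balanced document versus one stuffed with a repeated keyword.
balanced = ["cheap", "flights", "to", "kolkata"]
stuffed = ["cheap"] * 20 + ["flights", "to", "kolkata"]

balanced_weight = normalized_entropy(balanced)
stuffed_weight = normalized_entropy(stuffed)
```

Under these assumptions, repeating one keyword skews the term distribution and lowers the value, so a weight built on it would pull down the score of a keyword-stuffed document relative to a balanced one.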