A term weighting scheme is described here which is able to circumvent content-based web spam such as keyword stuffing, hidden unrelated text, and meta-tag stuffing. The scheme is composed of three components: term frequency, inverse document frequency, and document weight. The first two are the conventional components of the tf-idf scheme, but their functional forms differ from existing ones. The document weight incorporates a normalized form of Shannon's entropy of a document's term-frequency distribution, so it provides an estimate of the information content of the document. Mainly because of this document weight, the scheme is able to reduce the relevance score of a maliciously manipulated document. The performance of the scheme is verified on artificially generated spam versions of the TIPSTER Text Research Collections and is found to be effective against keyword-stuffing-based content spamming.
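The abstract does not give the exact functional forms, but the entropy-based document weight can be illustrated with a minimal sketch. The idea is that keyword stuffing skews a document's term-frequency distribution toward a few repeated terms, which lowers the entropy of that distribution relative to natural text; normalizing by the log of the document's vocabulary size yields a weight in [0, 1]. The function name `normalized_entropy` and the demo documents below are illustrative assumptions, not the paper's actual formulation.

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Shannon entropy of a document's term-frequency distribution,
    normalized by log(vocabulary size) so the result lies in [0, 1].

    Illustrative sketch only; the paper's actual document weight may
    combine this with other factors in a different functional form.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if len(counts) < 2:
        # A document dominated by a single term carries no
        # distributional information; give it the lowest weight.
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))

# Natural text spreads probability mass across many terms ...
natural = "the quick brown fox jumps over the lazy dog".split()
# ... while a keyword-stuffed version concentrates it on one term,
# pulling the normalized entropy (and hence the document weight) down.
stuffed = natural + ["viagra"] * 50
```

Multiplying the tf-idf relevance score by such a weight is what lets the scheme demote stuffed documents without retraining or link analysis: the penalty comes purely from the document's own term statistics.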