Content-based analysis to detect Arabic web spam

Authors:
Mohammed Al-Kabi; Heider Wahsheh; Izzat Alsmadi; Emad Al-Shawakfa; Abdullah Wahbeh; Ahmed Al-Hmoud
Venue:
Journal of Information Science
Year:
2012

Abstract

Search engines are important outlets for information query and retrieval. They must cope with the continual growth of information available on the web while giving users convenient access to it. With this volume of information, web spam is a challenge that grows steadily harder to eliminate. For several reasons, web spammers try to intrude into search results and inject artificially biased results in favour of their websites or pages. Spam pages are added to the internet daily, which makes it difficult for search engines to keep up with the fast-growing, dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive search engines and raise the rank of their pages. In this research, we investigated four classification algorithms (naïve Bayes, decision tree, SVM and K-NN) for detecting Arabic web spam pages based on content. The three groups of datasets used, with 1%, 15% and 50% spam content, were collected using a crawler customized for this study. Spam pages were classified manually. Tests and comparisons revealed that the decision tree was the best classifier for this purpose.
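
To make the idea of content-based spam classification concrete, the following is a minimal sketch, not the authors' pipeline. It assumes two hypothetical content features (the share of the most frequent word and the ratio of unique words), which are meant to reflect the keyword stuffing mentioned above, and it uses only K-NN, one of the four classifiers the abstract names. The toy English strings, the feature choice, and the names `content_features`, `knn_predict` and `TRAIN` are all illustrative assumptions; the study itself used crawled Arabic pages.

```python
import math
from collections import Counter


def content_features(text):
    """Return (share of the most frequent word, unique-word ratio).

    Both features are illustrative assumptions, not the paper's feature set.
    """
    words = text.lower().split()
    if not words:
        return (0.0, 0.0)
    counts = Counter(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    unique_ratio = len(counts) / len(words)
    return (top_share, unique_ratio)


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, manually labelled toy pages (the study labelled spam by hand).
LABELLED_PAGES = [
    ("cheap phones cheap phones cheap phones buy cheap phones", "spam"),
    ("win win win prize win win prize now", "spam"),
    ("free free download free download free", "spam"),
    ("the library opens at nine every morning", "non-spam"),
    ("our university offers courses in computer science", "non-spam"),
    ("the weather today is sunny with a light breeze", "non-spam"),
]

# Training set as (feature vector, label) pairs.
TRAIN = [(content_features(text), label) for text, label in LABELLED_PAGES]

stuffed = knn_predict(TRAIN, content_features("deals deals deals best deals deals"))
ordinary = knn_predict(TRAIN, content_features("the museum is closed on public holidays"))
print(stuffed, ordinary)
```

In this toy setting the keyword-stuffed query lands next to the spam examples because a single word dominates its text. A real comparison of the four classifiers would need the full feature set and datasets described in the paper.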