Cross-lingual web spam classification

  • Authors:
  • András Garzó;Bálint Daróczy;Tamás Kiss;Dávid Siklósi;András A. Benczúr

  • Affiliations:
  • Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI), Eötvös University, Budapest, Hungary (all five authors)

  • Venue:
  • Proceedings of the 22nd international conference on World Wide Web companion
  • Year:
  • 2013

Abstract

While Web spam training data exists in English, filtering a Web domain in a different language requires an expensive human labeling procedure. In this paper we give an overview of how existing content-based and link-based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and language-independent methods combine. In particular, we show that simple bag-of-words translation works very well, and that this procedure may also rely on mixed-language Web hosts, i.e., those that contain an English translation of part of the local-language text. Our experiments use the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies and of content-based and link-based features for both ClueWeb09 and the Portuguese data.
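
The idea of bag-of-words translation can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary, the term weights, the even split of counts among multiple translations, and the function names translate_bow and linear_score are all assumptions made for illustration. The sketch maps a Portuguese term-frequency bag into English vocabulary and scores it with a linear model whose weights are assumed to have been learned on English data.

```python
from collections import Counter


def translate_bow(term_freqs, dictionary):
    """Map a term-frequency bag into the target-language vocabulary.

    Terms missing from the dictionary are dropped. A term with several
    translations splits its count evenly among them (an assumption of
    this sketch, not something stated in the abstract).
    """
    translated = Counter()
    for term, count in term_freqs.items():
        targets = dictionary.get(term, [])
        for target in targets:
            translated[target] += count / len(targets)
    return translated


def linear_score(bag, weights, bias=0.0):
    """Score a bag of words with a linear model over term weights."""
    return bias + sum(weights.get(term, 0.0) * count for term, count in bag.items())


# Hypothetical toy data: a tiny Portuguese-to-English dictionary and
# made-up weights standing in for a model trained on English hosts.
pt_to_en = {
    "barato": ["cheap"],
    "comprar": ["buy", "purchase"],
    "notícias": ["news"],
}
english_weights = {"cheap": 1.0, "buy": 0.5, "purchase": 0.5, "news": -1.0}

# Term frequencies of a hypothetical Portuguese host.
host_bag = Counter({"barato": 3, "comprar": 2, "casa": 1})

translated = translate_bow(host_bag, pt_to_en)
score = linear_score(translated, english_weights)
print(dict(translated), score)
```

In this toy setup, the untranslatable term "casa" simply drops out of the translated bag. A real system would need a policy for out-of-dictionary terms and for ambiguous translations, which the abstract does not specify.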