Cross-lingual web spam classification

Authors:
András Garzó;Bálint Daróczy;Tamás Kiss;Dávid Siklósi;András A. Benczúr
Affiliations:
Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI), Eötvös University, Budapest, Hungary (all authors)
Venue:
Proceedings of the 22nd international conference on World Wide Web companion
Year:
2013

Abstract

While Web spam training data exists in English, filtering a Web domain in another language requires an expensive human labeling procedure. In this paper we review how existing content- and link-based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and language-independent methods can be combined. In particular, we show that simple bag-of-words translation works very well, and that this procedure can also rely on mixed-language Web hosts, i.e., those that contain an English translation of part of the local-language text. Our experiments use the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive. To foster further research, we provide labels and precomputed term frequencies and content- and link-based features for both ClueWeb09 and the Portuguese data.
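
The bag-of-words translation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary `PT_EN`, the weights `EN_WEIGHTS`, the `BIAS` value, and the function names are hypothetical toy values chosen for illustration, and the even split of a term's count among its translations is an assumption. The sketch maps Portuguese term frequencies onto English terms and scores the result with a linear model assumed to have been trained on English data.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (Portuguese -> English); illustrative only.
PT_EN = {
    "barato": ["cheap"],
    "comprar": ["buy", "purchase"],
    "universidade": ["university"],
    "pesquisa": ["research", "search"],
}

# Hypothetical weights of a linear spam model over English terms
# (positive values point toward spam); not taken from the paper.
EN_WEIGHTS = {
    "cheap": 1.5,
    "buy": 1.0,
    "purchase": 0.5,
    "university": -1.2,
    "research": -1.0,
}
BIAS = -0.5


def translate_bow(term_freqs, dictionary):
    """Map source-language term frequencies onto target-language terms.

    Each source term's count is split evenly among its dictionary
    translations (an assumption); terms without an entry are dropped.
    """
    translated = Counter()
    for term, freq in term_freqs.items():
        targets = dictionary.get(term, [])
        for target in targets:
            translated[target] += freq / len(targets)
    return translated


def spam_score(term_freqs, weights, bias):
    """Linear score over a bag of words; unknown terms contribute 0."""
    return bias + sum(weights.get(t, 0.0) * f for t, f in term_freqs.items())


def classify_host(tokens):
    """Translate a Portuguese token list and score it with the English model."""
    english_bow = translate_bow(Counter(tokens), PT_EN)
    return spam_score(english_bow, EN_WEIGHTS, BIAS)


if __name__ == "__main__":
    print(classify_host(["barato", "comprar", "comprar"]))
    print(classify_host(["universidade", "pesquisa"]))
```

In this toy setup a positive score leans toward spam and a negative score toward non-spam. A realistic pipeline would replace the hand-written dictionary and weights with a real bilingual lexicon and a classifier trained on labeled English hosts.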