Combining labeled and unlabeled data with co-training
COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Text Classification from Labeled and Unlabeled Documents using EM
Machine Learning - Special issue on information retrieval
Using web structure for classifying and describing web pages
Proceedings of the 11th international conference on World Wide Web
Challenges in web search engines
ACM SIGIR Forum
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
GI '05 Proceedings of Graphics Interface 2005
An EM Based Training Algorithm for Cross-Language Text Categorization
WI '05 Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence
Topical TrustRank: using topicality to combat web spam
Proceedings of the 15th international conference on World Wide Web
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
A reference collection for web spam
ACM SIGIR Forum
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Spam Filtering Using Statistical Data Compression Models
The Journal of Machine Learning Research
Know your neighbors: web spam detection using the web topology
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning)
Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning)
Combating web spam with trustrank
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Can chinese web pages be classified with english data source?
Proceedings of the 17th international conference on World Wide Web
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
AIRWeb '09, 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam challenge proposal for filtering in archives
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Co-training for cross-lingual sentiment classification
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Cross-language text classification using structural correspondence learning
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Cross language text classification by model translation and semi-supervised learning
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Using information from the target language to improve crosslingual text classification
IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Cross lingual text classification by mining multilingual topics from wikipedia
Proceedings of the fourth ACM international conference on Web search and data mining
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
Web spam classification: a few features worth more
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Efficient and effective spam filtering and re-ranking for large web datasets
Information Retrieval
A survey on web archiving initiatives
TPDL'11 Proceedings of the 15th international conference on Theory and practice of digital libraries: research and advanced technology for digital libraries
Content-based trust and bias classification via biclustering
Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
Hi-index | 0.00 |
While Web spam training data exists in English, we face an expensive human labeling procedure if we want to filter a Web domain in a different language. In this paper we overview how existing content and link based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and independent methods combine. In particular we show that simple bag-of-words translation works very well and in this procedure we may also rely on mixed language Web hosts, i.e. those that contain an English translation of part of the local language text. Our experiments are conducted on the ClueWeb09 corpus as the training English collection and a large Portuguese crawl of the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies, content and link based features for both ClueWeb09 and the Portuguese data.