Machine Learning
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
WWW '03 Proceedings of the 12th international conference on World Wide Web
Challenges in web search engines
ACM SIGIR Forum
Information-theoretic co-clustering
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Editorial: special issue on learning from imbalanced data sets
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
GI '05 Proceedings of Graphics Interface 2005
Topical TrustRank: using topicality to combat web spam
Proceedings of the 15th international conference on World Wide Web
A reference collection for web spam
ACM SIGIR Forum
Know your neighbors: web spam detection using the web topology
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Combating web spam with trustrank
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Opinion Mining and Sentiment Analysis
Foundations and Trends in Information Retrieval
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
AIRWeb '09, 5th International Workshop on Adversarial Information Retrieval on the Web
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
Web spam classification: a few features worth more
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adapted vocabularies for generic visual categorization
ECCV'06 Proceedings of the 9th European conference on Computer Vision - Volume Part IV
Cross-lingual web spam classification
Proceedings of the 22nd international conference on World Wide Web companion
Hi-index | 0.00 |
In this paper we improve trust, bias and factuality classification over Web data on the domain level. Unlike the majority of literature in this area that aims at extracting opinion and handling short text on the micro level, we aim to aid a researcher or an archivist in obtaining a large collection that, on the high level, originates from unbiased and trustworthy sources. Our method generates features as Jensen-Shannon distances from centers in a host-term biclustering. On top of the distance features, we apply kernel methods and also combine with baseline text classifiers. We test our method on the ECML/PKDD Discovery Challenge data set DC2010. Our method improves over the best achieved text classification NDCG results by over 3--10% for neutrality, bias and trustworthiness. The fact that the ECML/PKDD Discovery Challenge 2010 participants reached an AUC only slightly above 0.5 indicates the hardness of the task.