Statistical feature extraction for cross-language web content quality assessment

Authors:
Guang-Gang Geng;Xiao-Dong Li;Li-Ming Wang;Wei Wang;Shuo Shen
Affiliations:
China Internet Network Information Center / Computer Network Information Center, Chinese Academy of Sciences, Beijing, China (all authors)
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

Web content quality assessment is a typical static ranking problem. Statistical systems built on heuristic content features and TF-IDF features have proven effective for this task. However, these features are language dependent, which makes them unsuitable for cross-language ranking. In this paper, we fuse a series of language-independent features, including hostname features, domain registration features, two-layer hyperlink analysis features and third-party Web service features, to assess Web content quality. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets show that the assessment is effective.
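
To make the idea of a language-independent feature concrete, the following is a minimal sketch of hostname statistics that can be computed without reading any page text. The abstract does not list the individual hostname features, so the function name `hostname_features` and every feature it returns are illustrative assumptions, not the authors' feature set.

```python
from urllib.parse import urlsplit


def hostname_features(url: str) -> dict:
    """Compute simple statistics of a URL's hostname.

    Assumption: these particular features are hypothetical examples of
    language-independent signals; the paper's actual list is not given here.
    """
    # urlsplit only fills in .hostname when the URL carries a scheme.
    host = (urlsplit(url).hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    return {
        "length": len(host),                      # characters in the hostname
        "num_labels": len(labels),                # dot-separated parts
        "num_digits": sum(ch.isdigit() for ch in host),
        "num_hyphens": host.count("-"),
        "longest_label": max((len(label) for label in labels), default=0),
    }


features = hostname_features("http://www.example-site123.org/page")
```

Because none of these values depend on the words a page contains, a vector like this could be concatenated with other language-independent features and passed to any standard classifier or ranker.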