Statistical cross-language Web content quality assessment

  • Authors:
  • Guang-Gang Geng;Li-Ming Wang;Wei Wang;An-Lei Hu;Shuo Shen

  • Affiliations:
  • China Internet Network Information Center, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, PR China (all authors)

  • Venue:
  • Knowledge-Based Systems
  • Year:
  • 2012

Abstract

Cross-language Web content quality assessment plays an important role in many Web content processing applications. In previous research, statistical systems based on natural language processing, heuristic content, and term frequency-inverse document frequency (TF-IDF) features have proven effective for Web content quality assessment. However, these features are language-dependent and therefore unsuitable for cross-language ranking. This paper proposes a cross-language Web content quality assessment method. First, multi-modal language-independent features are extracted: character features, domain registration features, two-layer hyperlink analysis features, and third-party Web service features. All the extracted features are then fused. Feature selection is carried out on the fused features to obtain a new eigenspace. Finally, a cross-language Web content quality model is learned on this eigenspace. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets demonstrate that features at every scale have discriminative power, that different feature modalities are complementary to each other, and that feature selection is effective for statistical-learning-based cross-language Web content quality assessment.
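
The fuse-then-select steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which fusion scheme or feature selection criterion the authors use, so everything below is an assumption made for illustration: fusion is plain concatenation, selection ranks columns by absolute Pearson correlation with the label, and the function names and toy values are hypothetical rather than taken from the paper or its datasets.

```python
from statistics import mean, pstdev

def fuse(*modalities):
    """Early fusion (assumed): concatenate each page's vectors across modalities."""
    return [sum((list(row) for row in rows), []) for rows in zip(*modalities)]

def abs_correlation(column, labels):
    """Absolute Pearson correlation between one feature column and the labels."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    mx, my = mean(column), mean(labels)
    cov = mean((x - mx) * (y - my) for x, y in zip(column, labels))
    return abs(cov / (sx * sy))

def select_top_k(features, labels, k):
    """Indices of the k columns most correlated with the labels (assumed criterion)."""
    columns = list(zip(*features))
    scores = [abs_correlation(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def project(features, kept):
    """Keep only the selected columns, giving the reduced feature space."""
    return [[row[j] for j in kept] for row in features]

# Hypothetical toy data: four pages, two language-independent modalities.
char_feats = [[0.10, 5.0], [0.80, 5.0], [0.20, 5.0], [0.90, 5.0]]  # second column is constant
link_feats = [[3.0], [40.0], [5.0], [38.0]]
labels = [0, 1, 0, 1]  # 1 = high quality (made-up labels)

fused = fuse(char_feats, link_feats)
kept = select_top_k(fused, labels, k=2)
reduced = project(fused, kept)
```

In this toy run the constant column is dropped and the two informative columns are kept. A quality model of any kind could then be trained on the reduced rows. The abstract leaves the learner unspecified, so none is shown here.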