Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization

  • Authors:
  • Hua Wang;Heng Huang;Feiping Nie;Chris Ding

  • Affiliations:
  • University of Texas at Arlington, Arlington, TX, USA;University of Texas at Arlington, Arlington, TX, USA;University of Texas at Arlington, Arlington, TX, USA;University of Texas at Arlington, Arlington, TX, USA

  • Venue:
  • Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

The lack of sufficient labeled Web pages in many languages, especially for those uncommonly used ones, presents a great challenge to traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which is based on the following two important observations. First, we observe that Web pages for a same topic from different languages usually share some common semantic patterns, though in different representation forms. Second, we also observe that the associations between word clusters and Web page classes are a more reliable carrier than raw words to transfer knowledge across languages. With these recognitions, we attempt to transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to target languages, in which we want classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Due to the reinforcement between these two different knowledge transfer paths, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments using a real world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach that is consistent with our theoretical analyses.