Topic-bridged PLSA for cross-domain text classification

  • Authors:
  • Gui-Rong Xue;Wenyuan Dai;Qiang Yang;Yong Yu

  • Affiliations:
  • Shanghai Jiao-Tong University, Shanghai, China;Shanghai Jiao Tong University, Shanghai, China;Hong Kong University of Science and Technology, Hong Kong, Hong Kong;Shanghai Jiao Tong University, Shanghai, China

  • Venue:
  • Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

In many Web applications, such as blog classification and new-sgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification ap-proaches are not able to cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental eval-uation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.