Transferring knowledge from one domain to another is challenging for several reasons. Because both the conditional and the marginal distributions of the training and test data differ, a model trained in one domain usually achieves low accuracy when applied directly to a different domain. For many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, the two domains rarely contain exactly the same features, which introduces a large number of "missing values" when the data are viewed over the union of features from both domains. In other words, the marginal distributions at best partially overlap. At the same time, these problems are usually high dimensional, often involving several thousand features. This combination of high dimensionality and missing values makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by "filling up" the missing values of the disjoint features. It then searches for comparable sub-structures in the latent space obtained by mapping the expanded feature vectors, where both the marginal and conditional distributions are similar. From these latent-space sub-structures, the proposed approach finds common concepts that are transferable across domains with high probability. During prediction, each unlabeled instance is treated as a "query": the most related labeled instances from the out-domain are retrieved, and classification is performed by weighted voting over the retrieved out-domain examples.
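The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy data, the choice of cross-domain mean imputation as the "filling up" step, the latent dimensionality, and the inverse-distance voting weights are all assumptions made for the sketch.

```python
import numpy as np

# Hypothetical toy data: labeled out-domain and unlabeled in-domain instances
# placed in the union feature space, with np.nan marking features that do not
# occur in a given domain (the disjoint features).
rng = np.random.default_rng(0)
X_out = rng.random((20, 10)); X_out[:, 8:] = np.nan  # out-domain lacks last 2 features
y_out = rng.integers(0, 2, size=20)
X_in  = rng.random((5, 10));  X_in[:, :2] = np.nan   # in-domain lacks first 2 features

# Step 1: "fill up" missing values of the disjoint features so the marginal
# distributions become comparable (simple cross-domain mean imputation here;
# the paper's actual filling scheme may differ).
X_all = np.vstack([X_out, X_in])
col_mean = np.nanmean(X_all, axis=0)
X_all = np.where(np.isnan(X_all), col_mean, X_all)

# Step 2: map the expanded feature vectors into a low-dimensional latent
# space (latent semantic indexing via a truncated SVD).
U, s, Vt = np.linalg.svd(X_all - X_all.mean(axis=0), full_matrices=False)
k = 4
Z = U[:, :k] * s[:k]
Z_out, Z_in = Z[:20], Z[20:]

# Step 3: treat each unlabeled in-domain instance as a "query": retrieve its
# nearest labeled out-domain examples in latent space and classify by
# distance-weighted voting.
def predict(z, n_neighbors=3):
    d = np.linalg.norm(Z_out - z, axis=1)
    idx = np.argsort(d)[:n_neighbors]
    w = 1.0 / (d[idx] + 1e-8)  # closer neighbors get larger votes
    return int(np.bincount(y_out[idx], weights=w, minlength=2).argmax())

y_pred = np.array([predict(z) for z in Z_in])
```

Because retrieval happens in the shared latent space rather than the original sparse union space, the nearest-neighbor step compares instances where the two domains' distributions have been brought closer together.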
We formally show that importing feature values across domains, combined with latent semantic indexing, makes the distributions of two related domains easier to measure than in the original feature space, and that the nearest-neighbor method used to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download.