In this work we aim to bridge the gap faced by researchers who have few or no labeled examples from their desired target domain, yet have access to large amounts of labeled data from related but distinct source domains, with no obvious way to transfer knowledge from one to the other. Experimentally, we focus on extracting protein mentions from academic publications in biology, where the source domain data are abstracts labeled with protein mentions and the target domain data are wholly unlabeled captions. We mine the large number of such full-text articles freely available on the Internet to supplement the limited amount of annotated data. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we generate robust features that are insensitive to changes in the marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Finally, lacking labeled target test data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods against existing baselines.
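The abstract describes two main ingredients: features intended to be insensitive to distributional shift between document sections, and self-labeled high-confidence positive and negative predictions on the unlabeled target captions. The sketch below is a minimal, hypothetical illustration of the second ingredient only; the feature functions, confidence thresholds, and the use of a token-level logistic regression (rather than, say, a CRF sequence model) are assumptions made for brevity, not the paper's actual implementation.

```python
# Minimal sketch (NOT the authors' implementation) of bootstrapping with
# high-confidence target-domain predictions: train on labeled source
# abstracts, self-label unlabeled target captions, keep only very confident
# tokens, and retrain. Feature names and thresholds are illustrative.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def token_features(tokens, i):
    """Simple orthographic features, chosen to depend on token shape rather
    than section-specific vocabulary (a rough stand-in for the paper's
    domain-insensitive features)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def featurize(sentences):
    """Flatten a list of token lists into one feature dict per token."""
    return [token_features(toks, i) for toks in sentences for i in range(len(toks))]


def bootstrap(source_sents, source_labels, target_sents,
              pos_thresh=0.95, neg_thresh=0.05):
    """source_labels: per-token 0/1 labels (1 = protein token)."""
    vec = DictVectorizer()
    feats_src = featurize(source_sents)
    y_src = [lab for sent in source_labels for lab in sent]

    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(feats_src), y_src)

    # Self-label the unlabeled target captions, keeping only the
    # high-confidence positive and negative tokens as extra training data.
    feats_tgt = featurize(target_sents)
    probs = clf.predict_proba(vec.transform(feats_tgt))[:, 1]
    extra_feats, extra_labels = [], []
    for f, p in zip(feats_tgt, probs):
        if p >= pos_thresh:
            extra_feats.append(f)
            extra_labels.append(1)
        elif p <= neg_thresh:
            extra_feats.append(f)
            extra_labels.append(0)

    # Retrain on the union of source labels and self-labeled target tokens.
    X_all = vec.fit_transform(feats_src + extra_feats)
    clf = LogisticRegression(max_iter=1000).fit(X_all, y_src + extra_labels)
    return clf, vec
```

In practice the confidence thresholds control the usual precision/coverage trade-off of self-training: tighter thresholds add fewer but cleaner target-domain examples, looser ones risk reinforcing the extractor's own errors.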