A word clustering approach to domain adaptation: effective parsing of biomedical texts

Authors:
Marie Candito;Enrique Henestroza Anguiano;Djamé Seddah
Affiliations:
Alpage (Univ. Paris Diderot & INRIA), Paris, France;Alpage (Univ. Paris Diderot & INRIA), Paris, France;Alpage (Univ. Paris Diderot & INRIA), Paris, France, and Univ. Paris Sorbonne, Paris, France
Venue:
IWPT '11 Proceedings of the 12th International Conference on Parsing Technologies
Year:
2011

Abstract

We present a simple and effective way to perform out-of-domain statistical parsing by drastically reducing lexical data sparseness in a PCFG-LA architecture. We replace terminal symbols with unsupervised word clusters acquired from a large newspaper corpus augmented with biomedical target-domain data. The resulting clusters are effective in bridging the lexical gap between source-domain and target-domain vocabularies. Our experiments combine known self-training techniques with unsupervised word clustering and produce promising results, achieving an error reduction of 21% on a new evaluation set for biomedical text with manual bracketing annotations.
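The terminal-replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `replace_with_clusters`, the fallback label `C_UNK`, the lowercasing step, and the toy word-to-cluster table are all assumptions made for illustration. In practice the mapping would come from an unsupervised clustering run over the combined newspaper and biomedical corpora.

```python
from typing import Dict, List

# Hypothetical fallback label for words missing from the cluster table.
UNKNOWN_CLUSTER = "C_UNK"


def replace_with_clusters(tokens: List[str],
                          word_to_cluster: Dict[str, str]) -> List[str]:
    """Map each token to its cluster ID so a parser sees clusters, not words.

    Lowercasing before lookup is an assumption of this sketch.
    """
    return [word_to_cluster.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


# Toy, hand-written table (an assumption): newspaper and biomedical words
# that behave alike syntactically share a cluster ID.
toy_clusters = {
    "the": "C0",
    "company": "C1", "protein": "C1",
    "raises": "C2", "upregulates": "C2",
    "prices": "C3", "expression": "C3",
}

news_sentence = ["The", "company", "raises", "prices"]
bio_sentence = ["The", "protein", "upregulates", "expression"]

news_terminals = replace_with_clusters(news_sentence, toy_clusters)
bio_terminals = replace_with_clusters(bio_sentence, toy_clusters)

print(news_terminals)
print(bio_terminals)
```

In this toy setting both sentences map to the same cluster sequence, so a grammar trained on the newspaper sentence would meet no unseen terminals when parsing the biomedical one. That is the sense in which clusters can bridge the lexical gap between source and target vocabularies.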