Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus

Authors:
L. Bentivogli;E. Pianta
Affiliations:
ITC-irst, Via Sommarive 18, 38050 Povo, Trento, Italy; e-mail: bentivo@itc.it
Venue:
Natural Language Engineering
Year:
2005

Abstract

In this article we illustrate and evaluate an approach for creating high-quality linguistically annotated resources by exploiting aligned parallel corpora. The approach rests on the assumption that if a text in one language has been annotated and its translation has not, the annotations can be transferred from the source text to the target text using word alignment as a bridge. The transfer approach has been tested and extensively applied in the creation of the MultiSemCor corpus, an English/Italian parallel corpus built on the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and annotated with word senses from a shared sense inventory. A number of experiments have been carried out to evaluate the different steps of the methodology, and the results suggest that the transfer approach is a promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which is a crucial resource in its own right. Second, it allows existing (mostly English) annotated resources to be exploited to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
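
The transfer idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `project_annotations`, the sense labels (`dog#n#1`, `bark#v#1`), the toy sentence pair, and the representation of the word alignment as (source index, target index) pairs are all assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def project_annotations(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy sense labels from source tokens to their aligned target tokens.

    source_senses[i] is the sense label of source token i, or None if the
    token is not annotated. alignment holds (source_index, target_index)
    pairs. Target tokens with no aligned, annotated source token stay None.
    """
    target_senses: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        if sense is not None:
            target_senses[tgt_idx] = sense
    return target_senses


# Toy English/Italian pair with hypothetical sense labels.
source_tokens = ["the", "dog", "barks"]
source_senses = [None, "dog#n#1", "bark#v#1"]
target_tokens = ["il", "cane", "abbaia"]

# Hypothetical word alignment: the article is deliberately left unaligned.
alignment = [(1, 1), (2, 2)]

projected = project_annotations(source_senses, alignment, len(target_tokens))
for token, sense in zip(target_tokens, projected):
    print(token, sense)
```

In this sketch an unaligned or unannotated token simply receives no label, so the quality of the projected annotation depends directly on the quality of the word alignment, which is why the abstract stresses evaluating each step of the methodology.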