A bilingual corpus of novels aligned at paragraph level

Authors:
Alexander Gelbukh;Grigori Sidorov;José Ángel Vera-Félix
Affiliations:
Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico;Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico;Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

The paper presents a bilingual English-Spanish parallel corpus aligned at the paragraph level. The corpus consists of twelve large novels found on the Internet and converted into plain text, with formatting problems and errors corrected manually. We used a dictionary-based algorithm to align the corpus automatically, and we report an evaluation of the alignment results. Very few parallel fiction texts are available as resources, even though they present a non-trivial alignment problem of considerable size. Alignment approaches based on linguistic data are usually applied to texts from restricted domains, such as laws and manuals. It is not obvious that these methods carry over to fiction, because fiction contains many more cases of non-literal translation than restricted-domain texts do. We show that the dictionary-based method aligns fiction texts well, namely, with precision at the state-of-the-art level.
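
The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of what dictionary-based paragraph alignment and its precision evaluation might look like, not the authors' method. It assumes a toy English-Spanish dictionary, whitespace tokenization without any morphological analysis, and a monotone dynamic-programming alignment that allows only 1-1 pairs and skipped paragraphs. All names (TOY_DICT, similarity, align, precision) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy English -> Spanish dictionary (hypothetical, for illustration only).
TOY_DICT: Dict[str, Set[str]] = {
    "house": {"casa"},
    "dog": {"perro"},
    "night": {"noche"},
    "white": {"blanco", "blanca"},
}


def similarity(en_par: str, es_par: str, dictionary: Dict[str, Set[str]]) -> float:
    """Share of English words with a dictionary translation in the Spanish paragraph."""
    en_words = en_par.lower().split()
    es_words = set(es_par.lower().split())
    if not en_words or not es_words:
        return 0.0
    hits = sum(1 for w in en_words if dictionary.get(w, set()) & es_words)
    # Normalizing by the longer paragraph penalizes large length mismatches.
    return hits / max(len(en_words), len(es_words))


def align(en: List[str], es: List[str],
          dictionary: Dict[str, Set[str]]) -> List[Tuple[int, int]]:
    """Monotone alignment maximizing total similarity (1-1 pairs or skips)."""
    n, m = len(en), len(es)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            # Skips are listed first so that ties never create zero-score pairs.
            if i > 0:
                candidates.append((score[i - 1][j], "skip_en"))
            if j > 0:
                candidates.append((score[i][j - 1], "skip_es"))
            if i > 0 and j > 0:
                sim = similarity(en[i - 1], es[j - 1], dictionary)
                candidates.append((score[i - 1][j - 1] + sim, "match"))
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Walk back from the final cell to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_en":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def precision(proposed: List[Tuple[int, int]], gold: Set[Tuple[int, int]]) -> float:
    """Fraction of proposed paragraph pairs that appear in the gold alignment."""
    if not proposed:
        return 0.0
    return sum(1 for p in proposed if p in gold) / len(proposed)


english = ["the white house", "a dog at night"]
spanish = ["la casa blanca", "texto añadido", "un perro en la noche"]
pairs = align(english, spanish, TOY_DICT)
print(pairs, precision(pairs, {(0, 0), (1, 2)}))
```

In this toy example the second Spanish paragraph has no English counterpart, so the alignment skips it. A realistic system would also need lemmatization and many-to-one paragraph links, which the sketch leaves out.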