Fast and Accurate Sentence Alignment of Bilingual Corpora
AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
A systematic comparison of various statistical alignment models
Computational Linguistics
A program for aligning sentences in bilingual corpora
Computational Linguistics - Special issue on using large corpora: I
Empirical methods for compound splitting
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Moses: open source toolkit for statistical machine translation
ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
Preliminary study into query translation for patent retrieval
PaIR '10 Proceedings of the 3rd international workshop on Patent information retrieval
Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
KenLM: faster and smaller language model queries
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Structural and topical dimensions in multi-task patent translation
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Hi-index | 0.00 |
Statistical machine translation of patents requires large amounts of sentence-parallel data. Translations of patent text often exist for parts of the patent document, namely title, abstract and claims. However, there are no direct translations of the largest part of the document, the description or background of the invention. We document a twofold approach for extracting parallel data from all patent document sections from a large multilingual patent corpus. Since language and style differ depending on document section (title, abstract, description, claims) and patent topic (according to the International Patent Classification), we sort the processed data into subdomains in order to enable its use in domain-oriented translation, e.g. when applying multi-task learning. We investigate several similarity metrics and apply them to the domains of patent topic and patent document sections. Product of our research is a corpus of 23 million parallel German-English sentences extracted from the MAREC patent corpus and a descriptive analysis of its subdomains.