Aligning sentences in parallel corpora

Authors:
Peter F. Brown;Jennifer C. Lai;Robert L. Mercer
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY (all three authors)
Venue:
ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics
Year:
1991

Abstract

In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
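
The idea of aligning on token counts alone can be illustrated with a minimal sketch. It is not the authors' model. It assumes a dynamic-programming search over a small set of alignment units ("beads": 1-1, 1-0, 0-1, 2-1, 1-2) and a Gaussian-style penalty on the log ratio of token counts. The names `BEAD_PENALTY`, `SIGMA`, `length_cost` and `align`, and all numeric values, are hypothetical choices made for illustration, and anchor points are not modelled.

```python
import math

# Assumed penalties per bead type (source sentences, target sentences).
# These numbers are illustrative only; they are not taken from the paper.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}
SIGMA = 0.4  # assumed spread of log(target_tokens / source_tokens)


def length_cost(src_len, tgt_len):
    """Scaled squared log ratio of token counts; 0 for insertions/deletions."""
    if src_len == 0 or tgt_len == 0:
        return 0.0
    ratio = math.log(tgt_len / src_len)
    return 0.5 * (ratio / SIGMA) ** 2


def align(src_lens, tgt_lens):
    """Align two sequences of sentence lengths (token counts).

    Returns a list of beads ((i0, i1), (j0, j1)), where each pair is a
    half-open index range into src_lens and tgt_lens respectively.
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty + length_cost(sum(src_lens[i:ni]),
                                             sum(tgt_lens[j:nj]))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Backward pass: recover the cheapest sequence of beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    # Token counts for three source and four target sentences (made-up data).
    print(align([10, 20, 5], [11, 8, 13, 5]))
```

With the made-up lengths above, the second source sentence (20 tokens) is paired with the second and third target sentences (8 + 13 tokens), because that 1-2 bead costs less than any alternative that leaves a sentence unmatched. Since no words are compared, the work per sentence pair is a handful of arithmetic operations, which is consistent with the abstract's point that a length-only method is fast enough for very large collections.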