A program for aligning sentences in bilingual corpora

Authors:
William A. Gale; Kenneth W. Church
Affiliations:
AT&T Bell Laboratories; AT&T Bell Laboratories
Venue:
Computational Linguistics - Special issue on using large corpora: I
Year:
1993

Citing 6
Cited 113

A statistical approach to machine translation

Computational Linguistics
A stochastic parts program and noun phrase parser for unrestricted text

ANLC '88 Proceedings of the second conference on Applied natural language processing
Aligning sentences in parallel corpora

ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics
The BICORD system: combining lexical information from bilingual corpora and machine readable dictionaries

COLING '90 Proceedings of the 13th conference on Computational linguistics - Volume 3
A statistical approach to language translation

COLING '88 Proceedings of the 12th conference on Computational linguistics - Volume 1
Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables,

Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables,

Translating collocations for bilingual lexicons: a statistical approach

Computational Linguistics
Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora across Language Groups

Machine Translation
Alignment and Matching of Bilingual English–Chinese News Texts

Machine Translation
Semantic Inference for Anaphora Resolution: Toward a Framework in Machine Translation

Machine Translation
Review Article: Example-based Machine Translation

Machine Translation
Using Corpus-Based Approaches in a System for Multilingual Information Retrieval

Information Retrieval
Review of "Empirical methods for exploiting parallel texts" by I. Dan Melamed, Cambridge, MA: MIT Press, 2001

Computational Linguistics
World Wide Web - A Multilingual Language Resource

WI '01 Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development
Multilingual Information Retrieval Based on Document Alignment Techniques

ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries
The Challenge of Parallel Text Processing

TDS '00 Proceedings of the Third International Workshop on Text, Speech and Dialogue
A Multilingual Procedure for Dictionary-Based Sentence Alignment

AMTA '98 Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup
Fast and Accurate Sentence Alignment of Bilingual Corpora

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Multilingual Information Retrieval Based on Parallel Texts from the Web

CLEF '00 Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation
Cross-language information retrieval: experiments based on CLEF 2000 corpora

Information Processing and Management: an International Journal
Accurate methods for the statistics of surprise and coincidence

Computational Linguistics - Special issue on using large corpora: I
Adaptive multilingual sentence boundary disambiguation

Computational Linguistics
A class-based approach to word alignment

Computational Linguistics
The automatic translation of discourse structures

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Adaptive sentence boundary disambiguation

ANLC '94 Proceedings of the fourth conference on Applied natural language processing
High-performance bilingual text alignment using statistical and dictionary information

Natural Language Engineering
An experiment in hybrid dictionary and statistical sentence alignment

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
An IR approach for translating new words from nonparallel, comparable texts

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Flow network models for word alignment and terminology extraction from bilingual corpora

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
A pattern matching method for finding noun and proper noun translations from noisy parallel corpora

ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
High-performance bilingual text alignment using statistical and dictionary information

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Structural feature selection for English-Korean statistical machine translation

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Bilingual text, matching using bilingual dictionary and statistics

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Building an MT dictionary from parallel texts based on linguistic and statistical information

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
An IBM-PC environment for Chinese corpus analysis

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Derivation of underlying valency frames from a learner's dictionary

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
Aligning more words with high precision for small bilingual corpora

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Multi-level similar segment matching algorithm for translation memories and Example-based Machine Translation

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Translating unknown cross-lingual queries in digital libraries using a web-based approach

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Mixed language query disambiguation

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Automatic identification of word translations from unrelated English and German corpora

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Creating a multilingual collocation dictionary from large text corpora

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2
Linguistic variation and computation

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Translation Disambiguation in Mixed Language Queries

Machine Translation
A cheap and fast way to build useful translation lexicons

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
A robust cross-style bilingual sentences alignment model

COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
METER: MEasuring TExt Reuse

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Reliable measures for aligning Japanese-English news articles and sentences

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Evaluation challenges in large-scale document summarization

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Extracting significant words from corpora for ontology extraction

Proceedings of the 3rd international conference on Knowledge capture
Comparative study of monolingual and multilingual search models for use with Asian languages

ACM Transactions on Asian Language Information Processing (TALIP)
Word alignment baselines

HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Aligning and using an English-Inuktitut parallel corpus

HLT-NAACL-PARALLEL '03 Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
Construction and analysis of Japanese-English broadcast news corpus with named entity tags

MultiNER '03 Proceedings of the ACL 2003 workshop on Multilingual and mixed-language named entity recognition - Volume 15
Exploiting the Web as the multilingual corpus for unknown query translation

Journal of the American Society for Information Science and Technology
Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Information Processing and Management: an International Journal
Filtering or adapting: two strategies to exploit noisy parallel corpora for cross-language information retrieval

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Robust sub-sentential alignment of phrase-structure trees

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Sentence alignment using P-NNT and GMM

Computer Speech and Language
ATLAS: a new text alignment architecture

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Statistical machine translation

ACM Computing Surveys (CSUR)
Critical Edition of Sanskrit Texts

Sanskrit Computational Linguistics
Constructing Parallel Corpus from Movie Subtitles

ICCPOL '09 Proceedings of the 22nd International Conference on Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy
Translating medical terminologies through word alignment in parallel text corpora

Journal of Biomedical Informatics
On the use of comparable corpora to improve SMT performance

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Improved sentence alignment on parallel web pages using a stochastic tree alignment model

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Pseudo-aligned multilingual corpora

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Using normalized alignment scores to detect incorrectly aligned segments

Proceedings of the 2nd international workshop on Patent information retrieval
Nukti: English-Inuktitut word alignment system description

ParaText '05 Proceedings of the ACL Workshop on Building and Using Parallel Texts
Partitioning parallel documents using binary segmentation

StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
Exploiting comparable corpora with TER and TERp

BUCC '09 Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
Phrase Translation Extraction from Aligned Parallel Corpora Using Suffix Arrays and Related Structures

EPIA '09 Proceedings of the 14th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence
Bilingual concordancers and translation memories: a comparative evaluation

LRTWRT '04 Proceedings of the Second International Workshop on Language Resources for Translation Work, Research and Training
Aligning Portuguese and Chinese parallel texts using confidence bands

PRICAI'00 Proceedings of the 6th Pacific Rim international conference on Artificial intelligence
Selecting target word using contexonym comparison method

Proceedings of the 2007 conference on Human interface: Part I
Local context selection for aligning sentences in parallel corpora

CONTEXT'07 Proceedings of the 6th international and interdisciplinary conference on Modeling and using context
Context-based sentence alignment in parallel corpora

CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
BabelNet: building a very large multilingual semantic network

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
LetsMT! --Online Platform for Sharing Training Data and Building User Tailored Machine Translation

Proceedings of the 2010 conference on Human Language Technologies -- The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010
Consistency checking for Treebank alignment

LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop
Text-based English-Arabic sentence alignment

ICIC'06 Proceedings of the 2006 international conference on Intelligent computing: Part II
Evaluation of axiomatic approaches to crosslanguage retrieval

CLEF'09 Proceedings of the 10th cross-language evaluation forum conference on Multilingual information access evaluation: text retrieval experiments
Using parallel corpora for multilingual (multi-document) summarisation evaluation

CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
A survey of paraphrasing and textual entailment methods

Journal of Artificial Intelligence Research
Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Using SRX standard for sentence segmentation

LTC'09 Proceedings of the 4th conference on Human language technology: challenges for computer science and linguistics
Improvement of machine translation evaluation by simple linguistically motivated features

Journal of Computer Science and Technology - Special issue on natural language processing
ParaSense or how to use parallel corpora for word sense disambiguation

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Building a web-based parallel corpus and filtering out machine-translated text

BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
An evaluation and possible improvement path for current SMT behavior on ambiguous nouns

SSST-5 Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation
Graph-based bilingual sentence alignment from large scale web pages

NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Meta similarity

Applied Intelligence
Parallel sentence generation from comparable corpora for improved SMT

Machine Translation
Evaluation of alignment methods for HTML parallel text

FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
An unsupervised alignment algorithm for text simplification corpus construction

MTTG '11 Proceedings of the Workshop on Monolingual Text-To-Text Generation
Probabilistic neural network based English-Arabic sentence alignment

CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Automatic filtering of bilingual corpora for statistical machine translation

NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Combining sentence length with location information to align monolingual parallel texts

AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Weighted finite-state transducer inference for limited-domain speech-to-speech translation

PROPOR'06 Proceedings of the 7th international conference on Computational Processing of the Portuguese Language
Enabling users to create their own web-based machine translation engine

Proceedings of the 21st international conference companion on World Wide Web
Extracting parallel paragraphs and sentences from English-Persian translated documents

AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Aligning the un-alignable -- a pilot study using a noisy corpus of nonstandardized, semi-parallel texts

CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II
Book review: Bitext Alignment, Jörg Tiedemann (Uppsala University), Morgan & Claypool (Synthesis Lectures on Human Language Technologies, edited by Graeme Hirst, volume 14), 2011, 153 pp; paperbound, ISBN 978-1-60845-510-2, $45.00; e-book, ISBN 978-1-60815-511-9, $30.00 or by subscription

Computational Linguistics
Analyzing parallelism and domain similarities in the MAREC patent corpus

IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
Generalized biwords for bitext compression and translation spotting

Journal of Artificial Intelligence Research
Design of a hybrid high quality machine translation system

EACL 2012 Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
LetsMT!: a cloud-based platform for do-it-yourself machine translation

ACL '12 Proceedings of the ACL 2012 System Demonstrations
Machine translation for multilingual summary content evaluation

Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
Application of clause alignment for statistical machine translation

SSST-6 '12 Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network

Artificial Intelligence
Effective and efficient?: bilingual sentiment lexicon extraction using collocation alignment

Proceedings of the 21st ACM international conference on Information and knowledge management
A Fast and Accurate Method for Bilingual Opinion Lexicon Extraction

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
How many multiword expressions do people know?

ACM Transactions on Speech and Language Processing (TSLP) - Special issue on multiword expressions: From theory to practice and use, part 1
Manifold alignment preserving global geometry

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Identifying useful human correction feedback from an on-line machine translation service

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Generating storylines from sensor data

Pervasive and Mobile Computing

Abstract

Researchers in both machine translation (e.g., Brown et al. 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary proceedings), which are available in multiple languages (such as French and English). One useful step is to align the sentences, that is, to identify correspondences between sentences in one language and sentences in the other language.

This paper will describe a method and a program (align) for aligning sentences based on a simple statistical model of character lengths. The program uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of lengths of the two sentences (in characters) and the variance of this difference. This probabilistic score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences.

It is remarkable that such a simple approach works as well as it does. An evaluation was performed based on a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German. The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus that has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were more errors on the English-French subcorpus than on the English-German subcorpus, showing that error rates will depend on the corpus considered; however, both error rates were small enough to give hope that the method will be useful for many language pairs.

To further research on bilingual corpora, a much larger sample of Canadian Hansards (approximately 90 million words, half in English and half in French) has been aligned with the align program and will be available through the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). In addition, in order to facilitate replication of the align program, an appendix is provided with detailed C code of the more difficult core of the align program.
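
The length-based score and the dynamic programming search described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the C code from the paper's appendix. The function names (`length_cost`, `align`) are hypothetical, and the numeric parameters (`C`, `S2`, and the prior probabilities in `PRIORS`) are illustrative assumptions that the abstract itself does not state.

```python
import math

# Illustrative parameters (assumptions; none of these appear in the abstract).
C = 1.0    # assumed expected target characters per source character
S2 = 6.8   # assumed variance of the length difference per character
# Assumed prior probability of each alignment type (source count, target count).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def length_cost(len_src, len_tgt):
    """Negative log-probability of a length mismatch at least this large."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    # Scaled difference of the two lengths, in standard deviations.
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal deviate.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))


def align(src_lens, tgt_lens):
    """Return the minimum-cost alignment as (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                total = (cost[i - di][j - dj] - math.log(prior)
                         + length_cost(sum(src_lens[i - di:i]),
                                       sum(tgt_lens[j - dj:j])))
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = (di, dj)
    # Trace the best path back from the end of both texts.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


# Sentence lengths in characters; three similar-length pairs align one to one.
print(align([20, 45, 30], [22, 48, 31]))
# [([0], [0]), ([1], [1]), ([2], [2])]
```

Minimizing the summed negative log-probabilities is equivalent to maximizing the likelihood of the alignment under this model. The per-pair costs could also be used to rank alignments and keep only the best-scoring portion, in the spirit of the 80% selection mentioned above.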