Bitext maps and alignment via pattern recognition

Authors:
I. Dan Melamed
Affiliations:
West Group
Venue:
Computational Linguistics
Year:
1999

Abstract

Texts that are available in two languages (bitexts) are becoming more and more plentiful, both in private data warehouses and on publicly accessible sites on the World Wide Web. As with other kinds of data, the value of bitexts largely depends on the efficacy of the available data mining tools. The first step in extracting useful information from bitexts is to find corresponding words and/or text segment boundaries in their two halves (bitext maps).

This article advances the state of the art of bitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Korean/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here.

SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
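The three tasks named in the abstract (signal generation, noise filtering, and search) can be illustrated with a toy sketch. This is not SIMR itself: the exact-match predicate, the diagonal-distance filter, the `max_dev` threshold, and all function names are assumptions chosen only to show how the three stages could fit together on token positions.

```python
def generate_points(src_tokens, tgt_tokens, match):
    """Signal generation: every (i, j) whose tokens satisfy the matching predicate."""
    return [(i, j)
            for i, s in enumerate(src_tokens)
            for j, t in enumerate(tgt_tokens)
            if match(s, t)]


def filter_noise(points, slope, max_dev):
    """Noise filtering (assumed heuristic): drop points far from the line j = slope * i."""
    return [(i, j) for i, j in points if abs(j - slope * i) <= max_dev]


def longest_chain(points):
    """Search: longest chain that strictly increases in both coordinates,
    so no source or target position is used twice (an injective, monotone map)."""
    pts = sorted(set(points))
    if not pts:
        return []
    best = [1] * len(pts)   # best[b] = length of the longest chain ending at pts[b]
    prev = [-1] * len(pts)  # back-pointers for reconstructing the chain
    for b in range(len(pts)):
        for a in range(b):
            if (pts[a][0] < pts[b][0] and pts[a][1] < pts[b][1]
                    and best[a] + 1 > best[b]):
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(pts)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(pts[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical toy bitext; identical tokens of length >= 3 stand in for cognates.
src = ["SIMR", "maps", "bitexts", ",", "GSA", "aligns", "them", "after", "SIMR"]
tgt = ["SIMR", "cartographie", "les", "bitextes", ",", "GSA", "les", "aligne", "après", "SIMR"]

candidates = generate_points(src, tgt, lambda s, t: s == t and len(s) >= 3)
kept = filter_noise(candidates, slope=len(tgt) / len(src), max_dev=3)
chain = longest_chain(kept)
print(chain)  # [(0, 0), (4, 5), (8, 9)]
```

In this toy run the repeated token "SIMR" produces two spurious candidates, (0, 9) and (8, 0), which the diagonal filter discards before the search stage selects the monotone chain.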