The Web as a parallel corpus

Authors:
Philip Resnik;Noah A. Smith
Affiliations:
Department of Linguistics and Institute for Advanced Computer Studies, University of Maryland, College Park, MD;Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD
Venue:
Computational Linguistics - Special issue on web as corpus
Year:
2003

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.