Mining the Web for bilingual text

Authors:
Philip Resnik
Affiliations:
University of Maryland, College Park, MD
Venue:
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Year:
1999

Abstract

STRAND (Resnik, 1998) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end-product is an automatically acquired parallel corpus comprising 2491 English-French document pairs, approximately 1.5 million words per language.