Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Extracting Patterns and Relations from the World Wide Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Computational Linguistics - Special issue on web as corpus
The mathematics of statistical machine translation: parameter estimation
Computational Linguistics - Special issue on using large corpora: II
Anchor text mining for translation of Web queries: A transitive translation approach
ACM Transactions on Information Systems (TOIS)
Word identification for Mandarin Chinese sentences
COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 1
Automatic acquisition of hyponyms from large text corpora
COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Translating unknown queries with web corpora for cross-language information retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Using the web for automated translation extraction in cross-language information retrieval
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora
Computational Linguistics
A DOM tree alignment model for mining parallel data from the web
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Mining new word translations from comparable corpora
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Mining key phrase translations from web corpora
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Learning source-target surface patterns for web-based terminology translation
ACLdemo '05 Proceedings of the ACL 2005 on Interactive poster and demonstration sessions
Named entity translation with web mining and transliteration
IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Mining name translations from entity graph mapping
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
An empirical study on web mining of parallel data
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
A novel method for bilingual web page acquisition from search engine web records
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
SDDB: a self-dependent and data-based method for constructing bilingual dictionary from the web
APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Engkoo: mining the web for language learning
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations
Graph-based bilingual sentence alignment from large scale web pages
NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Mining entity translations from comparable corpora: a holistic graph mapping approach
Proceedings of the 20th ACM international conference on Information and knowledge management
Mining parenthetical translations for polish-english lexica
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Efficient Entity Translation Mining: A Parallelized Graph Alignment Approach
ACM Transactions on Information Systems (TOIS)
Hi-index | 0.00 |
Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, given a web page, the method contains four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word based alignment model which takes both translation and transliteration into consideration; 3) pattern learning: learn generalized patterns with the identified seeds; 4) pattern based mining: extract all bilingual data in the page using the learned patterns. Our experiments on Chinese web pages produced more than 7.5 million pairs of bilingual sentences and more than 5 million pairs of bilingual terms, both with over 80% accuracy.