SDDB: a self-dependent and data-based method for constructing bilingual dictionary from the web

Authors:
Jun Han;Lizhu Zhou;Juan Liu
Affiliations:
Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China
Venue:
APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Year:
2011

Citing 15
Cited 0

Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Efficient algorithms for mining outliers from large data sets

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Using Bilingual Web Data to Mine and Rank Translations

IEEE Intelligent Systems
Using the web to obtain frequencies for unseen bigrams

Computational Linguistics - Special issue on web as corpus
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
Using the web for automated translation extraction in cross-language information retrieval

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Core Vector Machines: Fast SVM Training on Very Large Data Sets

The Journal of Machine Learning Research
Scaling to very very large corpora for natural language disambiguation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Web-based models for natural language processing

ACM Transactions on Speech and Language Processing (TSLP)
Mining key phrase translations from web corpora

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Named entity translation with web mining and transliteration

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Web-Based Transliteration of Person Names

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Mining bilingual data from the web with adaptively learnt patterns

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Named entity transcription with pair n-gram models

NEWS '09 Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration

Quantified Score

Hi-index	0.00

Visualization

Abstract

As various data on the World Wide Web are becoming massively available, more and more traditional algorithm centric problems turn to find their solutions in a data centric way. In this paper, we present such a typical example - a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Being different from many existing methods that focus on finding effective algorithms in sentence segmentation and word alignment through machine learning etc, SDDB strongly relies on the data of bilingual web pages from Chinese Web that are big enough to cover the terms for building dictionaries. The algorithms of SDDB are based on statistics of bilingual entries that are easy to collect from the parenthetical sentences from the Web. They are simply linear to the number of sentences and hence are scalable. In addition, rather than depending on pre-existing corpus to build bilingual dictionaries, which is commonly adopted in many existing methods, SDDB constructs the corpus from the Web by itself. This characterizes SDDB as an automatic method covering the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary with over 4 million Chinese-English entries and over 6 million English-Chinese entries built by SDDB shows a competitive performance to a popular commercial products on the Web.