As data on the World Wide Web become massively available, more and more traditionally algorithm-centric problems are being solved in a data-centric way. In this paper, we present a typical example: a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Unlike many existing methods, which focus on finding effective algorithms for sentence segmentation and word alignment through machine learning and similar techniques, SDDB relies heavily on bilingual web pages from the Chinese Web, which are large enough to cover the terms needed for building dictionaries. The algorithms of SDDB are based on statistics of bilingual entries that are easy to collect from parenthetical sentences on the Web. They are linear in the number of sentences and hence scalable. In addition, rather than depending on a pre-existing corpus to build bilingual dictionaries, as many existing methods do, SDDB constructs the corpus from the Web by itself. This makes SDDB an automatic method that covers the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary built by SDDB, with over 4 million Chinese-English entries and over 6 million English-Chinese entries, shows performance competitive with a popular commercial product on the Web.
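To make the idea of collecting statistics from parenthetical sentences more concrete, here is a minimal sketch in Python. It is not the SDDB algorithm itself, whose details the abstract does not give. The function names (`count_candidates`, `pick_translations`), the regular expression, the `max_len` and `min_ratio` parameters, and the "longest well-supported suffix" heuristic are all assumptions made for illustration. The sketch makes a single pass over the sentences, so the counting step is linear in the number of sentences for a fixed `max_len`.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern: a run of Chinese characters immediately followed by an
# English phrase in ASCII or full-width parentheses, e.g. "机器学习(machine learning)".
PAREN = re.compile(r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'\-]*?)\s*[)）]")


def count_candidates(sentences, max_len=8):
    """One pass over the sentences: count every suffix (up to max_len
    characters) of the Chinese text that precedes an English parenthetical."""
    totals = Counter()                  # occurrences of each English term
    pair_counts = defaultdict(Counter)  # English term -> Chinese suffix -> count
    for sentence in sentences:
        for zh_run, en in PAREN.findall(sentence):
            en = en.lower()
            totals[en] += 1
            for k in range(1, min(max_len, len(zh_run)) + 1):
                pair_counts[en][zh_run[-k:]] += 1
    return totals, pair_counts


def pick_translations(totals, pair_counts, min_ratio=0.6):
    """Assumed heuristic (not from the paper): keep the longest suffix that
    appears in at least min_ratio of the term's occurrences."""
    dictionary = {}
    for en, total in totals.items():
        supported = [zh for zh, c in pair_counts[en].items() if c >= min_ratio * total]
        if supported:
            dictionary[en] = max(supported, key=len)
    return dictionary


# Toy input; real input would be parenthetical sentences harvested from web pages.
sample = [
    "我喜欢机器学习(machine learning)",
    "他研究机器学习(machine learning)",
    "机器学习(machine learning)很有趣",
]
sample_totals, sample_pairs = count_candidates(sample)
print(pick_translations(sample_totals, sample_pairs))
```

The heuristic only works when a term occurs in several different contexts: with a single occurrence every suffix has full support, so the longest one is chosen regardless of where the real term boundary lies. That dependence on volume is consistent with the abstract's point that the method leans on the size of the web data rather than on a sophisticated segmentation algorithm.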