SDDB: a self-dependent and data-based method for constructing bilingual dictionary from the web

  • Authors:
  • Jun Han;Lizhu Zhou;Juan Liu

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua University, Beijing, China (all three authors)

  • Venue:
  • APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
  • Year:
  • 2011

Abstract

As data of many kinds become massively available on the World Wide Web, more and more problems traditionally treated as algorithm-centric are being solved in a data-centric way. In this paper, we present a typical example: a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Unlike many existing methods, which focus on finding effective algorithms for sentence segmentation and word alignment through machine learning and similar techniques, SDDB relies heavily on the data in bilingual web pages from the Chinese Web, which are large enough to cover the terms needed to build dictionaries. The algorithms of SDDB are based on statistics of bilingual entries that are easy to collect from parenthetical sentences on the Web. Their running time is linear in the number of sentences, and hence they are scalable. In addition, rather than depending on a pre-existing corpus to build bilingual dictionaries, as many existing methods do, SDDB constructs its corpus from the Web by itself. This makes SDDB an automatic method that covers the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary built by SDDB, with over 4 million Chinese-English entries and over 6 million English-Chinese entries, performs competitively with a popular commercial product on the Web.
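
The abstract does not spell out SDDB's statistics, so the following is only a minimal sketch of the general idea of mining parenthetical sentences, not the authors' algorithm. It assumes that a Chinese term is immediately followed by its English translation in parentheses, that the left boundary of the Chinese term is unknown, and that the longest Chinese suffix sharing the highest co-occurrence count is a reasonable guess at the term. The regular expression, the function names (`candidate_pairs`, `count_candidates`, `best_translation`) and the `max_len` cutoff are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumption: a run of Chinese characters directly followed by an English
# phrase in ASCII or full-width parentheses, e.g. "机器学习(machine learning)".
PAREN = re.compile(r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]")


def candidate_pairs(sentence, max_len=6):
    """Yield (chinese_suffix, english) candidates from one sentence.

    The true left boundary of the Chinese term is unknown, so every suffix
    of the preceding Chinese run (up to max_len characters) is a candidate.
    """
    for match in PAREN.finditer(sentence):
        chinese, english = match.group(1), match.group(2).lower()
        for n in range(1, min(max_len, len(chinese)) + 1):
            yield chinese[-n:], english


def count_candidates(sentences):
    """Count candidate pairs in a single pass over the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(candidate_pairs(sentence))
    return counts


def best_translation(counts, english):
    """Pick the longest Chinese candidate among those with the top count.

    This tie-breaking rule is an illustrative heuristic, not SDDB's rule.
    """
    scored = [(c, len(zh), zh) for (zh, en), c in counts.items() if en == english]
    return max(scored)[2] if scored else None


if __name__ == "__main__":
    sentences = [
        "我喜欢机器学习(machine learning)",
        "研究机器学习(machine learning)",
        "机器学习（Machine Learning）很重要",
    ]
    counts = count_candidates(sentences)
    print(best_translation(counts, "machine learning"))
```

Because each sentence contributes a bounded number of candidates (at most `max_len` per parenthetical), the counting pass is linear in the number of sentences, which matches the scalability property the abstract claims. A real system would need far more data and a more careful ranking than this frequency heuristic.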