A DOM tree alignment model for mining parallel data from the web

  • Authors:
  • Lei Shi (Microsoft Research Asia, Beijing, P. R. China)
  • Cheng Niu (Microsoft Research Asia, Beijing, P. R. China)
  • Ming Zhou (Microsoft Research Asia, Beijing, P. R. China)
  • Jianfeng Gao (Microsoft Research, Redmond, WA)

  • Venue:
  • ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
  • Year:
  • 2006

Abstract

This paper presents a new web mining scheme for parallel data acquisition. Each web page is represented as a tree based on the Document Object Model (DOM). A DOM tree alignment model is then proposed to identify translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are mined recursively. Benchmarks show that, compared with previous mining schemes, the new scheme improves mining coverage, reduces mining bandwidth, and enhances the quality of the mined parallel sentences.
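
To make the idea of pairing hyperlinks across two DOM trees concrete, here is a minimal Python sketch. It is not the alignment model from the paper, which the abstract does not specify. As a simple stand-in, it pairs hyperlinks that sit at the same structural path (tag name and child index) in both trees. The names `Node`, `TreeBuilder`, `links_by_path`, and `align_links`, and the sample pages, are all hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A minimal DOM-like node: tag name, attributes, children, and text."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML (assumes reasonably well-formed markup)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def links_by_path(node, path=()):
    """Map each hyperlink's structural path to its href, in document order."""
    found = {}
    for index, child in enumerate(node.children):
        child_path = path + ((child.tag, index),)
        if child.tag == "a" and "href" in child.attrs:
            found[child_path] = child.attrs["href"]
        found.update(links_by_path(child, child_path))
    return found


def align_links(html_a, html_b):
    """Pair hyperlinks that occupy the same structural path in both pages.

    Assumption: identical paths stand in for the paper's alignment model.
    """
    links_a = links_by_path(parse(html_a))
    links_b = links_by_path(parse(html_b))
    return [(links_a[p], links_b[p]) for p in links_a if p in links_b]


# Hypothetical English and Chinese versions of the same navigation list.
english = ('<ul><li><a href="/en/news.html">News</a></li>'
           '<li><a href="/en/about.html">About</a></li></ul>')
chinese = ('<ul><li><a href="/zh/news.html">新闻</a></li>'
           '<li><a href="/zh/about.html">关于</a></li></ul>')
pairs = align_links(english, chinese)
print(pairs)
```

Each returned pair is a candidate pair of parallel pages. In a mining loop of the kind the abstract describes, such pairs could be fetched and aligned in turn. Exact path matching breaks as soon as the two pages differ in structure, which is presumably the kind of divergence a proper alignment model is meant to handle.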