An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL Pairing Patterns

  • Authors:
  • Chunyu Kit;Jessica Yee Ha Ng

  • Affiliations:
  • -;-

  • Venue:
  • WI-IATW '07 Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper describes an intelligent agent to facilitate bitext mining from the Web via automatic discovery of URL pairing patterns (or keys) for retrieving parallel web pages. The linking power of a key, defined as the number of URL pairs that it can match, is used as the objective function for the search for the best set of keys that can find the greatest number of web page pairs within a bilingual website. Our experiments show that, with no prior knowledge such as ad hoc heuristics, no labelled data for training and no similarity analysis of Web page structure and content that are commonly involved in the existing approaches, a best-first search to approximate this optimization with an empirical threshold can recognize 98.1% true parallel web pages and discover many irregular pairing patterns that are unlikely to be discovered by other approaches.