Word segmentation and recognition for web document framework

  • Authors:
  • Chi-Hung Chi;Chen Ding;Andrew Lim

  • Affiliations:
  • School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260;School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260;School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260

  • Venue:
  • Proceedings of the eighth international conference on Information and knowledge management
  • Year:
  • 1999

Abstract

A better approach to understanding Web information is to base the analysis on the document framework of a page, which consists mainly of (i) the title and the URL name of the page, (ii) the titles and the URL names of the Web pages that it points to, (iii) the alternative information source for the embedded Web objects, and (iv) its linkage to other Web pages of the same document. Investigation reveals that a high percentage of the words inside the document framework are “compound words” that cannot be found in ordinary dictionaries. They may be abbreviations, acronyms, or concatenations of several (partial) words. To recover the content hierarchy of Web documents, we propose a new word segmentation and recognition mechanism for understanding the information derived from the Web document framework. A maximal bi-directional matching algorithm with heuristic rules is used to resolve ambiguous segmentation and meaning in compound words. An adaptive training process is further employed to build a dictionary of recognisable abbreviations and acronyms. Empirical results show that over 75% of the compound words found in the Web document framework can be understood by our mechanism. With the training process, the success rate of recognising compound words increases to about 90%.
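
The abstract does not spell out the matching algorithm or its heuristic rules, so the following is only a minimal sketch of generic bi-directional maximum matching, not the authors' implementation. It assumes a plain set of known words (the hypothetical `vocab`), treats any unmatched character as a single-letter piece, and uses an assumed tie-breaking rule: prefer the segmentation with fewer unknown pieces, then the one with fewer pieces overall. The function names `forward_match`, `backward_match`, and `segment` are illustrative.

```python
def forward_match(text, vocab, max_len):
    """Greedy left-to-right longest match; unmatched characters become 1-letter pieces."""
    pieces, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in vocab:
                pieces.append(cand)
                i += size
                break
    return pieces


def backward_match(text, vocab, max_len):
    """Greedy right-to-left longest match, returned in reading order."""
    pieces, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in vocab:
                pieces.append(cand)
                j -= size
                break
    pieces.reverse()
    return pieces


def segment(text, vocab):
    """Run both passes and keep the cheaper one (assumed heuristic, see lead-in)."""
    text = text.lower()
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_match(text, vocab, max_len)
    bwd = backward_match(text, vocab, max_len)
    if fwd == bwd:
        return fwd

    def cost(pieces):
        # Fewer unknown pieces first, then fewer pieces overall.
        unknown = sum(1 for p in pieces if p not in vocab)
        return (unknown, len(pieces))

    return min(fwd, bwd, key=cost)


if __name__ == "__main__":
    vocab = {"web", "webs", "site", "home", "page"}
    print(segment("HomePage", vocab))
    print(segment("website", vocab))
```

For "website", the forward pass greedily takes "webs" and strands the letters "i", "t", "e", while the backward pass finds "site" and then "web"; the assumed cost rule therefore selects the backward result. A real system along the lines of the abstract would add further heuristics and a trained dictionary of abbreviations and acronyms, neither of which is modelled here.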