Word segmentation and recognition for web document framework

Authors:
Chi-Hung Chi;Chen Ding;Andrew Lim
Affiliations:
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260;School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260;School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260
Venue:
Proceedings of the eighth international conference on Information and knowledge management
Year:
1999

Abstract

We observe that a better approach to understanding Web information is to base it on the document framework, which consists mainly of (i) the title and the URL name of the page, (ii) the titles and the URL names of the Web pages that it points to, (iii) the alternative information source for the embedded Web objects, and (iv) its linkage to other Web pages of the same document. Investigation reveals that a high percentage of the words inside the document framework are “compound words” that ordinary dictionaries cannot resolve. They may be abbreviations, acronyms, or concatenations of several (partial) words. To recover the content hierarchy of Web documents, we propose a new word segmentation and recognition mechanism for understanding the information derived from the Web document framework. A maximal bi-directional matching algorithm with heuristic rules is used to resolve ambiguous segmentation and meaning in compound words. An adaptive training process is further employed to build a dictionary of recognisable abbreviations and acronyms. Empirical results show that over 75% of the compound words found in the Web document framework can be understood by our mechanism. With the training process, the success rate of recognising compound words increases to about 90%.
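The bi-directional maximal matching idea mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's heuristic rules, so the tie-breaking rule used here (prefer the segmentation with fewer unknown pieces, then fewer pieces overall) is an assumption, as are the function names and the toy dictionary. It should be read as an illustration of the general technique rather than the authors' implementation.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        # The smallest start index gives the longest candidate ending at j.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in dictionary or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    tokens.reverse()
    return tokens


def segment(text, dictionary):
    """Run both scans and keep the better result.

    Assumed heuristic (not taken from the paper): fewer unknown
    pieces wins, then fewer pieces; ties go to the forward scan.
    """
    text = text.lower()
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)

    def cost(tokens):
        unknown = sum(1 for t in tokens if t not in dictionary)
        return (unknown, len(tokens))

    return min(fwd, bwd, key=cost)


# Hypothetical toy dictionary, for illustration only.
words = {"new", "news", "sport", "home", "page"}
print(segment("homepage", words))   # ['home', 'page']
print(segment("newsport", words))   # ['new', 'sport']
```

For "newsport", the forward scan greedily takes "news" and is left with unrecognised single characters, while the backward scan finds "sport" and then "new"; the assumed cost rule therefore selects the backward result. Abbreviations and acronyms would not be handled by this sketch, since they depend on the trained dictionary the abstract describes.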