Classification of news web documents based on structural features

  • Authors:
  • Shisanu Tongchim;Virach Sornlertlamvanich;Hitoshi Isahara

  • Affiliations:
  • Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, Klong 1, Klong Luang, Pathumthani, Thailand (all authors)

  • Venue:
  • FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
  • Year:
  • 2006

Abstract

This work is motivated by the need for a Thai web corpus for testing our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method for extracting news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from non-article pages, e.g. tables of contents and advertisements. The selected article pages are then compared in a fine-grained manner in order to find informative structures. Both steps of information extraction use the structural features of web documents rather than extracted keywords or terms. Thus, the inherent errors of word segmentation, one of the major problems in Thai text processing, do not affect this method.
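
To make the first step concrete, here is a minimal sketch of classifying pages from structure alone, without looking at the words themselves. It is an illustration under stated assumptions, not the method used in the paper: the abstract does not name the features or the learning algorithm, so the three features (paragraph count, link count, share of text inside links), the nearest-centroid classifier, and all function names below are hypothetical choices.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts tags and text lengths; the words themselves are never inspected."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_chars = 0   # characters of visible text
        self.link_chars = 0   # characters of text inside <a> elements
        self._in_link = 0     # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        n = len(data.strip())
        self.text_chars += n
        if self._in_link:
            self.link_chars += n


def structural_features(html):
    """Assumed feature set: [<p> count, <a> count, share of text inside links]."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    text = parser.text_chars
    return [
        parser.tag_counts.get("p", 0),
        parser.tag_counts.get("a", 0),
        parser.link_chars / text if text else 0.0,
    ]


def train(examples):
    """examples: list of (html, label) pairs. Returns {label: centroid vector}."""
    by_label = {}
    for html, label in examples:
        by_label.setdefault(label, []).append(structural_features(html))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def classify(model, html):
    """Returns the label whose centroid is closest in squared Euclidean distance."""
    feats = structural_features(html)

    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(model, key=lambda label: distance(model[label]))
```

Because the features are character counts and tag counts, no word segmentation is needed, which is the property the abstract emphasizes. The features here are unscaled, so in practice they would likely need normalization and a larger labeled sample before any learning method could be compared fairly.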