Using main content extraction to improve performance of Vietnamese web page classification

  • Authors:
  • Nguyen Minh Trung;Nguyen Duc Tam;Nguyen Hong Phuong

  • Affiliations:
  • Hanoi University of Science and Technology;Hanoi University of Science and Technology;Hanoi University of Science and Technology

  • Venue:
  • Proceedings of the Second Symposium on Information and Communication Technology
  • Year:
  • 2011

Abstract

Web page classification is the process of assigning a web page to one or more predetermined classes. If all HTML tags are removed from a web page, the task can be treated as a text classification problem. However, this approach does not achieve high precision because of the noisy content that regular HTML documents almost always contain. To address this problem, we propose using a content extraction method to extract the main content of each web page and using that content for the classification task. Experimental results show that the proposed method significantly improves the precision of Vietnamese web page classification, from 71% to 80%. The results also indicate that context features, such as the anchor texts of reference links and the contents of "TITLE" tags, can serve as a good summary of a web page's content.
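As a rough illustration of the inputs the abstract mentions, the sketch below strips the tags from a page and separately collects the "TITLE" content and the anchor texts. It is not the authors' content extraction method, which the abstract does not describe; the names `PageFeatureParser` and `extract_features` are hypothetical, and the choice to skip `script` and `style` content and to keep the title out of the body text is an assumption made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Hypothetical helper: gathers tag-stripped text, TITLE text, and anchor texts."""

    TRACKED = {"script", "style", "title", "a"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []    # body text with tags removed
        self.title_parts = []   # contents of the TITLE tag
        self.anchor_parts = []  # anchor texts of links
        self._open = Counter()  # how many tracked tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.TRACKED and self._open[tag] > 0:
            self._open[tag] -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        # Assumption: script and style content is noise, so it is dropped.
        if self._open["script"] or self._open["style"]:
            return
        if self._open["title"]:
            self.title_parts.append(chunk)
            return
        if self._open["a"]:
            self.anchor_parts.append(chunk)
        self.text_parts.append(chunk)


def extract_features(html):
    """Return tag-stripped text plus the two context features for one page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "title": " ".join(parser.title_parts),
        "anchors": list(parser.anchor_parts),
    }


page = (
    "<html><head><title>Tin thể thao</title></head>"
    "<body><p>Đội tuyển thắng trận.</p>"
    "<a href='/bong-da'>Bóng đá</a></body></html>"
)
features = extract_features(page)
print(features["title"], features["anchors"])
```

Each of the three fields could then be passed to any text classifier, either on its own or combined. A main content extraction step, as proposed in the paper, would replace the plain tag-stripped `text` field with only the main content of the page.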