Using main content extraction to improve performance of Vietnamese web page classification

Authors:
Nguyen Minh Trung;Nguyen Duc Tam;Nguyen Hong Phuong
Affiliations:
Hanoi University of Science and Technology;Hanoi University of Science and Technology;Hanoi University of Science and Technology
Venue:
Proceedings of the Second Symposium on Information and Communication Technology
Year:
2011

Abstract

Web page classification is the process of assigning a web page to one or more predetermined classes. If all HTML tags are removed from a web page, the task can be treated as a text classification problem. However, this approach does not achieve high precision because of the noisy content that is always present in regular HTML documents. To address this problem, we propose using a content extraction method to extract the main content of each web page and using that content for the classification task. Experimental results show that the proposed method significantly improves the precision of Vietnamese web page classification, from 71% to 80%. They also indicate that context features, such as the anchor texts of reference links and the contents of the "TITLE" tag, can serve as a good summary of web page content.
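
The abstract does not specify the extraction algorithm, so the following is only a minimal sketch of the tag-stripping baseline and of the context features it mentions (TITLE contents and anchor texts), using Python's standard `html.parser`. The names `PageFeatures` and `extract_features` are hypothetical and are not taken from the paper; the main content extraction step itself is not reproduced here.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collects TITLE text, anchor texts, and visible body text from a page.

    Illustrative only: this strips tags and gathers context features; it is
    not the content extraction method proposed in the paper.
    """

    TRACKED = ("title", "a", "script", "style")

    def __init__(self):
        super().__init__()
        # Nesting depth of each tracked element at the current position.
        self.depth = {tag: 0 for tag in self.TRACKED}
        self.title = []
        self.anchors = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.depth:
            self.depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.depth and self.depth[tag] > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        chunk = data.strip()
        # Script and style bodies are never visible text, so drop them.
        if not chunk or self.depth["script"] or self.depth["style"]:
            return
        if self.depth["title"]:
            self.title.append(chunk)
            return
        if self.depth["a"]:
            self.anchors.append(chunk)
        self.text.append(chunk)


def extract_features(html):
    """Return the title, anchor texts, and tag-free text of an HTML string."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "anchors": parser.anchors,
        "text": " ".join(parser.text),
    }
```

The `text` field corresponds to the plain tag-stripped page, which still contains navigation and other noise; a main content extractor would replace that field with a cleaner subset before the text is passed to a classifier.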