Web page classification is the process of assigning a web page to one or more predetermined classes. If all HTML tags are removed from a web page, the task can be treated as a text classification problem. However, this approach does not achieve high precision because of the noisy content that regular HTML documents contain. To address this problem, we propose using a content extraction method to extract the main content of each web page and using that content for the classification task. Experimental results show that the proposed method significantly improves the precision of Vietnamese web page classification from 71% to 80%. They also indicate that context features, such as the anchor texts of reference links and the contents of the "TITLE" tag, can serve as good summaries of web page content.
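The idea of separating main content from noise, and of keeping the title and anchor texts as context features, can be illustrated with a minimal sketch. The abstract does not specify the content extraction method, so the tag-based heuristic below is only an assumption made for illustration: the `NOISE_TAGS` set, the `PageFeatureParser` class, and the `extract_features` function are hypothetical names, not part of the proposed method. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Tags whose text is treated as noise in this sketch.
# This list is an assumption for illustration, not the paper's rule.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class PageFeatureParser(HTMLParser):
    """Collects title text, anchor texts, and remaining body text separately."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.anchors = []
        self.body = []
        self._stack = []  # open tags we track: noise tags, <title>, <a>

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS or tag in ("title", "a"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag (and anything nested in it),
        # which tolerates sloppy markup with missing end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(t in NOISE_TAGS for t in self._stack):
            return  # inside a noisy block: drop the text
        if "title" in self._stack:
            self.title.append(text)
        elif "a" in self._stack:
            self.anchors.append(text)
        else:
            self.body.append(text)


def extract_features(html):
    """Return title, anchor texts, and main text as separate feature fields."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "anchors": parser.anchors,
        "main_text": " ".join(parser.body),
    }
```

Under these assumptions, the `main_text` field would be passed to a text classifier in place of the full tag-stripped page, while `title` and `anchors` could be added as extra features. A real system would likely need a more robust extraction step than a fixed tag list.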