Classification of news web documents based on structural features

Authors:
Shisanu Tongchim;Virach Sornlertlamvanich;Hitoshi Isahara
Affiliations:
Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, Klong 1, Klong Luang, Pathumthani, Thailand;Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, Klong 1, Klong Luang, Pathumthani, Thailand;Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, Klong 1, Klong Luang, Pathumthani, Thailand
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

The motivation for this work comes from the need for a Thai web corpus for testing our information retrieval algorithm. Two collections of news web documents are gathered from two different Thai newspaper web sites. Our goal is to find a simple yet effective method to extract news articles from these web collections. We explore the use of machine learning methods to distinguish article pages from non-article pages, such as tables of contents and advertisements. Then, the selected web articles are compared in a fine-grained manner in order to find informative structures. Both steps of information extraction use the structural features of web documents rather than extracted keywords or terms. Thus, the inherent errors of word segmentation, one of the major problems in Thai text processing, do not affect this method.
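
To make the idea of structural features concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify which features were used, so this example assumes one plausible choice: counts of root-to-element tag paths, compared with cosine similarity. The names `TagPathCounter`, `structural_features`, and `cosine_similarity` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts root-to-element tag paths and ignores all text content.

    Assumption: tag paths stand in for the paper's "structural features";
    the abstract does not say which features were actually used.
    """

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag to tolerate sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structural_features(html):
    """Return a Counter of tag paths for one HTML document."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def cosine_similarity(a, b):
    """Cosine similarity between two tag-path count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy pages (made up for illustration): an article-like page and a link list.
article = "<html><body><h1>T</h1><p>a</p><p>b</p><p>c</p></body></html>"
index = ("<html><body><ul><li><a href='#'>x</a></li>"
         "<li><a href='#'>y</a></li></ul></body></html>")

article_features = structural_features(article)
index_features = structural_features(index)
print(article_features)
print(cosine_similarity(article_features, index_features))
```

Because the text inside each element is never inspected, two pages with the same markup but different Thai text yield identical feature vectors, which is why word segmentation errors cannot influence features of this kind. Vectors like these could then be passed to any standard classifier to separate article pages from non-article pages.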