Retrieving informative content from web pages with conditional learning of support vector machines and semantic analysis

Authors:
Piotr Ładyżyński;Przemysław Grzegorzewski
Affiliations:
Faculty of Mathematics and Computer Science, Warsaw University of Technology, Warsaw, Poland;Faculty of Mathematics and Computer Science, Warsaw University of Technology, Warsaw, Poland
Venue:
ICAISC'12 Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II
Year:
2012

Abstract

We propose a new system that extracts informative content from news pages and divides it into prescribed sections. The system is based on a machine learning classifier that incorporates different kinds of information (styles, linguistic information, structural information, content semantic analysis) and conditional learning. According to empirical results, the suggested system seems to be a promising tool for extracting information from the web.
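
The abstract gives no implementation details, so the following is only a rough sketch of the general idea of classifying page blocks as informative or not with a linear support vector machine. The block representation, the three features in block_features, the toy training data and the training settings are all assumptions made for illustration; the sketch does not cover the semantic analysis or the conditional learning step, and it should not be read as the authors' method.

```python
import random

def block_features(block):
    """Map a page block to a small feature vector (hypothetical features)."""
    text = block["text"]
    n_words = len(text.split())
    return [
        min(n_words / 50.0, 1.0),                  # structural: capped text length
        block["link_chars"] / max(len(text), 1),   # structural: share of linked text
        1.0 if block["tag"] == "p" else 0.0,       # style/markup: paragraph tag
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200, seed=0):
    """Stochastic sub-gradient descent on the L2-regularised hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (dot(w, X[i]) + b)
            # Regularisation shrinks the weights on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 (informative content) or -1 (boilerplate)."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy training data (invented for this sketch): two article paragraphs,
# one navigation item and one footer block.
nav_text = "Home News Sport Contact"
footer_text = "Privacy policy Terms of use"
train_blocks = [
    {"tag": "p", "text": "The council approved the new budget on Monday. " * 5, "link_chars": 0},
    {"tag": "p", "text": "Officials said the plan takes effect next year. " * 8, "link_chars": 0},
    {"tag": "li", "text": nav_text, "link_chars": len(nav_text)},
    {"tag": "div", "text": footer_text, "link_chars": len(footer_text)},
]
labels = [1, 1, -1, -1]

X = [block_features(blk) for blk in train_blocks]
w, b = train_linear_svm(X, labels)

# Two unseen blocks: a long paragraph without links and a short linked item.
new_article = {"tag": "p", "text": "Rain is expected across the region this weekend. " * 6, "link_chars": 0}
about_text = "About us Careers"
new_nav = {"tag": "li", "text": about_text, "link_chars": len(about_text)}

print(predict(w, b, block_features(new_article)))
print(predict(w, b, block_features(new_nav)))
```

In a realistic setting the feature vector would be much richer and the training set far larger; the point here is only how heterogeneous block-level cues can be combined into a single linear decision function.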