CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
The vast amount of user-generated content on the Web has increased the need to process web page content automatically. Web page segmentation and noise removal (the removal of non-informative segments) are important pre-processing steps in applications such as sentiment analysis, text summarization, and information retrieval. Currently, these two tasks tend to be handled separately, or handled together without accounting for the diversity of web corpora or detecting the web page type. We present a unified approach that provides robust identification of the informative textual parts of web pages along with accurate page type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and removes noisy parts from three major categories of pages that contain user-generated content (News, Blogs, Discussions). Using a human-annotated corpus that covers diverse topics, domains, and templates, we demonstrate the learning abilities of our algorithm and examine both its effectiveness in extracting the informative textual parts and its use as a rule-based classifier for web page type detection in a realistic web setting.