An adaptive bottom up clustering approach for web news extraction

Authors:
Jinlin Chen;Subash Shankar;Angela Kelly;Serigne Gningue;Rathika Rajaravivarma
Affiliations:
Computer Science Dept., Queens College, CUNY, Flushing, NY;Hunter College, CUNY, New York, NY;Lehman College, CUNY, Bronx, NY;Lehman College, CUNY, Bronx, NY;City Tech., CUNY, Brooklyn, NY
Venue:
WOCC'09 Proceedings of the 18th international conference on Wireless and Optical Communications Conference
Year:
2009

References (11):
Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. Proceedings of the 10th international conference on World Wide Web.
Function-based object model towards website adaptation. Proceedings of the 10th international conference on World Wide Web.
A brief survey of web data extraction tools. ACM SIGMOD Record.
Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems.
Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence.
Visual Based Content Understanding towards Web Adaptation. AH '02 Proceedings of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems.
Improving pseudo-relevance feedback in web information retrieval using web page segmentation. WWW '03 Proceedings of the 12th international conference on World Wide Web.
Automatic web news extraction using tree edit distance. Proceedings of the 13th international conference on World Wide Web.
AUTOBIB: Automatic Extraction of Bibliographic Information on the Web. IDEAS '04 Proceedings of the International Database Engineering and Applications Symposium.
Template-independent news extraction based on visual consistency. AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2.
Hierarchical hidden Markov models for information extraction. IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence.

Abstract

This paper presents an adaptive bottom-up approach to Web news extraction based on human perception. The approach simulates how a human perceives and identifies Web news information, using an adaptive bottom-up clustering strategy to detect candidate news areas. It first detects news areas based on the content function, spatial continuity, and formatting continuity of news information. It then identifies the detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that our approach performs much better (on average more than 99% in terms of F1 value) than previous approaches such as the Tree Edit Distance and Visual Wrapper based approaches. Furthermore, our approach does not assume that the tested Web pages are built from Web templates, as the Tree Edit Distance based approach requires, nor does it need training sets, as the Visual Wrapper based approach does. The success of our approach demonstrates the strength of the perception-based Web information extraction methodology, which is a promising direction for automatic information extraction from sources whose presentation is designed for humans.
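
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of bottom-up clustering by spatial and formatting continuity, not the authors' implementation. The `Block` fields, the fixed `max_gap` threshold (the paper's adaptive threshold is not described here), and the "most text wins" rule standing in for content function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A leaf text block of a rendered page (all fields are hypothetical)."""
    text: str
    top: int     # y coordinate of the top edge, in pixels
    bottom: int  # y coordinate of the bottom edge, in pixels
    font: str    # formatting signature, e.g. "serif-14"


def is_continuous(a: Block, b: Block, max_gap: int = 20) -> bool:
    """Assumed rule: b directly follows a, close by and in the same format."""
    space_ok = 0 <= b.top - a.bottom <= max_gap  # spatial continuity
    format_ok = a.font == b.font                 # formatting continuity
    return space_ok and format_ok


def cluster_bottom_up(blocks: list[Block], max_gap: int = 20) -> list[list[Block]]:
    """Merge blocks, in top-to-bottom order, into candidate areas."""
    clusters: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.top):
        if clusters and is_continuous(clusters[-1][-1], block, max_gap):
            clusters[-1].append(block)  # extend the current area
        else:
            clusters.append([block])    # start a new area
    return clusters


def pick_news_area(clusters: list[list[Block]]) -> list[Block]:
    """Assumed stand-in for content function: the area with the most text."""
    return max(clusters, key=lambda c: sum(len(b.text) for b in c))


# Hypothetical page: navigation bar, headline, two body paragraphs, footer.
page = [
    Block("Home | World | Sports", 0, 20, "sans-10"),
    Block("Headline of the story", 40, 70, "serif-24"),
    Block("First paragraph of the story body text.", 90, 130, "serif-14"),
    Block("Second paragraph of the story body text.", 140, 180, "serif-14"),
    Block("Copyright notice", 400, 415, "sans-10"),
]
areas = cluster_bottom_up(page)
news = pick_news_area(areas)
print([b.text for b in news])
```

In this toy page the two body paragraphs are merged because they are adjacent and share a format, while the navigation bar, headline, and footer each remain separate areas. A sketch like this needs neither a page template nor a training set, which is the property the abstract emphasizes.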