Proceedings of the 10th international conference on World Wide Web
Function-based object model towards website adaptation
Proceedings of the 10th international conference on World Wide Web
A brief survey of web data extraction tools
ACM SIGMOD Record
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Visual Based Content Understanding towards Web Adaptation
AH '02 Proceedings of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems
Improving pseudo-relevance feedback in web information retrieval using web page segmentation
WWW '03 Proceedings of the 12th international conference on World Wide Web
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
AUTOBIB: Automatic Extraction of Bibliographic Information on the Web
IDEAS '04 Proceedings of the International Database Engineering and Applications Symposium
Template-independent news extraction based on visual consistency
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Hierarchical hidden Markov models for information extraction
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Hi-index | 0.00 |
An adaptive bottom up Web news extraction approach based on human perception is presented in this paper. The approach simulates how a human perceives and identifies Web news information by using an adaptive bottom up clustering strategy to detect possible news areas. It first detects news areas based on content function, space continuity, and formatting continuity of news information. It further identifies detailed news content based on the position, format, and semantic of detected news areas. Experiment results show that our approach achieves much better performance (in average more than 99% in terms of F1 Value) compared to previous approaches such as Tree Edit Distance and Visual Wrapper based approaches. Furthermore, our approach does not assume the existence of Web templates in the tested Web pages as required by Tree Edit Distance based approach, nor does it need training sets as required in Visual Wrapper based approach. The success of our approach demonstrates the strength of the perception based Web information extraction methodology and represents a promising approach for automatic information extraction from sources with presentation design for humans.