An adaptive bottom up clustering approach for web news extraction

  • Authors:
  • Jinlin Chen;Subash Shankar;Angela Kelly;Serigne Gningue;Rathika Rajaravivarma

  • Affiliations:
  • Computer Science Dept., Queens College, CUNY, Flushing, NY;Hunter College, CUNY, New York, NY;Lehman College, CUNY, Bronx, NY;Lehman College, CUNY, Bronx, NY;City Tech., CUNY, Brooklyn, NY

  • Venue:
  • WOCC'09 Proceedings of the 18th international conference on Wireless and Optical Communications Conference
  • Year:
  • 2009

Abstract

An adaptive bottom-up Web news extraction approach based on human perception is presented in this paper. The approach simulates how a human perceives and identifies Web news information, using an adaptive bottom-up clustering strategy to detect possible news areas. It first detects news areas based on the content function, space continuity, and formatting continuity of news information. It then identifies detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that our approach achieves much better performance (on average more than 99% in terms of F1 value) than previous approaches such as the Tree Edit Distance and Visual Wrapper based approaches. Furthermore, our approach does not assume the existence of Web templates in the tested Web pages, as the Tree Edit Distance based approach requires, nor does it need training sets, as the Visual Wrapper based approach does. The success of our approach demonstrates the strength of the perception-based Web information extraction methodology and represents a promising approach to automatic information extraction from sources whose presentation is designed for humans.
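
The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of what bottom-up clustering by space and formatting continuity could look like. It is not the authors' implementation. The `Block` type, its fields, the `max_gap` threshold, and the "largest text cluster" heuristic are all assumptions introduced here for illustration; a real system would work on rendered DOM nodes and would also consider content function and semantics.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical leaf block of a rendered page (assumed layout data)."""
    text: str
    top: int      # vertical start position in pixels
    bottom: int   # vertical end position in pixels
    font: str     # simplified stand-in for the block's formatting


def continuous(a: Block, b: Block, max_gap: int = 20) -> bool:
    """True when b follows a closely on the page and shares its formatting."""
    space_ok = 0 <= b.top - a.bottom <= max_gap   # space continuity (assumed rule)
    format_ok = a.font == b.font                  # formatting continuity (assumed rule)
    return space_ok and format_ok


def cluster_bottom_up(blocks: list[Block], max_gap: int = 20) -> list[list[Block]]:
    """Merge neighbouring leaf blocks into clusters, scanning top to bottom."""
    clusters: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.top):
        if clusters and continuous(clusters[-1][-1], block, max_gap):
            clusters[-1].append(block)   # extend the current candidate area
        else:
            clusters.append([block])     # start a new candidate area
    return clusters


def largest_text_cluster(clusters: list[list[Block]]) -> list[Block]:
    """Pick the cluster with the most text as the candidate news area (a heuristic)."""
    return max(clusters, key=lambda c: sum(len(b.text) for b in c))


if __name__ == "__main__":
    page = [
        Block("Home | World | Sports", top=0, bottom=20, font="nav"),
        Block("First paragraph of the story.", top=100, bottom=150, font="body"),
        Block("Second paragraph of the story.", top=160, bottom=220, font="body"),
        Block("Copyright notice", top=400, bottom=420, font="small"),
    ]
    news = largest_text_cluster(cluster_bottom_up(page))
    print([b.text for b in news])
```

In this toy page the two body paragraphs are merged because they are close together and share a font, while the navigation bar and footer remain separate clusters. Note that this sketch uses a fixed `max_gap`; the "adaptive" element described in the abstract is not modeled here.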