Web data extraction using visual features

  • Authors:
  • V. Padmadas;J. Gadge

  • Affiliations:
  • Thadomal Shahani Engg College, Mumbai;Thadomal Shahani Engg College, Mumbai

  • Venue:
  • Proceedings of the International Conference and Workshop on Emerging Trends in Technology
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Abstract

Automatic data extraction from Web pages is a challenging yet significant problem in the fields of Information Retrieval and Data Mining. The problem arises particularly on the World Wide Web, because search engines present the results of user queries in web response pages. These response pages are often decorated with sidebars, branding banners and advertisements. Automatic data extraction therefore has to separate the relevant data from the rest of the page. Though many automated and manual text analysis solutions to this problem exist, most of them depend heavily on the specifics of HTML and must be revised whenever the markup language changes. This paper proposes a novel, language-independent technique for the data extraction problem, using a combined approach that makes use of features of the DOM tree as well as the visual features of HTML elements.