Extracting unstructured data from template generated web documents

  • Authors: Ling Ma; Nazli Goharian; Abdur Chowdhury; Misun Chung
  • Affiliations: Illinois Institute of Technology; Illinois Institute of Technology; America Online Inc.; Illinois Institute of Technology
  • Venue: CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
  • Year: 2003

Abstract

We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases retrieval precision for queries that would otherwise return irrelevant results. We believe that reducing the number of irrelevant results encourages users to return to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.
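
The abstract does not describe the algorithm itself, so the following is only an illustrative sketch of the general idea of separating template from body, not the authors' method. It assumes that text segments repeated across most pages of a site belong to the template, and it treats each non-empty line as one segment. The function names, the `min_fraction` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def split_segments(page_text):
    """Split a page into non-empty, whitespace-trimmed lines (assumed segment unit)."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def find_template_segments(pages, min_fraction=0.8):
    """Return segments that occur in at least min_fraction of the pages.

    Assumption: text repeated on most pages of a site is template, not content.
    """
    doc_freq = Counter()
    for page in pages:
        # Count each segment once per page (document frequency).
        doc_freq.update(set(split_segments(page)))
    threshold = min_fraction * len(pages)
    return {seg for seg, count in doc_freq.items() if count >= threshold}


def extract_body(page_text, template_segments):
    """Keep only the segments of a page that are not part of the template."""
    return [seg for seg in split_segments(page_text) if seg not in template_segments]


# Hypothetical pages from one site, sharing a navigation bar and a footer.
pages = [
    "Home | Markets | News\nStocks rally on earnings\nCopyright Example Site",
    "Home | Markets | News\nOil prices slip\nCopyright Example Site",
    "Home | Markets | News\nFed holds rates steady\nCopyright Example Site",
]

template = find_template_segments(pages)
bodies = [extract_body(page, template) for page in pages]
```

Indexing only the `bodies` text would keep a query such as "Markets" from matching every page through the shared navigation bar, which is the kind of irrelevant result the abstract refers to.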