Automatically extracting web data records

  • Authors:
  • Dheerendranath Mundluru; Vijay V. Raghavan; Zonghuan Wu

  • Affiliations:
  • IMshopping Inc., Santa Clara; University of Louisiana at Lafayette, Lafayette; Huawei Technologies Corp., Santa Clara

  • Venue:
  • AMT'10 Proceedings of the 6th international conference on Active media technology
  • Year:
  • 2010

Abstract

It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To meet this goal, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.