Automatically extracting web data records

Authors:
Dheerendranath Mundluru;Vijay V. Raghavan;Zonghuan Wu
Affiliations:
IMshopping Inc., Santa Clara;University of Louisiana at Lafayette, Lafayette;Huawei Technologies Corp., Santa Clara
Venue:
AMT'10 Proceedings of the 6th international conference on Active media technology
Year:
2010

Abstract

It is essential for Web applications such as e-commerce portals to enrich their existing content offerings by aggregating relevant structured data (e.g., product reviews) from external Web resources. To this end, we present an algorithm for automatically extracting data records from Web pages. The algorithm uses a robust string matching technique to accurately identify the records in a Web page. Our experiments on diverse datasets (including datasets from third-party research projects) show that the proposed algorithm is highly effective and performs considerably better than two other state-of-the-art automatic data extraction systems. We have made the proposed system publicly accessible so that readers can evaluate it.
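
The abstract does not say which string matching technique the algorithm uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that each candidate region of a page has already been split into HTML fragments, reduces every fragment to its sequence of tag names, and treats consecutive fragments as data records when the normalized edit distance between their tag sequences is small. The names `tag_sequence`, `similarity`, and `find_record_runs`, along with the 0.7 threshold, are hypothetical choices made for this example.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(fragment):
    """Reduce an HTML fragment to its list of tag names (text is ignored)."""
    parser = _TagCollector()
    parser.feed(fragment)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_record_runs(fragments, threshold=0.7, min_records=2):
    """Return (start, end) index ranges of consecutive, structurally similar fragments.

    The threshold and min_records defaults are assumptions for this sketch.
    """
    seqs = [tag_sequence(f) for f in fragments]
    runs, start = [], 0
    for i in range(1, len(seqs) + 1):
        # Close the current run at the end of input or at a dissimilar neighbor.
        if i == len(seqs) or similarity(seqs[i - 1], seqs[i]) < threshold:
            if i - start >= min_records:
                runs.append((start, i))
            start = i
    return runs


fragments = [
    "<h1>Reviews</h1>",
    "<div><b>Great</b><span>5 stars</span><p>Works well.</p></div>",
    "<div><b>Okay</b><span>3 stars</span><p>Average.</p><a href='#'>more</a></div>",
    "<div><b>Bad</b><span>1 star</span><p>Broke.</p></div>",
    "<p>Footer</p>",
]
print(find_record_runs(fragments))  # [(1, 4)]
```

Comparing tag sequences approximately rather than exactly lets the three review fragments be grouped even though the second one carries an extra link, which is the kind of variation between records that an exact match would miss.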