Data Extraction From Repositories On The Web: A Semi-Automatic Approach

Authors:
Coşkun Bayrak;Hayrettin Kolukısaoğlu;Steve Sieloff
Affiliations:
Computer Science Department, University of Arkansas at Little Rock, Little Rock, AR, U.S.A.;Computer Science Department, University of Arkansas at Little Rock, Little Rock, AR, U.S.A.;Acxiom Corporation, Little Rock, AR, U.S.A.
Venue:
Journal of Integrated Design & Process Science
Year:
2003

Abstract

The World Wide Web (WWW) is becoming the most important source of information for business intelligence and information dissemination. Manual information-gathering techniques such as surfing and sifting are proving insufficient for processing the vast volumes of data readily available on the Web. In addition, companies are being forced to integrate this vast data repository within specific cost, time, and reliability constraints. This paper presents the fundamentals of a system called "Browser Harness" (B2H) that extracts requested data from Web sites in a supervised fashion. The algorithmic background of the system is based on the tag structure of web pages, since HTML is the predominant choice for rendering web page content on the WWW. B2H is an interactive tool for harnessing data from semi-structured and structured web pages: it analyzes the tag structure of the input page and locates the data in the HTML code. The extracted data is then exported to XML, delimited text, or database tables.
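
The tag-structure idea described in the abstract can be illustrated with a minimal Python sketch. This is not the B2H algorithm, which the abstract does not specify; it only assumes that the requested data sits in the cells of a well-formed HTML table and that the first row holds field names usable as XML element names. The class and function names (CellExtractor, extract_rows, to_delimited, to_xml) and the sample page are hypothetical, chosen for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Hypothetical helper: collect <td>/<th> text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row under construction, None outside <tr>
        self._cell = None   # text fragments of the open cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while a cell is open; all other text is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Locate the data by its position in the tag structure."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_delimited(rows, delimiter=","):
    """Export rows as delimited text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Export rows as XML; assumes the first row holds valid element names."""
    header, *records = rows
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record")
        for name, value in zip(header, rec):
            ET.SubElement(node, name.lower().replace(" ", "_")).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample page, for illustration only.
SAMPLE = """
<html><body><table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td>24.50</td></tr>
</table></body></html>
"""

rows = extract_rows(SAMPLE)
print(to_delimited(rows, delimiter="\t"))
print(to_xml(rows))
```

A supervised tool would presumably let the user point at the region of interest rather than hard-coding table cells as done here; export to database tables is omitted to keep the sketch short.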