DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available on the Web are in HTML, which was originally designed for document formatting with little consideration of the structure of its contents. Effectively extracting data from such documents remains a non-trivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes an HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also able to produce a wrapper that extracts the required data from input pages with high accuracy.
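The workflow described above (schema, sample mappings, induced rules, generated wrapper) can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not detail: here a "rule" is simply the indexed tag path of the element whose text equals the user's sample value, and the names `PathCollector`, `induce_rules`, and `make_wrapper` are hypothetical. The sketch also assumes well-formed pages with every tag explicitly closed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PathCollector(HTMLParser):
    """Records the first text seen at each indexed tag path, e.g. html[0]/body[0]/h1[0].

    Assumption: the page is well formed and every tag is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # path segments of the currently open elements
        self.counts = [{}]   # per-depth counters of sibling tags seen so far
        self.texts = {}      # tag path -> text content

    def handle_starttag(self, tag, attrs):
        idx = self.counts[-1].get(tag, 0)
        self.counts[-1][tag] = idx + 1
        self.stack.append(f"{tag}[{idx}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.texts.setdefault("/".join(self.stack), text)


def collect(html):
    """Parse a page and return its {tag path: text} map."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def induce_rules(sample_html, sample_mapping):
    """Hypothetical rule induction: map each schema field to the tag path
    whose text equals the sample value supplied by the user."""
    texts = collect(sample_html)
    rules = {}
    for field, value in sample_mapping.items():
        for path, text in texts.items():
            if text == value:
                rules[field] = path
                break
    return rules


def make_wrapper(root_name, schema, rules):
    """Return a function that turns an HTML page into XML following the schema."""
    def wrapper(html):
        texts = collect(html)
        root = ET.Element(root_name)
        for field in schema:
            ET.SubElement(root, field).text = texts.get(rules.get(field), "")
        return ET.tostring(root, encoding="unicode")
    return wrapper


# Usage: one sample page with sample mappings, then a new page with the same layout.
schema = ["name", "price"]
sample_page = "<html><body><h1>Blue Mug</h1><p>9.50</p></body></html>"
rules = induce_rules(sample_page, {"name": "Blue Mug", "price": "9.50"})
wrapper = make_wrapper("product", schema, rules)
print(wrapper("<html><body><h1>Red Cup</h1><p>4.25</p></body></html>"))
```

Exact path matching is the simplest possible rule form and breaks as soon as the page layout shifts; a practical system would need rules that generalize across several samples.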