Database management systems are becoming available for semistructured data. However, these tools cannot be used on many real-world data sources (e.g., most web sites) in their native form. Often, wrappers are needed to extract information and organize it into a graph structure that makes explicit the concepts users want to query and update. This paper presents a new approach to wrapper generation that exploits linguistic knowledge. The approach produces a more fine-grained parse of sources containing natural language text than previous efforts. The resulting graph-structured databases can answer queries that could not be formulated against databases produced by previously generated wrappers. In addition, our approach may be more robust in the face of slight variations in word choice and order. We discuss a prototype implementation, lessons learned to date, evaluation issues, and future research directions.