This paper presents a new method, called L-tree match, for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed whose components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with the object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined in a new extraction algorithm that can extract data from sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from biological documents totaling 100 GB for the first online integrated biological data warehouse of China.
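
To make the third point concrete, the following is a minimal toy sketch of template-driven extraction that tolerates optional, unordered, nested, and noisy components. It is not the L-tree match algorithm itself, whose details the abstract does not give. The line-oriented "TAG value" input format, the names `Node`, `tokenize`, `match`, and `extract`, and the sample tags are all assumptions made for illustration, and the sketch further assumes that each tag appears at only one level of the template.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One template component (hypothetical structure, for illustration only)."""
    tag: str                                       # line prefix identifying the component
    optional: bool = False                         # may be absent from a record
    children: list = field(default_factory=list)   # nested components, in any order


def all_tags(node):
    """Collect every tag that occurs anywhere in the template."""
    tags = {node.tag}
    for child in node.children:
        tags |= all_tags(child)
    return tags


def tokenize(text):
    """Bottom-up step: split each non-empty line into a (tag, value) pair."""
    for line in text.splitlines():
        parts = line.strip().split(maxsplit=1)
        if parts:
            yield parts[0], (parts[1] if len(parts) > 1 else "")


def match(node, tokens, pos):
    """Top-down step: match `node` at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens) or tokens[pos][0] != node.tag:
        return None, pos
    text = tokens[pos][1]
    pos += 1
    if not node.children:
        return text, pos
    record = {"_text": text}
    by_tag = {child.tag: child for child in node.children}
    # Children are accepted in whatever order they appear (unordered components).
    while pos < len(tokens) and tokens[pos][0] in by_tag:
        child = by_tag[tokens[pos][0]]
        value, pos = match(child, tokens, pos)
        record.setdefault(child.tag, []).append(value)
    missing = [c.tag for c in node.children
               if not c.optional and c.tag not in record]
    if missing:
        raise ValueError(f"{node.tag}: missing required component(s) {missing}")
    return record, pos


def extract(template, text):
    """Drop lines whose tag the template does not know (noise), then match."""
    known = all_tags(template)
    tokens = [(t, v) for t, v in tokenize(text) if t in known]
    record, _ = match(template, tokens, 0)
    return record


# Hypothetical template and input resembling a tagged biological flat file.
template = Node("ENTRY", children=[
    Node("ID"),
    Node("DESC", optional=True),
    Node("FEATURE", optional=True, children=[
        Node("POS"),
        Node("NOTE", optional=True),
    ]),
])

sample = """ENTRY e1
garbage line here
DESC a kinase
ID P12345
FEATURE domain
NOTE catalytic
POS 10..50
"""

record = extract(template, sample)
```

In this sketch the bottom-up pass only classifies lines by tag, while the top-down pass decides where each classified line belongs in the template; a line with a known tag that is not a child of the current component ends that component and is handed back to its parent.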