L-tree match: a new data extraction model and algorithm for huge text stream with noises

Authors:
Xu-Bin Deng;Yang-Yong Zhu
Affiliations:
Department of Computing and Information Technology, Fudan University, Shanghai, P.R. China;Department of Computing and Information Technology, Fudan University, Shanghai, P.R. China and Shanghai Center for Bioinformation Technology, Shanghai, P.R. China
Venue:
Journal of Computer Science and Technology
Year:
2005

Abstract

This paper presents L-tree match, a new method for extracting data from complex data sources. First, based on the data extraction logic introduced in this work, a new data extraction model is constructed whose components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, together with the object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined in a new extraction algorithm that can handle data sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from 100 GB of biological documents for the first online integrated biological data warehouse of China.
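
The abstract does not give the details of the L-tree match algorithm, so the following is only a toy illustration of what tolerating optional, unordered, nested, and noisy components can look like in a template-driven extractor. It is not the authors' method. The names `Field`, `Group`, `match_group`, the delimiter-based nesting, and the sample record layout are all assumptions invented for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """Leaf of the (hypothetical) template: one regex with one capture group."""
    name: str
    pattern: str
    optional: bool = False


@dataclass
class Group:
    """Inner template node. Nested groups are assumed to be delimited by
    open_pat / close_pat lines; the top-level group has neither."""
    name: str
    children: list
    open_pat: Optional[str] = None
    close_pat: Optional[str] = None
    optional: bool = False


def match_group(group, lines, pos=0):
    """Match the children of `group` in any order, starting at lines[pos].

    Lines that match no pending child are treated as noise and skipped.
    Returns (values, next_pos); raises ValueError if a required child is missing.
    """
    pending = {child.name: child for child in group.children}
    values = {}
    i = pos
    while i < len(lines):
        line = lines[i].strip()
        if group.close_pat and re.fullmatch(group.close_pat, line):
            i += 1  # consume the closing delimiter and stop
            break
        matched = False
        for name, child in list(pending.items()):
            if isinstance(child, Field):
                m = re.fullmatch(child.pattern, line)
                if m:
                    values[name] = m.group(1)
                    i += 1
                    matched = True
            elif re.fullmatch(child.open_pat, line):
                # Descend into the nested group after its opening line.
                values[name], i = match_group(child, lines, i + 1)
                matched = True
            if matched:
                del pending[name]
                break
        if not matched:
            i += 1  # noise line: skip it
    missing = [n for n, c in pending.items() if not c.optional]
    if missing:
        raise ValueError(f"{group.name}: missing required components {missing}")
    return values, i


# Hypothetical template and noisy sample, made up for this illustration.
TEMPLATE = Group("entry", [
    Field("id", r"ID\s+(\S+)"),
    Field("organism", r"ORG\s+(.+)"),
    Group("features", [
        Field("gene", r"gene:\s*(\S+)"),
        Field("note", r"note:\s*(.+)", optional=True),
    ], open_pat=r"FEATURES \{", close_pat=r"\}"),
])

SAMPLE = [
    "ORG  Homo sapiens",
    "### page break ###",
    "FEATURES {",
    "  gene: abc",
    "  garbage",
    "}",
    "ID   X001",
]

values, _ = match_group(TEMPLATE, SAMPLE)
```

In this sample the fields arrive out of template order, the optional `note` is absent, `features` is nested, and two lines are noise; `values` still holds the three extracted components. A real extractor for a 100 GB stream would also need record-boundary detection and incremental input, which this sketch leaves out.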