L-tree match: a new data extraction model and algorithm for huge text stream with noises

  • Authors:
  • Xu-Bin Deng; Yang-Yong Zhu

  • Affiliations:
  • Department of Computing and Information Technology, Fudan University, Shanghai, P.R. China; Department of Computing and Information Technology, Fudan University, Shanghai, P.R. China and Shanghai Center for Bioinformation Technology, Shanghai, P.R. China

  • Venue:
  • Journal of Computer Science and Technology
  • Year:
  • 2005

Abstract

This paper presents a new method, named L-tree match, for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed whose components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with the object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined in a new extraction algorithm that can extract data from sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.