List data extraction in semi-structured document

  • Authors:
  • Hui Xu;Juan-Zi Li;Peng Xu

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China;Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China;Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China

  • Venue:
  • WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

The amount of semi-structured documents is tremendous online, such as business annual reports, online airport listings, catalogs, hotel directories, etc. List, which has structured characteristics, is used to store highly structured and database-like information in many semi-structured documents. This paper is about list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. List data extraction is of benefit to text mining applications on semi-structured documents. Recently, several methods are proposed to extract list data by utilizing the word layout and arrangement information [1, 2]. However, in the research community, few previous studies has so far sufficiently investigated the problem of making use of not only layout and arrangement information, but also the semantic information of words, to the best of our knowledge. In this paper, we propose a clustering based method making use of both the layout information and the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.