List data extraction in semi-structured document

Authors:
Hui Xu;Juan-Zi Li;Peng Xu
Affiliations:
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China;Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China;Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
Venue:
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Year:
2005

Citing 2
Cited 0

Automated semantic annotation and retrieval based on sharable ontology and case-based learning techniques
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Learning to cluster web search results
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

A tremendous number of semi-structured documents are available online, such as business annual reports, online airport listings, catalogs, and hotel directories. Lists, which have structured characteristics, are used to store highly structured, database-like information in many semi-structured documents. This paper addresses list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. List data extraction benefits text mining applications on semi-structured documents. Recently, several methods have been proposed to extract list data by utilizing word layout and arrangement information [1, 2]. However, to the best of our knowledge, few previous studies have sufficiently investigated how to make use of not only layout and arrangement information but also the semantic information of words. In this paper, we propose a clustering-based method that uses both the layout information and the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from the Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.
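
To make the task concrete, the following is a minimal illustrative sketch, not the authors' method, of what "grouping list data by rows and columns" using both a layout cue and a coarse semantic cue might look like. It assumes columns are separated by runs of two or more spaces and that a cell's "semantic type" is simply numeric versus text; the function names (split_cells, cell_type, group_rows) and the sample lines are hypothetical.

```python
import re
from collections import defaultdict


def split_cells(line):
    """Split a plain-text line into cells on runs of 2+ spaces (layout cue, assumed)."""
    return [c for c in re.split(r"\s{2,}", line.strip()) if c]


def cell_type(cell):
    """Coarse semantic type (assumed): 'NUM' for numeric strings, else 'TEXT'."""
    return "NUM" if re.fullmatch(r"[-+]?[\d,]*\.?\d+%?", cell) else "TEXT"


def group_rows(lines):
    """Group lines by their tuple of cell types; each group is a candidate list."""
    groups = defaultdict(list)
    for line in lines:
        cells = split_cells(line)
        if cells:
            signature = tuple(cell_type(c) for c in cells)
            groups[signature].append(cells)
    return dict(groups)


# Hypothetical sample lines in the style of a plain-text financial report.
sample = [
    "Item            2004        2003",
    "Revenue         1,250.5     1,100.0",
    "Net profit      310.2       295.8",
]
groups = group_rows(sample)
for signature, rows in groups.items():
    print(signature, rows)
```

Grouping by an exact type signature is a deliberately crude stand-in for clustering; a real system would need a similarity measure that tolerates missing cells and irregular spacing.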