Exploiting a proximity-based positional model to improve the quality of information extraction by text segmentation

  • Authors:
  • Dat T. Huynh; Xiaofang Zhou

  • Affiliations:
  • The University of Queensland, Australia; The University of Queensland, Australia

  • Venue:
  • ADC '13 Proceedings of the Twenty-Fourth Australasian Database Conference - Volume 137
  • Year:
  • 2013

Abstract

A large number of web pages present information about entities in the form of lists of field values. Such implicit semi-structured records are common in textual sources on the web, for example product advertisements, postal addresses, and bibliographic information. Harvesting entity information from these lists is a challenging task because the lists are manually generated, are not written in well-defined templates, or may be missing some information. In this paper, we introduce a proximity-based positional model (PPM) to improve the quality of information extraction by text segmentation (IETS). The proposed model improves on the fixed-positional model used in ONDUX, a current state-of-the-art IETS method, to revise the labels of text segments in an input list of field values. Unlike the fixed-positional model of previous work, the key idea of PPM is to define a proximity heuristic for labels in an input list within a unified language model. The model is estimated from counts of labels propagated through a proximity-based density function. We propose and study several density functions, and experimental results on different domains show that PPM is effective at revising labels and helps improve the performance of the current state-of-the-art method.
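
The idea of estimating a positional model from propagated label counts can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel, the function names, and the toy label sequences are all assumptions chosen for illustration, and the abstract does not say which density functions the paper actually studies. The sketch only shows the general mechanism, in which a label observed at one position contributes a distance-discounted count to nearby positions, and the counts at each position are then normalized into a distribution over labels.

```python
import math
from collections import defaultdict


def gaussian_kernel(i, j, sigma=1.0):
    """Hypothetical density function: weight decays with distance |i - j|."""
    return math.exp(-((i - j) ** 2) / (2 * sigma ** 2))


def propagated_counts(label_sequences, kernel=gaussian_kernel, sigma=1.0):
    """Accumulate, for every position i, the count each label propagates to i.

    A label at position j in a sequence adds kernel(i, j) to position i,
    so nearby labels contribute more than distant ones.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for labels in label_sequences:
        for i in range(len(labels)):
            for j, label in enumerate(labels):
                counts[i][label] += kernel(i, j, sigma)
    return counts


def positional_distribution(counts, position):
    """Normalize the propagated counts at one position into probabilities."""
    row = counts[position]
    total = sum(row.values())
    if total == 0:
        return {}
    return {label: c / total for label, c in row.items()}


# Toy training data (assumed): label sequences of already segmented records.
training = [
    ["name", "street", "city"],
    ["name", "city"],
]
counts = propagated_counts(training)
first = positional_distribution(counts, 0)
last = positional_distribution(counts, 2)
```

In contrast to a fixed-positional model, which would count a label only at the exact position where it occurs, every label here also receives some probability mass at neighboring positions. A smaller `sigma` in this sketch makes the estimate approach the fixed-positional behavior.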