Unsupervised strategies for information extraction by text segmentation

  • Authors:
  • Eli Cortez; Altigran S. da Silva

  • Affiliations:
  • Universidade Federal do Amazonas, Manaus, AM, Brazil; Universidade Federal do Amazonas, Manaus, AM, Brazil

  • Venue:
  • Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
  • Year:
  • 2010

Abstract

Information extraction by text segmentation (IETS) applies to cases in which the data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. We report here partial results from PhD thesis work in which we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, it relies on very effective matching strategies instead of explicit learning strategies. The effectiveness of these matching strategies is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that uses the sequencing and positioning of attribute values, learned on demand directly from the test data with no prior human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by an experimental evaluation we carried out with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
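
The two-stage idea described in the abstract (match segments against pre-existing data, then reinforce uncertain labels with positional evidence gathered from the input itself) can be illustrated with a toy sketch. This is not the ONDUX implementation: the knowledge base, the token-overlap score, the comma-based segmentation, and every function name below are assumptions made for illustration, and the sequencing part of the reinforcement step is omitted.

```python
from collections import Counter, defaultdict

# Hypothetical knowledge base: attribute -> values taken from pre-existing data.
KNOWLEDGE_BASE = {
    "street": ["Main Street", "Oak Avenue", "Park Road"],
    "city": ["Springfield", "Manaus", "Portland"],
    "state": ["Oregon", "Amazonas", "Illinois"],
}


def build_vocab(kb):
    """Collect the lowercase tokens observed for each attribute."""
    return {
        attr: Counter(tok.lower() for value in values for tok in value.split())
        for attr, values in kb.items()
    }


def segment(record):
    """Assumption: segments are comma-separated (real inputs are harder)."""
    return [part.strip() for part in record.split(",")]


def match_score(seg, attr_vocab):
    """Fraction of the segment's tokens that occur in the attribute's values."""
    tokens = seg.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in attr_vocab) / len(tokens)


def label_by_matching(segments, vocab):
    """Matching stage: pick the best-scoring attribute, or None if nothing matches."""
    labels = []
    for seg in segments:
        scores = {attr: match_score(seg, v) for attr, v in vocab.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] > 0 else None)
    return labels


def positional_model(labeled_records):
    """Count which attribute the matching stage placed at each position."""
    counts = defaultdict(Counter)
    for labels in labeled_records:
        for pos, attr in enumerate(labels):
            if attr is not None:
                counts[pos][attr] += 1
    return counts


def reinforce(labels, pos_counts):
    """Reinforcement stage: fill unlabeled segments with the most common
    attribute seen at the same position in the input being processed."""
    out = []
    for pos, attr in enumerate(labels):
        if attr is None and pos_counts[pos]:
            attr = pos_counts[pos].most_common(1)[0][0]
        out.append(attr)
    return out


if __name__ == "__main__":
    records = [
        "12 Main Street, Springfield, Illinois",
        "45 Oak Avenue, Manaus, Amazonas",
        "7 Elm Lane, Portland, Oregon",  # "Elm Lane" is absent from the knowledge base
    ]
    vocab = build_vocab(KNOWLEDGE_BASE)
    first_pass = [label_by_matching(segment(r), vocab) for r in records]
    model = positional_model(first_pass)
    for record, labels in zip(records, first_pass):
        print(record, "->", reinforce(labels, model))
```

In this sketch the third record's first segment gets no label from matching alone, and the positional counts collected from the other records supply "street". The positional model is built from the same input that is being extracted, which mirrors the abstract's point that no separate human-labeled training set is needed.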