An information-theoretic based model for large-scale contextual text processing

  • Authors:
  • Patrick Perrin; Fred Petry

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 1999

Abstract

We propose a supervised, systematic approach for knowledge extraction from very large sets of full texts, such as those retrieved by search engines over the World Wide Web. The proposed model is entirely data-driven and relies only on the actual content of the texts in the selected corpus. We suggest that the relative distance between terms within a text provides sufficient information about its structure and relevant content, and therefore serves as a good basis for representing text content in knowledge-elicitation tasks. We demonstrate qualitatively that useful text structure and content can be extracted systematically by collocational lexical analysis, without the need to encode any supplemental sources of knowledge. We present an algorithm that systematically recognizes the most relevant facts in the texts and labels them with their overall theme, as dictated by local contextual information. The algorithm exploits domain-independent lexical frequencies and mutual information measures to find the relevant contextual units in the texts. We report results from experiments on a real-world textual database of psychiatric evaluation reports.
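
The abstract does not give the authors' exact formulation, but the combination of lexical frequencies and mutual information it describes is commonly realized as windowed pointwise mutual information (PMI) over co-occurring term pairs. The sketch below is illustrative only; the function name `windowed_pmi`, the window size, and the frequency cutoff are assumptions, not details from the paper.

```python
import math
from collections import Counter

def windowed_pmi(tokens, window=5, min_count=2):
    """Score co-occurring term pairs by pointwise mutual information.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where p(x, y) is
    estimated from co-occurrence within a fixed-size sliding window.
    High-PMI pairs are candidate collocations / contextual units.
    """
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Count each pair of terms that fall within the same window.
        for v in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((w, v)))] += 1

    n = len(tokens)
    scores = {}
    for (x, y), c in pairs.items():
        if c < min_count:
            continue  # skip unreliable low-frequency pairs
        p_xy = c / n  # rough normalization by corpus length
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy usage: highest-PMI pairs in a short token stream.
text = "patient reports severe anxiety patient reports mild anxiety".split()
for pair, s in sorted(windowed_pmi(text).items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 3))
```

Because the measure is purely distributional, it needs no external lexicon or encoded domain knowledge, which is consistent with the data-driven character the abstract claims for the model.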