Complex Terminology Extraction Model from Unstructured Web Text Based Linguistic and Statistical Knowledge

  • Authors:
  • Fethi Fkih;Mohamed Nazih Omri

  • Affiliations:
  • MARS Research Unit, Faculty of sciences of Monastir, University of Monastir, Monastir, Tunisia;MARS Research Unit, Faculty of sciences of Monastir, University of Monastir, Monastir, Tunisia

  • Venue:
  • International Journal of Information Retrieval Research
  • Year:
  • 2012

Abstract

Textual data remain the most interesting source of information on the web. In their research, the authors focus on a very specific kind of information, namely "complex terms". Complex terms are defined as semantic units composed of several lexical units that can describe the content of a text in a relevant and exhaustive way. In this paper, they present COTEM, a new model for complex terminology extraction that integrates linguistic and statistical knowledge. The authors focus on three main contributions. First, they show that a linear Conditional Random Field (CRF) can be used for complex terminology extraction from a specialized text corpus. Second, they demonstrate the ability of a CRF to model linguistic knowledge by incorporating grammatical observations into its features. Finally, they present the benefits that integrating statistical knowledge brings to the quality of the terminology extraction.
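
As a rough illustration of the two kinds of knowledge the abstract mentions, the sketch below builds per-token feature dictionaries that include grammatical (part-of-speech) observations, in the shape commonly fed to a linear-chain CRF tagger, and computes a simple statistical association score for a word pair. The abstract does not specify the feature set, the tag scheme, or the statistical measure used in COTEM, so the feature names, the BIO-style labels, and the choice of pointwise mutual information (PMI) are all assumptions made for illustration only.

```python
import math
from collections import Counter


def token_features(tokens, pos_tags, i):
    """Feature dict for token i: lexical form plus grammatical (POS) context.

    The feature names are hypothetical; they are not taken from the paper.
    """
    return {
        "word.lower": tokens[i].lower(),
        "pos": pos_tags[i],
        "pos.prev": pos_tags[i - 1] if i > 0 else "BOS",
        "pos.next": pos_tags[i + 1] if i < len(tokens) - 1 else "EOS",
    }


def pmi(bigram, unigram_counts, bigram_counts):
    """Pointwise mutual information of a word pair (assumed statistical measure)."""
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    p_xy = bigram_counts[bigram] / n_bi
    p_x = unigram_counts[bigram[0]] / n_uni
    p_y = unigram_counts[bigram[1]] / n_uni
    return math.log2(p_xy / (p_x * p_y))


# Toy sentence with hand-assigned POS tags and BIO-style term labels (assumed scheme).
tokens = ["Conditional", "random", "field", "models", "text"]
pos_tags = ["ADJ", "ADJ", "NOUN", "VERB", "NOUN"]
labels = ["B-TERM", "I-TERM", "I-TERM", "O", "O"]

# One feature dict per token; a CRF toolkit would be trained on (features, labels) pairs.
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]

# Toy corpus counts for the statistical score.
corpus = ["random", "field", "random", "field", "model", "text"]
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
score = pmi(("random", "field"), unigram_counts, bigram_counts)
```

A score of this kind could be used either as an extra CRF feature or as a filter on candidate terms. Which of these (if either) COTEM does is not stated in the abstract.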