Discovering Compound and Proper Nouns

  • Authors:
  • Grzegorz Protaziuk; Marzena Kryszkiewicz; Henryk Rybinski; Alexandre Delteil

  • Affiliations:
  • ICS, Warsaw University of Technology; ICS, Warsaw University of Technology; ICS, Warsaw University of Technology; France Telecom R&D

  • Venue:
  • RSEISP '07 Proceedings of the international conference on Rough Sets and Intelligent Systems Paradigms
  • Year:
  • 2007

Abstract

The identification of appropriate text tokens (words or sequences of words representing concepts) is one of the most important tasks of text preprocessing and may greatly influence the final results of text analysis. In this paper, we introduce a new approach to discovering compound nouns, including proper compound nouns. Our approach combines data mining methods with shallow lexical analysis. We propose a simple pattern language for specifying the grammatical patterns that extracted compound nouns must satisfy. Our method requires the words to be annotated with part-of-speech tags and is, to this extent, language-dependent. We propose T-GSP, a modification of the GSP data mining algorithm, for extracting frequent text patterns and, in particular, frequent word sequences that satisfy given grammatical rules. The obtained sequences are regarded as candidates for compound nouns. Our experiments indicate that the method yields high-quality results.
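
The general idea of keeping only frequent word sequences whose part-of-speech tags satisfy a grammatical pattern can be illustrated with a minimal Python sketch. This is not the T-GSP algorithm or the pattern language from the paper: the function name `frequent_tagged_ngrams`, the tag names (`ADJ`, `NOUN`, `PROPN`), the regular expression standing in for a grammatical pattern, and the brute-force counting of contiguous word sequences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical grammatical pattern, written as a regex over space-separated
# tags: one or more modifiers (ADJ, NOUN, PROPN) followed by a NOUN or PROPN head.
COMPOUND_PATTERN = re.compile(r"((ADJ|NOUN|PROPN) )+(NOUN|PROPN)")


def frequent_tagged_ngrams(sentences, pattern=COMPOUND_PATTERN,
                           min_support=2, max_len=4):
    """Return {word tuple: support} for contiguous word sequences whose
    tag sequence fully matches `pattern`.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Support is the number of sentences containing the sequence.
    """
    support = Counter()
    for sentence in sentences:
        seen = set()  # count each sequence at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                window = sentence[i:i + n]
                tags = " ".join(tag for _, tag in window)
                if pattern.fullmatch(tags):
                    seen.add(tuple(word for word, _ in window))
        support.update(seen)
    # Keep only sequences that reach the minimum support threshold.
    return {seq: count for seq, count in support.items() if count >= min_support}


sentences = [
    [("data", "NOUN"), ("mining", "NOUN"), ("is", "VERB"), ("useful", "ADJ")],
    [("we", "PRON"), ("use", "VERB"), ("data", "NOUN"), ("mining", "NOUN")],
    [("Warsaw", "PROPN"), ("University", "PROPN"), ("uses", "VERB"),
     ("text", "NOUN"), ("analysis", "NOUN")],
]
print(frequent_tagged_ngrams(sentences, min_support=2))
# {('data', 'mining'): 2}
```

A GSP-style miner would instead build longer candidates level by level from shorter frequent sequences and prune early; the exhaustive enumeration here only keeps the sketch short.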