Discovery of Frequent Word Sequences in Text

  • Authors: Helena Ahonen-Myka

  • Venue: Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery
  • Year: 2002

Abstract

We have developed a method that extracts all maximal frequent word sequences from the documents of a collection. A sequence is frequent if it appears in more than σ documents, where σ is a given frequency threshold. A sequence is maximal if no other frequent sequence contains it. The words of a sequence do not have to appear consecutively in the text.

In this paper, we briefly describe the method for finding all maximal frequent word sequences in text and then extend it to extract generalized sequences from annotated texts, in which each word has a set of additional features (e.g. morphological ones) attached to it. We aim to discover patterns that preserve as many features as possible while their frequency still exceeds the given threshold.
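
The definitions in the abstract can be illustrated with a small brute-force sketch. This is not the method described in the paper: it simply enumerates every gapped word subsequence up to a length cap, counts the number of documents containing each one, and keeps those that are frequent and not contained in a longer frequent sequence. The function names, the `max_len` cap, and the whitespace tokenization are assumptions made for illustration, and the approach is only practical for very short documents.

```python
from itertools import combinations


def subsequences(words, max_len):
    """Return all distinct word subsequences (gaps allowed) of length 1..max_len."""
    found = set()
    for n in range(1, min(max_len, len(words)) + 1):
        for idx in combinations(range(len(words)), n):
            found.add(tuple(words[i] for i in idx))
    return found


def is_subsequence(short, long):
    """True if `short` occurs in `long` in order, not necessarily consecutively."""
    it = iter(long)
    return all(word in it for word in short)


def maximal_frequent_sequences(docs, sigma, max_len=4):
    """Brute-force illustration; `max_len` is an assumed cap, so maximality
    is only relative to sequences of at most that length."""
    # Document frequency: each document counts a sequence at most once.
    doc_freq = {}
    for doc in docs:
        for seq in subsequences(doc.split(), max_len):
            doc_freq[seq] = doc_freq.get(seq, 0) + 1

    # Frequent: appears in MORE THAN sigma documents.
    frequent = {seq for seq, count in doc_freq.items() if count > sigma}

    # Maximal: no other frequent sequence contains it.
    return {
        seq
        for seq in frequent
        if not any(seq != other and is_subsequence(seq, other) for other in frequent)
    }


docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "a dog sat on a mat",
]
print(sorted(maximal_frequent_sequences(docs, sigma=1, max_len=5)))
# [('sat', 'on', 'mat'), ('the', 'cat', 'on', 'the', 'mat')]
```

With a threshold of 1, a sequence must occur in at least two documents. The sequence "the cat on the mat" is shared by the first two documents even though a different word interrupts it in each, which shows why the words need not be consecutive.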