Discovery of Frequent Word Sequences in Text

Authors:
Helena Ahonen-Myka
Affiliations:
-
Venue:
Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery
Year:
2002

References (2):
Fast discovery of association rules. Advances in knowledge discovery and data mining.
Mining Sequential Patterns. ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering.

Cited by (8):
Extracting unstructured data from template generated web documents. CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management.
Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering.
Discovering Compound and Proper Nouns. RSEISP '07 Proceedings of the international conference on Rough Sets and Intelligent Systems Paradigms.
Discovering Synonyms Based on Frequent Termsets. RSEISP '07 Proceedings of the international conference on Rough Sets and Intelligent Systems Paradigms.
Authorship attribution using word sequences. CIARP'06 Proceedings of the 11th Iberoamerican conference on Progress in Pattern Recognition, Image Analysis and Applications.
A text mining approach for definition question answering. FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing.
MFSRank: an unsupervised method to extract keyphrases using semantic information. MICAI'11 Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I.
On the evaluation and improvement of Arabic WordNet coverage and usability. Language Resources and Evaluation.

Abstract

We have developed a method that extracts all maximal frequent word sequences from the documents of a collection. A sequence is said to be frequent if it appears in more than σ documents, where σ is the given frequency threshold. Furthermore, a sequence is maximal if no other frequent sequence exists that contains it. The words of a sequence do not have to appear consecutively in the text.

In this paper, we briefly describe the method for finding all maximal frequent word sequences in text and then extend the method to extract generalized sequences from annotated texts, where each word has a set of additional (e.g. morphological) features attached to it. We aim at discovering patterns that preserve as many features as possible such that the frequency of the pattern still exceeds the given frequency threshold.
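
The definitions in the abstract can be illustrated with a small brute-force sketch in Python. This is not the method described in the paper, which is not detailed here; it simply enumerates every gapped word subsequence of each document, so it is only practical for very short texts. The function names, the whitespace tokenization, and the toy documents are assumptions made for illustration; `sigma` stands for the frequency threshold.

```python
from collections import Counter
from itertools import combinations


def is_subsequence(short, long):
    """Return True if `short` occurs in `long` in order, with gaps allowed."""
    remaining = iter(long)
    return all(word in remaining for word in short)


def maximal_frequent_sequences(docs, sigma):
    """Brute-force illustration of the definitions (hypothetical helper).

    A sequence is frequent if it occurs in more than `sigma` documents,
    and maximal if no other frequent sequence contains it.
    """
    doc_freq = Counter()
    for doc in docs:
        words = doc.split()  # assumption: whitespace tokenization
        seen = set()
        # combinations() keeps positional order, so each tuple is a
        # subsequence of the document (words need not be adjacent).
        for n in range(1, len(words) + 1):
            seen.update(combinations(words, n))
        doc_freq.update(seen)  # count each sequence once per document

    frequent = [seq for seq, count in doc_freq.items() if count > sigma]
    maximal = [
        seq for seq in frequent
        if not any(seq != other and is_subsequence(seq, other)
                   for other in frequent)
    ]
    return sorted(maximal)


docs = [
    "the cat sat on the mat",
    "the cat lay on a mat",
    "a dog sat on the mat",
]
result = maximal_frequent_sequences(docs, sigma=1)
```

With `sigma=1`, a sequence must occur in at least two of the three toy documents. Shorter frequent sequences such as `("on", "mat")` are dropped because they are contained in a longer frequent sequence.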