Quantitative analysis of treebanks using frequent subtree mining methods

  • Authors: Scott Martens
  • Affiliations: KU Leuven, Leuven, Belgium
  • Venue: TextGraphs-4 Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing
  • Year: 2009

Abstract

The first task of statistical computational linguistics, or any other type of data-driven processing of language, is the extraction of counts and distributions of phenomena. This is much more difficult for the type of complex structured data found in treebanks and in corpora with sophisticated annotation than for tokenized texts. Recent developments in data mining, particularly in the extraction of frequent subtrees from treebanks, offer some solutions. We have applied a modified version of the TreeMiner algorithm to a small treebank and present some promising results.
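
To make the counting problem concrete, here is a minimal sketch of support counting over a toy treebank. It is not the modified TreeMiner algorithm the abstract refers to: it counts only complete (bottom-up) subtrees, a much simpler technique, whereas TreeMiner also finds embedded subtrees. The nested-tuple tree encoding, the function names `complete_subtrees` and `frequent_subtrees`, and the three example trees are all assumptions made for illustration.

```python
from collections import Counter

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a leaf is a one-element tuple (label,).

def complete_subtrees(tree):
    """Yield every complete (bottom-up) subtree of `tree`, including itself."""
    yield tree
    for child in tree[1:]:
        yield from complete_subtrees(child)

def frequent_subtrees(treebank, min_support):
    """Return {subtree: support} for subtrees found in >= min_support trees.

    Support is the number of distinct trees that contain the subtree,
    so repeated occurrences inside one tree count only once.
    """
    support = Counter()
    for tree in treebank:
        # set() ensures each tree contributes at most once per pattern
        support.update(set(complete_subtrees(tree)))
    return {t: n for t, n in support.items() if n >= min_support}

# Toy treebank, invented for illustration (category labels only, no words).
toy_treebank = [
    ("S", ("NP", ("DT",), ("NN",)), ("VP", ("VB",))),
    ("S", ("NP", ("DT",), ("NN",)), ("VP", ("VB",), ("NP", ("DT",), ("NN",)))),
    ("S", ("NP", ("NN",)), ("VP", ("VB",))),
]

result = frequent_subtrees(toy_treebank, min_support=2)
for subtree, count in sorted(result.items(), key=lambda item: -item[1]):
    print(count, subtree)
```

Even this restricted version shows why the task is harder than counting tokens: the number of candidate patterns grows with tree size, and admitting embedded or partial subtrees, as frequent subtree miners do, enlarges the search space much further.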