Quantitative analysis of treebanks using frequent subtree mining methods

Authors:
Scott Martens
Affiliations:
KU Leuven, Leuven, Belgium
Venue:
TextGraphs-4 Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing
Year:
2009

Abstract

The first task of statistical computational linguistics, or any other type of data-driven processing of language, is the extraction of counts and distributions of phenomena. This is much more difficult for the type of complex structured data found in treebanks and in corpora with sophisticated annotation than for tokenized texts. Recent developments in data mining, particularly in the extraction of frequent subtrees from treebanks, offer some solutions. We have applied a modified version of the TreeMiner algorithm to a small treebank and present some promising results.