Varro: an algorithm and toolkit for regular structure discovery in treebanks

Authors:
Scott Martens
Affiliations:
KU Leuven
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010

Abstract

The Varro toolkit is a system for identifying and counting a major class of regularity in treebanks and in annotated natural language data that takes the form of tree structures: frequently recurring unordered subtrees. The software is designed for use in linguistics and to apply as broadly as possible to existing treebanks and other stores of tree-structurable natural language data. It minimizes memory use so that moderately large treebanks are tractable on commonly available computer hardware. This article introduces condensed canonically ordered trees as a data structure for efficiently discovering frequently recurring unordered subtrees.
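
To make the idea of canonical ordering concrete, here is a minimal Python sketch. It is not Varro's algorithm and does not implement the condensed trees the article introduces. It only illustrates the underlying principle: once each node's children are put in a fixed order, two unordered trees that differ only in sibling order map to the same key and can be counted together. The tree encoding as `(label, children)` pairs, the function names, and the restriction to complete (bottom-up) subtrees are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical encoding: a tree is a (label, children) pair,
# where children is a list of trees. Labels are assumed not to
# contain parentheses.

def canonical(tree):
    """Return a canonical string for an unordered labelled tree."""
    label, children = tree
    # Sorting the children's canonical strings makes sibling order irrelevant.
    parts = sorted(canonical(child) for child in children)
    return "(" + label + "".join(parts) + ")"

def count_subtrees(treebank):
    """Count every complete (bottom-up) subtree, keyed by canonical string."""
    counts = Counter()

    def visit(tree):
        counts[canonical(tree)] += 1
        for child in tree[1]:
            visit(child)

    for tree in treebank:
        visit(tree)
    return counts

def frequent_subtrees(treebank, min_count=2):
    """Keep only the canonical forms seen at least min_count times."""
    counts = count_subtrees(treebank)
    return {form: n for form, n in counts.items() if n >= min_count}

# Two toy parse trees that differ only in sibling order.
t1 = ("S", [("NP", [("Det", []), ("N", [])]), ("VP", [("V", [])])])
t2 = ("S", [("VP", [("V", [])]), ("NP", [("N", []), ("Det", [])])])

if __name__ == "__main__":
    for form, n in sorted(frequent_subtrees([t1, t2]).items()):
        print(n, form)
```

This sketch recomputes canonical strings at every node and keeps all counts in memory, so it is suitable only for small examples. A toolkit aimed at moderately large treebanks would need a more compact representation, which is the role the abstract assigns to condensed canonically ordered trees.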