Multiway-tree retrieval based on treegrams

Authors:
Hans Argenton; Ulrich Güntzer
Affiliations:
Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Tübingen, Germany (both authors)
Venue:
ADBIS'97: Proceedings of the First East-European Conference on Advances in Databases and Information Systems
Year:
1997

Abstract

Large tree databases are becoming increasingly important as knowledge repositories. Prominent examples are the treebanks of computational linguistics: text corpora of up to five million words tagged with syntactic information. Such large amounts of structured data pose the problem of fast tree retrieval: given a database T of labeled multiway trees and a query tree q, efficiently find all trees t ∈ T that contain q as a subtree. This paper presents a generalization of the classical n-gram indexing technique that supports fast retrieval of multiway tree structures. Treegram indexing covers database trees with subtrees of fixed height; each entry of the resulting index represents one such subtree together with the database trees that contain it. Evaluating a query q first preselects the database trees that contain all of q's cover trees and then tests each of these candidates rigorously for containment of q. As an application of treegram indexing, we describe the VENONA retrieval system, which handles the BHt treebank. This treebank contains 508,650 phrase structure trees found in the morphosyntactical analysis of the Old Testament, with 3.3 million wordforms altogether; it is the result of a computational-linguistics project at the Ludwig Maximilian University of Munich.
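
The filter-and-verify scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the VENONA implementation, and several things in it are assumptions made only for illustration: trees are written as `(label, [children])` pairs; "t contains q" is taken to mean that q matches at some node of t with equal labels and equal child counts at every inner query node, while a query leaf may match any node carrying the same label; and a treegram is the top part of a subtree cut at exactly `h` levels, skipped when a branch ends earlier. All function names (`treegrams`, `build_index`, `preselect`, `retrieve`) are hypothetical.

```python
from collections import defaultdict

def nodes(tree):
    """Yield every node (as a subtree) of a (label, [children]) tree."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def truncate(tree, h):
    """Top h levels of tree as a hashable key, or None if a branch ends early."""
    label, children = tree
    if h == 0:
        return (label, ())
    if not children:
        return None
    parts = [truncate(child, h - 1) for child in children]
    if any(p is None for p in parts):
        return None
    return (label, tuple(parts))

def treegrams(tree, h):
    """Set of all height-h treegrams occurring anywhere in tree."""
    grams = (truncate(n, h) for n in nodes(tree))
    return {g for g in grams if g is not None}

def build_index(db, h):
    """Map each treegram to the ids of the database trees containing it."""
    index = defaultdict(set)
    for tid, tree in enumerate(db):
        for g in treegrams(tree, h):
            index[g].add(tid)
    return index

def preselect(index, n_trees, q, h):
    """Ids of trees that contain all of q's treegrams (a superset of the answers)."""
    grams = treegrams(q, h)
    if not grams:  # query too shallow to filter: every tree stays a candidate
        return set(range(n_trees))
    return set.intersection(*(index.get(g, set()) for g in grams))

def matches_at(q, t):
    """Exact test: does q match with its root placed on the root of t?"""
    (q_label, q_children), (t_label, t_children) = q, t
    if q_label != t_label:
        return False
    if not q_children:  # query leaf: anything may hang below in t
        return True
    if len(q_children) != len(t_children):
        return False
    return all(matches_at(a, b) for a, b in zip(q_children, t_children))

def contains(t, q):
    """Exact containment test under the assumed notion of containment."""
    return any(matches_at(q, n) for n in nodes(t))

def retrieve(db, index, q, h):
    """Preselect candidates through the index, then verify each one exactly."""
    candidates = preselect(index, len(db), q, h)
    return sorted(tid for tid in candidates if contains(db[tid], q))

H = 1
DB = [
    ("S", [("NP", [("N", [])]), ("VP", [("V", []), ("NP", [("N", [])])])]),
    ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])]),
    ("NP", [("N", [])]),
    ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("NP", [("N", [])])])]),
]
INDEX = build_index(DB, H)

QUERY = ("S", [("NP", [("N", [])]), ("VP", [])])
print(sorted(preselect(INDEX, len(DB), QUERY, H)))  # [0, 3]
print(retrieve(DB, INDEX, QUERY, H))                # [0]
```

In this toy database, tree 3 contains both of the query's treegrams but in the wrong arrangement, so it survives preselection and is rejected only by the exact test. Under the assumed containment notion the filter never drops a true answer, because every treegram of the query also occurs at the matching position in any tree that contains the query.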