IRASubcat, a highly customizable, language independent tool for the acquisition of verbal subcategorization information from corpus

Authors:
Ivana Romina Altamirano;Laura Alonso i Alemany
Affiliations:
Universidad Nacional de Córdoba, Córdoba, Argentina;Universidad Nacional de Córdoba, Córdoba, Argentina
Venue:
YIWCALA '10 Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas
Year:
2010

Abstract

IRASubcat is a language-independent tool for acquiring verb subcategorization information from corpora. It can extract information from corpora annotated at various levels, including almost raw text in which only verbs are identified. It can also aggregate information from a pre-existing lexicon that contains verbal subcategorization information. The system is highly customizable and uses XML as its input and output format. IRASubcat identifies patterns of constituents in the corpus and associates a pattern with a verb if the pair occurs more often than a frequency threshold and the association passes a likelihood ratio hypothesis test. It also implements a procedure to identify verbal constituents that could be playing the role of an adjunct in a pattern. The thresholds that control frequency and the identification of adjuncts can be set by the user; otherwise they take default values.
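
The association step described in the abstract can be illustrated with a minimal sketch. This is not IRASubcat's code: the function names, the `min_count` default, and the use of the chi-square critical value 3.84 (one degree of freedom, p = 0.05) are assumptions made for illustration, and the input is assumed to be an already extracted list of (verb, pattern) pairs rather than the tool's XML format. The sketch applies a frequency filter and then a log-likelihood ratio (G²) test on a 2x2 contingency table.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: target verb with target pattern     k12: target verb, other patterns
    k21: other verbs with target pattern     k22: other verbs, other patterns
    """
    n = k11 + k12 + k21 + k22
    if n == 0:
        return 0.0
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = [
        (k11, row1 * col1 / n),
        (k12, row1 * col2 / n),
        (k21, row2 * col1 / n),
        (k22, row2 * col2 / n),
    ]
    # Cells with a zero observed count contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)


def associate_patterns(observations, min_count=2, critical_value=3.84):
    """Return {verb: [patterns]} for pairs that pass both filters.

    min_count and critical_value are hypothetical defaults, not the tool's.
    """
    pair_counts = Counter(observations)
    verb_counts = Counter(verb for verb, _ in observations)
    pattern_counts = Counter(pattern for _, pattern in observations)
    total = len(observations)

    accepted = {}
    for (verb, pattern), k11 in pair_counts.items():
        if k11 < min_count:  # frequency threshold
            continue
        k12 = verb_counts[verb] - k11
        k21 = pattern_counts[pattern] - k11
        k22 = total - k11 - k12 - k21
        expected = verb_counts[verb] * pattern_counts[pattern] / total
        g2 = log_likelihood_ratio(k11, k12, k21, k22)
        # Keep only positive associations that exceed the critical value.
        if k11 > expected and g2 > critical_value:
            accepted.setdefault(verb, []).append(pattern)
    return accepted


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = (
        [("give", "NP PP")] * 8 + [("give", "NP")] * 2
        + [("sleep", "ADV")] * 8 + [("sleep", "NP")] * 2
    )
    print(associate_patterns(toy))
```

On the toy data, "NP" occurs equally often with both verbs, so its observed and expected counts coincide and it is not associated with either verb. Raising `min_count` or `critical_value` makes the filter stricter, which mirrors the user-adjustable thresholds mentioned in the abstract.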