Text analysis meets computational lexicography

  • Authors: Hannah Kermes
  • Affiliation: Institut für Maschinelle Sprachverarbeitung, Stuttgart, Germany
  • Venue: COLING '04: Proceedings of the 20th International Conference on Computational Linguistics
  • Year: 2004

Abstract

More and more text corpora are available electronically. They contain information about the linguistic and lexicographic properties of words and word combinations. The amount of data is too large to extract this information manually, so we need means for (semi-)automatic processing, i.e., we need to analyse the text in order to extract the relevant information.

The question is what the requirements for a text analysis tool are, and whether existing systems meet the needs of lexicographic acquisition. The hypothesis is that the better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.

For the application as an analysis tool in computational lexicography, a symbolic chunker with a hand-written grammar seems to be a good choice. The available chunkers for German, however, do not provide all of the additional information needed for this task, such as head lemma, morpho-syntactic information, and lexical or semantic properties, which are useful if not necessary for extraction processes. We therefore decided to build a recursive chunker for unrestricted German text within the framework of the IMS Corpus Workbench (CWB).
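To make the hypothesis concrete, that richer off-line annotation (chunk category, head lemma, morpho-syntactic features such as case) simplifies and speeds up on-line extraction, the following is a minimal, purely illustrative Python sketch. The chunk records, attribute names, and example sentences are assumptions made for demonstration; they do not reproduce the paper's CWB annotation scheme or grammar.

    from collections import Counter

    # Hypothetical chunk records, as a detailed off-line annotation might
    # provide them: chunk category, head lemma, and case of noun chunks.
    # (Attribute names are illustrative, not those of the actual annotation.)
    sentence_chunks = [
        [
            {"cat": "NC", "head": "Firma", "case": "nom"},
            {"cat": "VC", "head": "treffen"},
            {"cat": "NC", "head": "Entscheidung", "case": "acc"},
        ],
        [
            {"cat": "NC", "head": "Vorstand", "case": "nom"},
            {"cat": "VC", "head": "treffen"},
            {"cat": "NC", "head": "Entscheidung", "case": "acc"},
        ],
    ]

    def verb_object_pairs(chunks):
        """Yield (verb lemma, accusative NP head lemma) pairs for one sentence."""
        verbs = [c["head"] for c in chunks if c["cat"] == "VC"]
        objects = [c["head"] for c in chunks
                   if c["cat"] == "NC" and c.get("case") == "acc"]
        for v in verbs:
            for o in objects:
                yield (v, o)

    counts = Counter(pair for s in sentence_chunks
                     for pair in verb_object_pairs(s))
    for (verb, noun), freq in counts.most_common():
        print(f"{verb} + {noun} (acc): {freq}")
    # -> treffen + Entscheidung (acc): 2

Because head lemma and case are already present in the annotation, the extraction step reduces to a simple filter-and-count over chunks, which is the kind of lexicographic acquisition task (here, candidate verb-object combinations such as "eine Entscheidung treffen") that the abstract has in mind. Without that annotation, the same query would need to re-derive heads and case on-line for every search.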