Estimating lexical priors for low-frequency morphologically ambiguous forms

Authors:
Harald Baayen;Richard Sproat
Affiliations:
Max Planck Institute for Psycholinguistics;Bell Laboratories
Venue:
Computational Linguistics
Year:
1996

Citing 4
Cited 11

Grammatical category disambiguation by statistical optimization

Computational Linguistics
A stochastic parts program and noun phrase parser for unrestricted text

ANLC '88 Proceedings of the second conference on Applied natural language processing
Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Word-sense disambiguation using statistical models of Roget's categories trained on large corpora

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2

The effects of lexical specialization on the growth curve of the vocabulary

Computational Linguistics
The Balancing Act, Judith L. Klavans and Philip Resnik

Journal of Logic, Language and Information
Tagging with Small Training Corpora

IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
Automatic rule induction for unknown-word guessing

Computational Linguistics
Lexical rules in constraint-based grammars

Computational Linguistics
A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Guessing parts-of-speech of unknown words using global information

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Robust PCFG-based generation using automatically acquired LFG approximations

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Accurate and robust LFG-based generation for Chinese

INLG '08 Proceedings of the Fifth International Natural Language Generation Conference
Predicting the semantic compositionality of prefix verbs

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Given a form that is previously unseen in a sufficiently large training corpus, and that is morphologically n-ways ambiguous (serves n different lexical functions) what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena---the forms that occur exactly once in a corpus; in particular, a hapax-based estimator is better than one based on the proportion of the various functions among words of all frequency ranges. As we shall argue, this is because when one computes an overall measure, one is including high-frequency words, and high-frequency words tend to have idiosyncratic properties that are not at all representative of the much larger mass of (productively formed) low-frequency words. This result has potential importance for various kinds of applications requiring lexical disambiguation, including, in particular, stochastic taggers. This is especially true when some initial hand-tagging of a corpus is required: for predicting lexical priors for very low-frequency morphologically ambiguous types (most of which would not occur in any given corpus), one should concentrate on tagging a good representative sample of the hapax legomena, rather than extensively tagging words of all frequency ranges.