Grammatical category disambiguation by statistical optimization
Computational Linguistics
A stochastic parts program and noun phrase parser for unrestricted text
ANLC '88 Proceedings of the second conference on Applied natural language processing
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Word-sense disambiguation using statistical models of Roget's categories trained on large corpora
COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
The effects of lexical specialization on the growth curve of the vocabulary
Computational Linguistics
The Balancing Act, Judith L. Klavans and Philip Resnik
Journal of Logic, Language and Information
Tagging with Small Training Corpora
IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
Automatic rule induction for unknown-word guessing
Computational Linguistics
Lexical rules in constraint-based grammars
Computational Linguistics
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Guessing parts-of-speech of unknown words using global information
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Robust PCFG-based generation using automatically acquired LFG approximations
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Accurate and robust LFG-based generation for Chinese
INLG '08 Proceedings of the Fifth International Natural Language Generation Conference
Predicting the semantic compositionality of prefix verbs
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Hi-index | 0.00 |
Given a form that is previously unseen in a sufficiently large training corpus, and that is morphologically n-ways ambiguous (serves n different lexical functions) what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena---the forms that occur exactly once in a corpus; in particular, a hapax-based estimator is better than one based on the proportion of the various functions among words of all frequency ranges. As we shall argue, this is because when one computes an overall measure, one is including high-frequency words, and high-frequency words tend to have idiosyncratic properties that are not at all representative of the much larger mass of (productively formed) low-frequency words. This result has potential importance for various kinds of applications requiring lexical disambiguation, including, in particular, stochastic taggers. This is especially true when some initial hand-tagging of a corpus is required: for predicting lexical priors for very low-frequency morphologically ambiguous types (most of which would not occur in any given corpus), one should concentrate on tagging a good representative sample of the hapax legomena, rather than extensively tagging words of all frequency ranges.