Extracting the lowest-frequency words: pitfalls and possibilities

Authors:
Marc Weeber; R. Harald Baayen; Rein Vos
Affiliations:
University of Groningen; Max Planck Institute for Psycholinguistics; University of Groningen, University of Maastricht
Venue:
Computational Linguistics
Year:
2000



Abstract

In a medical information extraction system, we use common word association techniques to extract side-effect-related terms. Many of these terms have a frequency of less than five. Standard word-association-based applications disregard the lowest-frequency words and hence discard useful information. We therefore devised an extraction system that covers the full word frequency range. This system computes the significance of association with the log-likelihood ratio and Fisher's exact test. The output of the system shows a recurrent, corpus-independent pattern in both recall and the number of significant words. We explain these patterns by the statistical behavior of the lowest-frequency words. We use Dutch verb-particle combinations as a second, independent collocation extraction application to illustrate the generality of the observed phenomena. We conclude that (a) word-association-based extraction systems can be enhanced by also considering the lowest-frequency words, (b) significance levels should not be fixed but adjusted for the optimal window size, (c) hapax legomena (words occurring only once) should be disregarded a priori in the statistical analysis, and (d) the distribution of the targets to extract should be considered in combination with the extraction method.
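
The two significance measures named in the abstract can be illustrated on a 2x2 contingency table. The following is a minimal sketch, not the authors' implementation. It assumes a table layout in which a = occurrences of the word inside the window around the target, b = occurrences of the word outside it, c = other tokens inside the window, and d = other tokens outside it. The function names and the choice of a one-sided Fisher test are assumptions made for illustration.

```python
from math import comb, log


def log_likelihood_ratio(a, b, c, d):
    """G^2 statistic for the 2x2 table [[a, b], [c, d]] (hypothetical helper)."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * log(o / expected)
    return 2.0 * total


def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher p-value: P(X >= a) with all margins held fixed.

    Assumption: the one-sided variant is used here for illustration only.
    """
    n = a + b + c + d
    row1 = a + b  # total frequency of the word
    col1 = a + c  # total number of tokens inside the window
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


# A hapax legomenon that happens to fall inside the window:
# 1 occurrence in total, 10 of 1000 tokens inside the window.
p_hapax = fisher_exact_one_sided(1, 0, 9, 990)
g2_hapax = log_likelihood_ratio(1, 0, 9, 990)
```

Under this layout, a word that occurs once and falls inside the window gets a one-sided Fisher p-value equal to the fraction of tokens inside the window (10/1000 = 0.01 here), so its apparent significance depends only on the window size. This is one way to see why hapax legomena and fixed significance levels need care at the low end of the frequency range.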