This paper introduces a scheme, which we call the baseline method, for defining measures of term representativeness, together with measures defined using that scheme. The representativeness of a term is measured by a normalized characteristic value computed over the set of all documents that contain the term. Normalization compares the original characteristic value with the characteristic value expected for a randomly chosen document set of the same size. The latter value is estimated by a baseline function obtained through random sampling and log-linear approximation. We found that the distance between the word distribution in a document set and the word distribution in the whole corpus is an effective characteristic value for the baseline method. Measures defined by the baseline method have several advantages: they can be used to compare the representativeness of two terms with very different frequencies, and they have well-defined threshold values for deciding whether a term is representative. In addition, the baseline function estimated for one corpus is robust against differences in corpora; that is, it can be used for normalization in another corpus of a different size or from a different domain.
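The scheme described above can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details given in the abstract: the distance is taken to be the Kullback-Leibler divergence between word distributions, document-set size is counted in words, normalization is a plain ratio of the observed value to the baseline value, and the toy corpus and all function names (`word_dist`, `distance`, `fit_baseline`, `representativeness`, `make_toy_corpus`) are hypothetical.

```python
import math
import random
from collections import Counter


def word_dist(docs):
    """Return (relative word frequencies, total word count) for a document set."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, total


def distance(docs, corpus_dist):
    """Assumed characteristic value: KL divergence of subset from corpus distribution."""
    sub_dist, size = word_dist(docs)
    return sum(p * math.log(p / corpus_dist[w]) for w, p in sub_dist.items()), size


def fit_baseline(corpus, corpus_dist, doc_counts, trials=20, seed=0):
    """Fit log(distance) = a + b * log(size) on randomly sampled document sets."""
    rng = random.Random(seed)
    xs, ys = [], []
    for k in doc_counts:
        for _ in range(trials):
            d, size = distance(rng.sample(corpus, k), corpus_dist)
            if d > 0:
                xs.append(math.log(size))
                ys.append(math.log(d))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda size: math.exp(intercept + slope * math.log(size))


def representativeness(term, corpus, corpus_dist, baseline):
    """Observed distance for the term's documents divided by the baseline value."""
    docs = [doc for doc in corpus if term in doc]
    d, size = distance(docs, corpus_dist)
    return d / baseline(size)


def make_toy_corpus(seed=1):
    """Hypothetical corpus: 30 topical documents about 'genome', 270 general ones."""
    rng = random.Random(seed)
    general = [f"w{i}" for i in range(50)]
    topical = [f"t{i}" for i in range(10)]
    corpus = []
    for i in range(300):
        doc = [rng.choice(topical if i < 30 else general) for _ in range(20)]
        if i < 30:
            doc.append("genome")  # topic-specific term
        if i % 3 == 0:
            doc.append("thing")   # term spread evenly over all topics
        corpus.append(doc)
    return corpus


corpus = make_toy_corpus()
corpus_dist, _ = word_dist(corpus)
baseline = fit_baseline(corpus, corpus_dist, doc_counts=[10, 20, 50, 100])
scores = {t: representativeness(t, corpus, corpus_dist, baseline) for t in ("genome", "thing")}
print(scores)
```

Under this ratio-based normalization, a score near 1 would indicate a document set that is no more distinctive than a random one of the same size, which is one plausible reading of the threshold property mentioned above. Because the baseline depends only on set size, the scores of a rare term and a frequent term remain comparable.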