A method of measuring term representativeness: baseline method using co-occurrence distribution

Authors:
Toru Hisamitsu;Yoshiki Niwa;Jun-ichi Tsujii
Affiliations:
Central Research Laboratory, Hitachi, Ltd., Saitama, Japan;Central Research Laboratory, Hitachi, Ltd., Saitama, Japan;University of Tokyo, Tokyo, Japan
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Year:
2000

Abstract

This paper introduces a scheme, which we call the baseline method, for defining measures of term representativeness, together with measures defined by using the scheme. The representativeness of a term is measured by a normalized characteristic value defined for the set of all documents that contain the term. Normalization is done by comparing the original characteristic value with the characteristic value defined for a randomly chosen document set of the same size. The latter value is estimated by a baseline function obtained by random sampling and logarithmic linear approximation. We found that the distance between the word distribution in a document set and the word distribution in the whole corpus is an effective characteristic value to use with the baseline method. Measures defined by the baseline method have several advantages: they can be used to compare the representativeness of two terms with very different frequencies, and they have well-defined threshold values for being representative. In addition, the baseline function is robust against differences in corpora; that is, a baseline function obtained from one corpus can be used for normalization in a different corpus that has a different size or is in a different domain.
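
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the distance is taken to be the Kullback-Leibler divergence of the document-set word distribution from the corpus word distribution, the "size" of a document set is taken to be its total token count, the normalized value is taken to be the ratio of the observed distance to the baseline estimate, and all function names (`word_distribution`, `distance`, `fit_baseline`, `representativeness`) are hypothetical. Documents are represented as lists of tokens.

```python
import math
import random
from collections import Counter


def word_distribution(docs):
    """Relative word frequencies over a list of tokenized documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def distance(docs, corpus_dist):
    """Assumed characteristic value: KL divergence of the document-set
    distribution from the corpus distribution. It is finite because the
    document set is drawn from the corpus itself."""
    sub = word_distribution(docs)
    return sum(p * math.log(p / corpus_dist[w]) for w, p in sub.items())


def fit_baseline(corpus, corpus_dist, sample_sizes, rng):
    """Estimate the baseline function by random sampling, then fit a
    straight line to log(distance) against log(size) by least squares."""
    xs, ys = [], []
    for n in sample_sizes:
        sample = rng.sample(corpus, n)
        size = sum(len(d) for d in sample)  # assumption: size = token count
        dist = distance(sample, corpus_dist)
        if dist > 0:
            xs.append(math.log(size))
            ys.append(math.log(dist))
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct sample sizes")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda size: math.exp(intercept + slope * math.log(size))


def representativeness(term, corpus, corpus_dist, baseline):
    """Assumed normalization: observed distance for the documents that
    contain the term, divided by the baseline estimate at the same size."""
    docs = [d for d in corpus if term in d]
    if not docs:
        raise ValueError("term does not occur in the corpus")
    size = sum(len(d) for d in docs)
    return distance(docs, corpus_dist) / baseline(size)
```

In this sketch, a term that occurs in every document receives a value of zero, because its document set is the whole corpus, while a term concentrated in documents with a distinctive vocabulary receives a value well above what random document sets of the same size produce. The baseline function is fitted once per corpus and reused for every term.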