A measure of term representativeness based on the number of co-occurring salient words

Authors:
Toru Hisamitsu; Yoshiki Niwa
Affiliations:
Central Research Laboratory, Hitachi, Ltd., Hatoyama, Saitama, Japan (both authors)
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Abstract

We propose a novel measure of the representativeness (i.e., indicativeness or topic specificity) of a term in a given corpus. The measure embodies the idea that the distribution of words co-occurring with a representative term should be biased relative to the word distribution in the whole corpus. This bias is quantified as the number of distinct words whose occurrences are saliently biased among the co-occurring words. The saliency of a word is determined by a threshold probability that can be set automatically from the whole corpus. A comparative evaluation showed that the measure is clearly superior to conventional measures in finding topic-specific words in newspaper archives of different sizes.
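
The counting idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how saliency is tested, so everything specific here is an assumption made for illustration: co-occurrence is taken at the document level, a word counts as salient when a binomial upper-tail probability against the corpus-wide word frequency falls below a threshold, and the threshold is a fixed, hand-picked 0.01 rather than one derived automatically from the corpus. The function names `binomial_tail` and `salient_cooccurrence_count` are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p). Suitable for small n only."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def salient_cooccurrence_count(docs, term, threshold=0.01):
    """Count distinct words that are saliently over-represented near `term`.

    docs: list of tokenized documents (lists of strings).
    Assumption: a word co-occurs with `term` if it appears in the same document.
    Assumption: saliency is a binomial tail test with a fixed threshold.
    """
    # Word frequencies over the whole corpus.
    corpus_counts = Counter(w for d in docs for w in d)
    corpus_size = sum(corpus_counts.values())

    # Word frequencies in documents containing the term, excluding the term itself.
    cooc_counts = Counter(w for d in docs if term in d for w in d if w != term)
    n = sum(cooc_counts.values())

    salient = 0
    for w, k in cooc_counts.items():
        p = corpus_counts[w] / corpus_size  # expected rate under the corpus distribution
        if binomial_tail(k, n, p) < threshold:
            salient += 1
    return salient
```

Under these assumptions, a topic-specific term should pull in words that are rare elsewhere and so score higher than a function word whose co-occurring words simply mirror the corpus. The exact tail sum is only practical for toy data; a real corpus would need a numerically stable approximation.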