A method of measuring term representativeness: baseline method using co-occurrence distribution
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
We propose a novel measure of the representativeness (i.e., indicativeness or topic specificity) of a term in a given corpus. The measure embodies the idea that the distribution of words co-occurring with a representative term should be biased relative to the word distribution of the whole corpus. The bias is quantified as the number of distinct words whose occurrences among the co-occurring words are saliently biased; the saliency of a word is defined by a threshold probability that can be determined automatically from the whole corpus. Comparative evaluation showed that the measure is clearly superior to conventional measures in finding topic-specific words in newspaper archives of different sizes.
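The idea above can be sketched in code. This is a minimal illustration, not the paper's actual method: it assumes whole-document co-occurrence, a binomial tail test for saliency, and a fixed significance threshold `alpha` (the paper derives its threshold automatically from the corpus); the function names and the test itself are assumptions for illustration.

```python
from collections import Counter
from math import lgamma, exp, log

def binom_sf(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p), summed in log space for stability.
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)
    return min(total, 1.0)

def representativeness(term, docs, alpha=0.01):
    """Count distinct words whose frequency among the words co-occurring
    with `term` is saliently higher than expected from the whole corpus.
    `docs` is a list of tokenized documents; `alpha` is an assumed fixed
    threshold standing in for the automatically derived one."""
    corpus = Counter()   # word counts over the whole corpus
    cooc = Counter()     # word counts over documents containing `term`
    for doc in docs:
        corpus.update(doc)
        if term in doc:
            cooc.update(w for w in doc if w != term)
    n_corpus = sum(corpus.values())
    n_cooc = sum(cooc.values())
    if n_cooc == 0:
        return 0
    salient = 0
    for w, k in cooc.items():
        p = corpus[w] / n_corpus  # baseline probability in the whole corpus
        if binom_sf(k, n_cooc, p) < alpha:
            salient += 1
    return salient
```

A topic-specific term (one whose co-occurrence distribution is strongly skewed toward a few words) yields a higher count than a general term whose co-occurring words mirror the corpus-wide distribution.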