To extract collocations from the Internet, the cohesion between potential collocates must be estimated numerically. The Mutual Information (MI) cohesion measure, based on the numbers of collocates occurring close together (N12) and apart (N1, N2), is well known, but Web page statistics deprive MI of its statistical validity. We propose a family of different measures that depend on N1, N2 and N12 in a similarly monotonic way and retain the scalability feature of MI. We apply the new criteria to a collection of N1, N2 and N12 values obtained from AltaVista for links between a few tens of English nouns and several hundred of their modifiers taken from the Oxford Collocations Dictionary. Pairs of a noun with one of its own adjectives are true collocations, and their measure values form one distribution. Pairs of a noun with an alien adjective (one listed for a different noun) are false collocations, and their measure values form another distribution. A discriminating threshold is sought that minimizes the sum of the probabilities of the two possible types of error. The resolving power of a criterion is defined as the minimum of this sum. The best criterion, the one delivering the minimum minimorum, is identified.
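The threshold search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The MI formula is the common pointwise form, the parameter `n_total` (an overall page count) is an assumption not given in the abstract, and the function names and toy scores are hypothetical.

```python
import math


def mutual_information(n1, n2, n12, n_total):
    """Pointwise MI from page counts (assumed standard form).

    n_total is an assumed overall page count; the abstract does not specify it.
    """
    if n12 == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(n_total * n12 / (n1 * n2))


def resolving_power(true_scores, false_scores):
    """Find the threshold minimizing miss rate + false-alarm rate.

    A pair is accepted as a collocation when its score >= threshold.
    Returns (minimal error sum, threshold achieving it).
    """
    # Every distinct observed score is a candidate cut point;
    # +inf covers the "reject everything" case.
    candidates = sorted(set(true_scores) | set(false_scores))
    candidates.append(float("inf"))

    best_err, best_thr = None, None
    for thr in candidates:
        # Type 1 error: true collocations falling below the threshold.
        miss = sum(s < thr for s in true_scores) / len(true_scores)
        # Type 2 error: false collocations reaching the threshold.
        false_alarm = sum(s >= thr for s in false_scores) / len(false_scores)
        err = miss + false_alarm
        if best_err is None or err < best_err:
            best_err, best_thr = err, thr
    return best_err, best_thr


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    own = [3.0, 4.0, 5.0]    # noun with its own adjectives
    alien = [1.0, 2.0, 3.5]  # noun with alien adjectives
    print(resolving_power(own, alien))
```

Under these assumptions, comparing criteria amounts to calling `resolving_power` once per measure on the same set of pairs and keeping the measure with the smallest error sum. The scalability property mentioned above corresponds to the MI value staying unchanged when all four counts are multiplied by the same factor.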