A statistical method for extracting uninterrupted and interrupted collocations from very large corpora

Authors:
Satoru Ikehara; Satoshi Shirai; Hajime Uchino
Affiliations:
NTT Communication Science Laboratories, Yokosuka-shi, Japan; NTT Communication Science Laboratories, Yokosuka-shi, Japan; NTT Communication Science Laboratories, Yokosuka-shi, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Year:
1996

Abstract

In order to extract rigid expressions with a high frequency of use, a new algorithm that can efficiently extract both uninterrupted and interrupted collocations from very large corpora is proposed.

The statistical method recently proposed for calculating N-grams of arbitrary N can be applied to the extraction of uninterrupted collocations. However, that method extracts such a large volume of fractional and unnecessary expressions that it was impossible to extract interrupted collocations by combining its results. To solve this problem, this paper proposes a new algorithm that restrains the extraction of unnecessary substrings, followed by a method that makes it possible to extract interrupted collocations.

The new methods were applied to Japanese newspaper articles comprising 8.92 million characters. For uninterrupted collocations with a string length of 2 or more characters and a frequency of 2 or more, the N-gram method extracted 4.4 million types of expressions (total frequency of 31.2 million). In contrast, the new method reduced this to 0.97 million types (total frequency of 2.6 million), a substantial reduction in fractional and unnecessary expressions. For interrupted collocations, combining the substrings with a frequency of 10 or more extracted by the first method yielded 6.5 thousand types of substring pairs with a total frequency of 21.8 thousand.
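
The abstract gives no pseudocode, so the following Python sketch only illustrates the general idea and is not the authors' algorithm. It assumes a simplified fragment rule (a string is discarded when a string one character longer contains it and occurs exactly as often) and a simplified pairing rule (two retained strings form an interrupted pair when they appear in order, separated by at least one character, in the same sentence). The function names `ngram_counts`, `drop_fragments`, and `interrupted_pairs`, and all thresholds, are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, min_len=2, max_len=30):
    """Count every character n-gram of length min_len..max_len (brute force)."""
    counts = Counter()
    for s in sentences:
        for n in range(min_len, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def drop_fragments(counts, min_freq=2):
    """Keep frequent strings that are not mere fragments of a longer string.

    Assumed rule (a simplification): a string is a fragment if a string one
    character longer contains it and has exactly the same frequency.
    """
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    fragments = set()
    for g, c in frequent.items():
        for sub in (g[:-1], g[1:]):
            if frequent.get(sub) == c:
                fragments.add(sub)
    return {g: c for g, c in frequent.items() if g not in fragments}


def interrupted_pairs(sentences, units, min_freq=2):
    """Count ordered pairs of units separated by a gap within one sentence.

    Only the first occurrence of each unit per sentence is considered
    (another simplification made for brevity).
    """
    pair_counts = Counter()
    for s in sentences:
        found = [(s.find(u), u) for u in units if u in s]
        for i, a in found:
            for j, b in found:
                if i + len(a) < j:  # at least one character between a and b
                    pair_counts[(a, b)] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_freq}


if __name__ == "__main__":
    corpus = [
        "not only 1 but also 2",
        "not only 3 but also 4",
        "not only 5 but also 6",
    ]
    kept = drop_fragments(ngram_counts(corpus))
    print(sorted(kept.items()))
    print(interrupted_pairs(corpus, kept))
```

On this toy corpus the fragment filter discards strings such as "not onl", which never occur outside "not only ", and the pairing step recovers "not only " followed later by " but also ". The brute-force counting is quadratic in sentence length, so a corpus of the size reported above would need a far more efficient data structure than this sketch uses.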