A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese

  • Authors:
  • Makoto Nagao; Shinsuke Mori

  • Affiliations:
  • Kyoto University; Kyoto University

  • Venue:
  • COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
  • Year:
  • 1994

Abstract

In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model of natural language. The core of this idea is to calculate the frequencies of strings of n characters (n-grams), but such statistical analysis had never been carried out for large text data and large n because of computer memory limitations and a shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for computing n-gram statistics of large text data for arbitrarily large n, and successfully calculated, within a relatively short time, n-grams of Japanese text corpora containing between two and thirty million characters. This experiment made it clear that words, compound words, and collocations can be extracted automatically by comparing n-gram statistics across different values of n.
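The counting the abstract describes can be sketched in miniature. The paper's actual method works on a sorted table of pointers into the text (in essence a suffix array), which groups all occurrences of every prefix together; the function below is a simplified, hypothetical illustration of that idea rather than the authors' implementation:

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count all character n-grams in text via a sorted suffix list.

    Sorting the suffixes groups identical n-character prefixes
    together, so each distinct n-gram's frequency can be read off in
    a single pass. A real suffix array would store only start offsets
    (one integer per position) instead of the suffix strings; strings
    are used here purely for clarity.
    """
    suffixes = sorted(text[i:] for i in range(len(text) - n + 1))
    counts = Counter()
    for s in suffixes:
        counts[s[:n]] += 1
    return counts
```

For example, `ngram_counts("abracadabra", 2)` yields a count of 2 for the bigrams "ab", "br", and "ra". Comparing such tables for successive n, as the abstract suggests, lets one spot strings whose frequency stays stable as n grows, which is a signal that the string is a word or collocation rather than a fragment.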