In establishing information theory, C. E. Shannon proposed the Markov process as a good model for characterizing natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but such statistical analysis had never been carried out on large text data for large n because of computer memory limitations and a shortage of machine-readable text. Taking advantage of recent powerful computers, we developed a new algorithm that computes n-gram statistics of large text data for arbitrarily large n, and used it to calculate, within a relatively short time, the n-grams of several Japanese texts containing between two and thirty million characters. This experiment made clear that words, compound words, and collocations can be extracted or identified automatically by comparing n-gram statistics across different values of n.
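The extraction idea described above can be illustrated with a minimal sketch. This is not the authors' actual algorithm (which is designed to scale to tens of millions of characters and arbitrarily large n); it is a naive hash-table version, and the `candidate_units` heuristic — keep an n-gram whose frequency strictly drops under every (n+1)-character extension — is one assumed reading of "mutually comparing n-gram statistics for different values of n":

```python
from collections import Counter

def ngram_counts(text, n):
    """Frequencies of all character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def candidate_units(text, n, min_freq=2):
    """Hypothetical heuristic: an n-gram is a candidate word or
    collocation if it is frequent, yet every (n+1)-gram that extends
    it on either side is strictly less frequent, i.e. its frequency
    'drops' whenever it is lengthened."""
    counts_n = ngram_counts(text, n)
    counts_n1 = ngram_counts(text, n + 1)
    candidates = []
    for gram, freq in counts_n.items():
        if freq < min_freq:
            continue
        extensions = [c for g1, c in counts_n1.items()
                      if g1.startswith(gram) or g1.endswith(gram)]
        if all(c < freq for c in extensions):
            candidates.append(gram)
    return candidates

# Example: "xx" recurs in varied contexts, so no single extension
# matches its frequency and it survives as a candidate unit.
print(candidate_units("xxabyxxczxxab", 2))
```

A realistic implementation for corpora of this size would replace the hash tables with a sorted suffix table so that all n-gram counts for every n are obtained in a single pass over adjacent suffixes.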