n-Gram Statistics for Natural Language Understanding and Text Processing

  • Authors:
  • Ching Y. Suen

  • Affiliations:
  • Senior Member, IEEE; Department of Computer Science, Concordia University, Montreal, P.Q., Canada; Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA

  • Venue:
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Year:
  • 1979

Abstract

n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus of one million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpora. The positional distributions of the n-grams obtained in the present study are discussed, and statistical studies of word length and of trends in n-gram frequency versus vocabulary size are presented. In addition, a survey of n-gram statistics found in the literature is given, and a collection of n-gram statistics obtained by other researchers is reviewed and compared.
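To illustrate the kind of counts the abstract describes, the sketch below tabulates letter n-gram frequencies (n = 1 to 5) and within-word positional distributions from a list of words. It is a minimal Python sketch under simple assumptions (lowercased words, no punctuation handling); the function names and the toy sample are hypothetical and do not reproduce the paper's actual procedure or corpus.

```python
from collections import Counter

def ngram_statistics(words, max_n=5):
    """Count letter n-grams (n = 1..max_n) over a list of words.

    Returns a dict mapping n to a Counter of n-gram frequencies.
    Illustrative sketch only, not the procedure used in the paper.
    """
    stats = {n: Counter() for n in range(1, max_n + 1)}
    for word in words:
        w = word.lower()
        for n in range(1, max_n + 1):
            for i in range(len(w) - n + 1):
                stats[n][w[i:i + n]] += 1
    return stats

def positional_distribution(words, n=2):
    """Tally how often each n-gram starts at each position within a word
    (position 0 = word-initial)."""
    positions = Counter()
    for word in words:
        w = word.lower()
        for i in range(len(w) - n + 1):
            positions[(w[i:i + n], i)] += 1
    return positions

if __name__ == "__main__":
    # Toy sample; the paper's statistics were derived from a
    # one-million-word corpus, not from text this small.
    sample = "the quick brown fox jumps over the lazy dog".split()
    stats = ngram_statistics(sample)
    print(stats[2].most_common(5))                     # most frequent bigrams
    print(positional_distribution(sample, 2).most_common(5))
```

On a full corpus, the same counters would be normalized into relative frequencies and broken down by word position and word length, which is the form in which such statistics are typically reported.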