A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

Authors:
Md. Aminul Islam;Diana Inkpen;Iluju Kiringa
Affiliations:
School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
Venue:
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses type frequency information from a corpus to select, from "desegmented" text, the types with maximum length and highest frequency. If any portions of the text remain unmatched, it applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. With some additions to the algorithms, the method can also be extended to a dictionary-based or hybrid method. Evaluation results show that our method outperforms several competing methods.
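
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea of preferring longer, then more frequent, types when segmenting unspaced text. It is not the authors' algorithm: the function name `mldf_segment`, the `type_freq` mapping, and the toy frequencies are all hypothetical, and the forward-backward matching and entropy-rate step for unmatched portions is not implemented (unmatched runs are simply returned as single chunks).

```python
def mldf_segment(text, type_freq):
    """Segment `text` by claiming spans for known types, longest first and,
    among types of equal length, most frequent first (illustrative only)."""
    # Order candidate types by descending length, then descending frequency.
    ordered = sorted(type_freq, key=lambda t: (-len(t), -type_freq[t]))

    # owner[i] holds the start index of the span covering position i,
    # or None if no type has claimed that position yet.
    owner = [None] * len(text)

    for t in ordered:
        if not t:
            continue  # skip empty strings, which would match everywhere
        start = text.find(t)
        while start != -1:
            end = start + len(t)
            if all(o is None for o in owner[start:end]):
                # Span is free: claim it and continue searching after it.
                for i in range(start, end):
                    owner[i] = start
                start = text.find(t, end)
            else:
                # Span overlaps an earlier (longer or more frequent) match.
                start = text.find(t, start + 1)

    # Read tokens off the ownership map; consecutive unclaimed positions
    # are grouped into one residual chunk.
    tokens = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and owner[j] == owner[i]:
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Hypothetical toy frequencies, invented for this example.
toy_freq = {"the": 50, "cat": 10, "sat": 8, "mat": 5, "on": 30, "at": 20, "he": 15}
print(mldf_segment("thecatsatonthemat", toy_freq))
print(mldf_segment("thexyzcat", toy_freq))
```

In this sketch the shorter types "at" and "he" never win, because every position where they occur has already been claimed by a longer type. The residual chunk "xyz" in the second call marks the kind of unmatched portion for which the abstract describes a separate matching step.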