A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

  • Authors:
  • Md. Aminul Islam;Diana Inkpen;Iluju Kiringa

  • Affiliations:
  • School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada

  • Venue:
  • CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2007

Abstract

In this paper, we formulate a generalized method of automatic word segmentation. The method uses type frequency information from a corpus to select, from "desegmented" text, the type with maximum length and frequency. Where portions of the text remain unmatched, it applies a modified forward-backward matching technique based on maximum length frequency and entropy rate. With some additions to the algorithms, the method can also be extended to a dictionary-based or hybrid method. Evaluation results show that our method outperforms several competing methods.
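
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the core idea as the title suggests it: order the corpus types by length (descending), then by frequency (descending), and match them in that order against text with the spaces removed. The function name `segment_by_length_then_frequency`, the toy frequency table, and the alphabetical tie-break are assumptions made for illustration. The forward-backward matching and entropy-rate steps for unmatched portions are not reproduced here; unmatched runs are simply returned as-is.

```python
from collections import Counter


def segment_by_length_then_frequency(text, type_freq):
    """Segment `text` (no spaces) using types ordered by length, then frequency.

    Illustrative sketch only, not the authors' published algorithm.
    Unmatched character runs are returned unchanged as single tokens.
    """
    # Longest types first; among equal lengths, most frequent first.
    # The final alphabetical key is an assumed tie-break for determinism.
    ordered = sorted(type_freq, key=lambda t: (-len(t), -type_freq[t], t))

    covered = [False] * len(text)  # characters already claimed by a type
    spans = []                     # (start, end) of every accepted match

    for word_type in ordered:
        if not word_type:
            continue  # an empty type would match everywhere
        start = text.find(word_type)
        while start != -1:
            end = start + len(word_type)
            if not any(covered[start:end]):
                spans.append((start, end))
                covered[start:end] = [True] * len(word_type)
                start = text.find(word_type, end)
            else:
                # Overlaps an earlier (longer or more frequent) match.
                start = text.find(word_type, start + 1)

    # Stitch matches and unmatched gaps back together in text order.
    spans.sort()
    tokens, pos = [], 0
    for start, end in spans:
        if start > pos:
            tokens.append(text[pos:start])
        tokens.append(text[start:end])
        pos = end
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens


# Toy type frequencies standing in for counts from a real corpus.
toy_freq = Counter({"the": 10, "on": 5, "at": 4, "cat": 3, "sat": 2, "mat": 2})
print(segment_by_length_then_frequency("thecatsatonthemat", toy_freq))
```

In this sketch the shorter type "at" never wins over "cat", "sat", or "mat", because the longer types claim those characters first. Handling the leftover unmatched runs is where the paper's entropy-rate step would apply.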