USeg: A Retargetable Word Segmentation Procedure for Information Retrieval

  • Authors:
  • J. Ponte

  • Venue:
  • USeg: A Retargetable Word Segmentation Procedure for Information Retrieval
  • Year:
  • 1996

Abstract

Many languages, such as Chinese, are written without interword delimiters. For these languages, a segmenter is required as a pre-processing step for information retrieval systems. We describe USeg, a platform for word segmentation designed to fulfill the requirements imposed by the information retrieval task. USeg is based on an underlying probabilistic automaton which serves as a simple language model. A description of the proposed model(s), implementation issues for these models, and experimental results are presented. The experiments show that a fairly simple underlying model can produce reasonable segmentation results, can do so quickly enough to be useful for indexing in an information retrieval system, and can be re-targeted to new languages without a great deal of human effort.
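
The abstract does not specify the automaton, so the following is only a minimal sketch of the general idea of segmenting undelimited text with a simple probabilistic language model. It assumes a unigram word model (the simplest case, where each word has a fixed probability) and finds the most probable segmentation by dynamic programming. The lexicon `WORD_PROBS`, its probabilities, the function name `segment`, and the `unknown_prob` fallback are all hypothetical and are not taken from the report; English text without spaces stands in for a language written without delimiters.

```python
import math

# Hypothetical toy lexicon; the probabilities are invented for illustration.
WORD_PROBS = {
    "the": 0.4,
    "cat": 0.2,
    "cats": 0.1,
    "sat": 0.2,
    "at": 0.1,
}


def segment(text, word_probs, unknown_prob=1e-6):
    """Return the most probable split of `text` under a unigram word model.

    Any single character missing from the lexicon is treated as an unknown
    one-character word with probability `unknown_prob` (an assumption made
    here so that every input can be segmented).
    """
    max_len = max(map(len, word_probs), default=1)
    n = len(text)
    # best[i] is the best log-probability of any segmentation of text[:i].
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    # back[i] is the start index of the last word in that best segmentation.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            prob = word_probs.get(piece)
            if prob is None:
                if end - start > 1:
                    continue  # unknown multi-character strings are not words
                prob = unknown_prob
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(text[start:pos])
        pos = start
    words.reverse()
    return words


print(segment("thecatsat", WORD_PROBS))
```

With this toy lexicon, "the cat sat" scores higher than "the cats at" because the product of its word probabilities is larger. Retargeting such a sketch to another language would amount to swapping in a lexicon estimated from that language's text; how USeg itself is retargeted is described in the report, not here.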