USe: A Retargetable Word Segmentation Procedure for Information Retrieval

Authors:
J. Ponte
Affiliations:
-
Venue:
USe: A Retargetable Word Segmentation Procedure for Information Retrieval
Year:
1996

Citing 0
Cited 8

Overlapping statistical word indexing: a new indexing method for Japanese text

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Problems of music information retrieval in the real world

Information Processing and Management: an International Journal
A compression-based algorithm for Chinese word segmentation

Computational Linguistics
A trainable rule-based algorithm for word segmentation

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Chinese text segmentation with MBDP-1: making the most of training corpora

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Learning case-based knowledge for disambiguating Chinese word segmentation: a preliminary study

SIGHAN '02 Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18
Integrating unsupervised and supervised word segmentation: The role of goodness measures

Information Sciences: an International Journal
Unsupervised segmentation of chinese corpus using accessor variety

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many languages, such as Chinese, are written without interword delimiters. For these languages, a segmenter is required as a pre-processing step for information retrieval systems. We describe USeg, a platform for word segmentation designed to fulfill the requirments imposed by the information retrieval task. USeg is based on an underlying probabalistic automaton which serves as a simple language model. A description of the proposed model(s), implementation issues for these models and experimental results are presented. The experiments show that a fairly simple underlying model can produce reasonable segmentation results, can do so quickly enough to be useful for indexing in an information retrieval system and can be re-targeted to new languages without a great deal of human effort.