Many languages, such as Chinese, are written without interword delimiters. For these languages, a segmenter is required as a pre-processing step for information retrieval systems. We describe USeg, a platform for word segmentation designed to fulfill the requirements imposed by the information retrieval task. USeg is based on an underlying probabilistic automaton that serves as a simple language model. We present the proposed models, the implementation issues they raise, and experimental results. The experiments show that a fairly simple underlying model can produce reasonable segmentation results, can do so quickly enough to be useful for indexing in an information retrieval system, and can be retargeted to new languages without a great deal of human effort.
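
The abstract does not specify USeg's automaton, so the following is only an illustrative sketch of how a simple probabilistic language model can drive segmentation. It assumes a unigram word model (roughly a one-state probabilistic automaton) and finds the highest-probability segmentation by dynamic programming. The function name `segment`, the parameters `max_len` and `unknown_prob`, and the toy probabilities are all hypothetical and are not taken from USeg.

```python
import math


def segment(text, word_probs, max_len=4, unknown_prob=1e-8):
    """Return the most probable segmentation of `text` under a unigram model.

    Assumption: words are independent, so a segmentation's probability is
    the product of its word probabilities. This is NOT USeg's actual model.
    """
    n = len(text)
    # best[i] holds the log-probability of the best segmentation of text[:i].
    best = [0.0] + [-math.inf] * n
    # back[i] holds the start index of the last word in that segmentation.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_probs:
                p = word_probs[piece]
            elif len(piece) == 1:
                # Fallback so that every position stays reachable.
                p = unknown_prob
            else:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy probabilities, invented purely for illustration.
toy_probs = {"中国": 0.3, "人民": 0.3, "国人": 0.1,
             "中": 0.1, "人": 0.1, "国": 0.05, "民": 0.05}
print(segment("中国人民", toy_probs))
```

Because the search is linear in the text length for a fixed `max_len`, a model of this kind is cheap enough to run at indexing time, and moving to another language only requires a new probability table.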