One tokenization per source

Authors: Jin Guo
Affiliations: Kent Ridge Digital Labs, Singapore
Venue: COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Year: 1998

Citing 8
Cited 1

A corpus-based approach to language learning

A stochastic finite-state word-segmentation algorithm for Chinese

Computational Linguistics
A statistically emergent approach for language processing: application to modeling context effects in ambiguous Chinese word boundary perception

Computational Linguistics
An Analytical Model of Scheduling for Conservative Parallel Simulation

Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Stochastic inversion transduction grammars and bilingual parsing of parallel corpora

Computational Linguistics
Critical tokenization and its properties

Computational Linguistics
One sense per discourse

HLT '91 Proceedings of the workshop on Speech and Natural Language
One sense per collocation

HLT '93 Proceedings of the workshop on Human Language Technology

An example-based study on Chinese word segmentation using critical fragments

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

Abstract

We report in this paper the observation of one tokenization per source. That is, the same critical fragment in different sentences from the same source almost always realizes one and the same of its many possible tokenizations. This observation is shown to be very helpful in sentence tokenization practice, and is argued to have far-reaching implications for natural language processing.
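
One way such an observation might be used in practice can be sketched as follows. This is a minimal illustration, not the paper's method: the class `SourceTokenizationMemory`, its methods, the source identifiers, and the example fragment are all hypothetical names chosen here. The sketch assumes some other component has already resolved an ambiguous fragment at least once within a source, and it simply reuses the most frequent resolution for later occurrences in that same source.

```python
from collections import Counter, defaultdict


class SourceTokenizationMemory:
    """Remembers, per source, how each ambiguous fragment was tokenized.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # source_id -> fragment -> Counter of tokenizations seen so far
        self._seen = defaultdict(lambda: defaultdict(Counter))

    def record(self, source_id, fragment, tokens):
        """Store one resolved tokenization of `fragment` for this source."""
        self._seen[source_id][fragment][tuple(tokens)] += 1

    def lookup(self, source_id, fragment):
        """Return the most frequent tokenization seen in this source, or None."""
        counts = self._seen.get(source_id, {}).get(fragment)
        if not counts:
            return None
        tokens, _ = counts.most_common(1)[0]
        return list(tokens)


memory = SourceTokenizationMemory()

# Assumed: the fragment was resolved once already in source "doc-1".
memory.record("doc-1", "fundsand", ["funds", "and"])

# Later occurrences in the same source reuse that resolution.
same_source = memory.lookup("doc-1", "fundsand")

# A different source has no history, so nothing is assumed about it.
other_source = memory.lookup("doc-2", "fundsand")
```

Returning `None` for an unseen source leaves the decision to whatever general-purpose disambiguator is in use, so the memory only ever acts on evidence from the same source.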