Critical tokenization and its properties

  • Authors: Jin Guo
  • Affiliation: National University of Singapore
  • Venue: Computational Linguistics
  • Year: 1997


Abstract

Tokenization is the process of mapping sentences from character strings into strings of words. This paper studies critical tokenization, a distinctive type of tokenization that follows the principle of maximum tokenization; the objective is to develop a mathematical description and understanding of it.

The main results are as follows: (1) critical points are all and only the unambiguous token boundaries of any character string over a complete dictionary; (2) any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings under the word-string cover relation; (3) any tokenized word string can be reproduced from a critically tokenized word string, but not vice versa; (4) critical tokenization provides a sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, giving a precise mathematical account of conventional concepts such as combinational and overlapping ambiguities; (5) many important maximum-tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization.

It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
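As a concrete illustration of one of the maximum-tokenization variants the abstract names, the following is a minimal sketch of forward maximum matching (FMM), not code from the paper itself. The dictionary and test string are hypothetical; at each position the tokenizer greedily takes the longest dictionary word that matches.

```python
def forward_maximum_matching(sentence, dictionary):
    """Greedily split `sentence` into the longest dictionary words,
    scanning left to right (forward maximum matching)."""
    max_len = max(len(w) for w in dictionary)
    tokens, i = [], 0
    while i < len(sentence):
        # Try candidate spans from longest to shortest.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary:
                tokens.append(sentence[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: fall back to one character.
            tokens.append(sentence[i])
            i += 1
    return tokens

# A classic overlapping-ambiguity case: "abc" with words "ab", "bc", "c", "a".
# FMM commits to "ab" first, so the alternative segmentation "a" + "bc"
# is never produced.
words = {"ab", "bc", "c", "a"}
print(forward_maximum_matching("abc", words))  # -> ['ab', 'c']
```

Backward maximum matching is the mirror image (scanning right to left), and the two can disagree exactly in overlapping-ambiguity cases like the one above; the paper's point is that such variants are all subclasses of critical tokenization.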