Tokenization is the process of mapping sentences from character strings into strings of words. This paper studies critical tokenization, a distinctive type of tokenization that follows the principle of maximum tokenization. The objective is to develop its mathematical description and understanding.

The main results are as follows: (1) Critical points are all and only unambiguous token boundaries for any character string on a complete dictionary; (2) Any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings with respect to the word string cover relation; (3) Any tokenized string can be reproduced from a critically tokenized word string but not vice versa; (4) Critical tokenization forms a sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, giving a precise mathematical understanding of conventional concepts such as combinational and overlapping ambiguities; (5) Many important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization.

It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
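
Results (1), (2), and (5) can be illustrated on a toy input. The following Python sketch is a brute-force illustration and not the paper's algorithm. It rests on several assumptions: an unambiguous token boundary is taken to be a position that is a boundary in every possible tokenization; a "minimal element" under the cover relation is read as a tokenization whose boundary set has no proper subset among the boundary sets of the other tokenizations; and the dictionary is complete in the sense that every single character is a word. All function names, the string `fundsandit`, and the small lexicon are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def tokenizations(s, dictionary):
    """Return every split of s into dictionary words (exhaustive, exponential)."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        word = s[:end]
        if word in dictionary:
            results.extend([word] + rest
                           for rest in tokenizations(s[end:], dictionary))
    return results


def boundaries(words):
    """Token boundary positions of a word string, including 0 and the end."""
    return {0, *accumulate(len(w) for w in words)}


def critical_points(s, dictionary):
    """Positions that are boundaries in every tokenization (assumed reading).

    Assumes at least one tokenization exists, which a complete dictionary
    guarantees.
    """
    all_bounds = [boundaries(t) for t in tokenizations(s, dictionary)]
    return sorted(set.intersection(*all_bounds))


def critical_tokenizations(s, dictionary):
    """Tokenizations whose boundary set is minimal under set inclusion."""
    toks = tokenizations(s, dictionary)
    bounds = [boundaries(t) for t in toks]
    return [t for t, b in zip(toks, bounds)
            if not any(other < b for other in bounds)]


def forward_maximum_matching(s, dictionary):
    """Greedy left-to-right: always take the longest word starting at i."""
    words, i = [], 0
    while i < len(s):
        end = max(j for j in range(i + 1, len(s) + 1) if s[i:j] in dictionary)
        words.append(s[i:end])
        i = end
    return words


def backward_maximum_matching(s, dictionary):
    """Greedy right-to-left: always take the longest word ending at j."""
    words, j = [], len(s)
    while j > 0:
        start = min(i for i in range(j) if s[i:j] in dictionary)
        words.append(s[start:j])
        j = start
    return words[::-1]


# Hypothetical toy input: every single character is a word (complete dictionary).
text = "fundsandit"
lexicon = set(text) | {"fun", "fund", "funds", "and", "sand", "it"}

print(critical_points(text, lexicon))
# [0, 8, 10]
print(critical_tokenizations(text, lexicon))
# [['fund', 'sand', 'it'], ['funds', 'and', 'it']]
print(forward_maximum_matching(text, lexicon))
# ['funds', 'and', 'it']
print(backward_maximum_matching(text, lexicon))
# ['fund', 'sand', 'it']
```

On this input, forward and backward maximum matching return different word strings, yet both appear among the critical tokenizations, which is consistent with result (5). Every other tokenization, such as `fun d s and i t`, only adds boundaries to one of the two critical ones, in line with result (3).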