In this paper, the authors address the significance and complexity of tokenization, the initial step of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization, and practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs, and fixed expressions, are developed.
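To make the compound-token idea concrete, here is a minimal sketch (not the authors' algorithm) of greedy longest-match tokenization over a lexicon that contains multiword entries, so that fixed expressions such as phrasal verbs are retrieved as single tokens; the lexicon and example sentence are hypothetical:

```python
def tokenize(words, lexicon, max_len=4):
    """Greedy longest-match tokenization.

    words: list of whitespace-separated word forms.
    lexicon: set of known multiword expressions (space-joined).
    At each position, the longest candidate found in the lexicon
    is emitted as one compound token; otherwise a single word is.
    """
    tokens = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens

# Hypothetical multiword lexicon for illustration only.
lexicon = {"give up", "New York"}
print(tokenize("he decided to give up in New York".split(), lexicon))
# → ['he', 'decided', 'to', 'give up', 'in', 'New York']
```

Greedy longest match is only one strategy; it cannot revise an early choice, which is one reason ambiguity resolution (as in the Chinese word segmentation case) needs more elaborate models.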