Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First, to handle graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We assign two weights to this graph, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate the "most semantic subcharacter path" for each character. Finally, we claim that adding the information contained in these paths to unigrams increases the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totalling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, achieved with a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
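
The pipeline described above can be illustrated with a minimal sketch. The abstract does not say how a "most semantic subcharacter path" is computed, so the sketch assumes a greedy rule: from each character, follow the inclusion edge with the highest semanticity until a character with no listed subcharacters is reached. The graph `INCLUSION`, its weights, the function names, and the `sub:` feature prefix are all hypothetical, chosen only for illustration. Allographic classes and phoneticity are left out.

```python
from typing import Dict, List, Tuple

# Hypothetical toy inclusion graph: character -> [(subcharacter, semanticity)].
# The weights are made-up illustration values, not data from the paper.
Graph = Dict[str, List[Tuple[str, float]]]

INCLUSION: Graph = {
    "媽": [("女", 0.9), ("馬", 0.1)],
    "河": [("氵", 0.8), ("可", 0.2)],
    "可": [("口", 0.5), ("丁", 0.3)],
}


def most_semantic_path(char: str, graph: Graph = INCLUSION) -> List[str]:
    """Greedily follow the highest-semanticity edge until a leaf is reached.

    This greedy rule is an assumption of the sketch, not the paper's definition.
    """
    path: List[str] = []
    seen = {char}
    current = char
    while graph.get(current):
        sub, _weight = max(graph[current], key=lambda edge: edge[1])
        if sub in seen:  # guard against cycles in malformed data
            break
        path.append(sub)
        seen.add(sub)
        current = sub
    return path


def features(text: str, graph: Graph = INCLUSION) -> List[str]:
    """Return character unigrams, each followed by its path components."""
    feats: List[str] = []
    for ch in text:
        feats.append(ch)
        # The prefix keeps subcharacter features distinct from plain unigrams.
        feats.extend(f"sub:{s}" for s in most_semantic_path(ch, graph))
    return feats
```

The token list returned by `features` could then be fed to any bag-of-features vectorizer ahead of a linear classifier. Prefixing the subcharacter tokens is one possible design choice: it prevents a component such as 女 from being conflated with the same character occurring on its own in the text.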