New perspectives in sinographic language processing through the use of character structure

Authors:
Yannis Haralambous
Affiliations:
Lab-STICC UMR CNRS 6285, Institut Télécom - Télécom Bretagne, France
Venue:
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
Year:
2013

Abstract

Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enrich the text model and obtain better results in standard NLP operations. First, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We equip this graph with two weights, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate "most semantic subcharacter paths" for each character. Finally, by adding the information contained in these paths to unigrams, we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totaling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, obtained by a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
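
The pipeline outlined in the abstract can be illustrated with a small sketch. The abstract does not specify how the "most semantic subcharacter path" is computed, so the version below makes loud assumptions: the path is built greedily by following the highest-semanticity inclusion edge at each step, the toy graph and its weights are invented for illustration, and the names `SEMANTICITY`, `most_semantic_path`, and `augmented_unigrams` are hypothetical rather than taken from the paper.

```python
from collections import Counter

# Hypothetical toy inclusion graph: character -> {subcharacter: semanticity}.
# The weights (in [0, 1]) are invented for illustration only.
SEMANTICITY = {
    "媽": {"女": 0.9, "馬": 0.1},   # "mother": 女 (woman) is the semantic part
    "河": {"氵": 0.8, "可": 0.1},   # "river": 氵 (water) is the semantic part
    "可": {"口": 0.3, "丁": 0.1},
}


def most_semantic_path(char, graph):
    """Greedy assumption: follow the highest-semanticity edge until a leaf."""
    path = []
    seen = {char}          # guard against cycles in malformed data
    current = char
    while True:
        children = {
            sub: weight
            for sub, weight in graph.get(current, {}).items()
            if sub not in seen
        }
        if not children:
            return path
        current = max(children, key=children.get)
        seen.add(current)
        path.append(current)


def augmented_unigrams(text, graph):
    """Count character unigrams plus the subcharacters on each character's path."""
    features = Counter()
    for char in text:
        features[char] += 1
        for sub in most_semantic_path(char, graph):
            features["sub:" + sub] += 1   # prefix keeps the two feature spaces apart
    return features


if __name__ == "__main__":
    print(most_semantic_path("媽", SEMANTICITY))
    print(augmented_unigrams("媽河", SEMANTICITY))
```

The resulting counts could then be fed to any bag-of-features classifier, such as the linear SVM mentioned in the abstract. A real implementation would operate on allographic classes rather than raw characters and would derive the weights from data.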