Chinese is generally regarded as a Subject-Verb-Object (SVO) language whose basic semantic unit is the word, which usually consists of two or more characters. However, this word-centered view of Chinese structure has been controversial in linguistics. Recent research in Chinese computational linguistics suggests that character-based models outperform word-based models in some applications, such as word segmentation. In this paper, word-based and character-based topic models are each tested for modeling Chinese. Our empirical studies demonstrate the effectiveness of using Chinese characters as the basic semantic units: the two models perform comparably in text classification, while the character-based model gives better language-modeling quality with a much smaller vocabulary. On a bilingual corpus, three independent topic models, based on Chinese words, Chinese characters, and English words, are trained and compared with one another. Through these experiments across Chinese and English, we verify the capability of topic models to model semantics. Classification accuracy can be further improved by aggregating the classification results of the three independent topic models.
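
The abstract does not say how the three models' classification results are aggregated. As one possible reading, the sketch below combines per-document labels by simple majority vote. The function name `aggregate_votes`, the tie-breaking rule, and the example labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def aggregate_votes(predictions_per_model):
    """Combine per-document labels from several classifiers by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties are broken in favour of the earliest model in the list
    (an assumption; the paper does not specify a rule).
    """
    if not predictions_per_model:
        return []
    n_docs = len(predictions_per_model[0])
    if any(len(p) != n_docs for p in predictions_per_model):
        raise ValueError("all models must label the same documents")

    combined = []
    for labels in zip(*predictions_per_model):
        counts = Counter(labels)
        top = max(counts.values())
        # The first label, in model order, that reaches the top count wins.
        combined.append(next(label for label in labels if counts[label] == top))
    return combined


# Hypothetical predictions for three documents from the three models.
zh_word_preds = ["sports", "finance", "sports"]
zh_char_preds = ["sports", "sports", "tech"]
en_word_preds = ["finance", "finance", "tech"]

combined = aggregate_votes([zh_word_preds, zh_char_preds, en_word_preds])
print(combined)
```

Majority voting only needs each model's predicted label. If the classifiers also expose per-class scores, averaging those scores would be a natural alternative, again as an assumption rather than the authors' stated method.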