Natural language interfaces are an important research topic in natural language processing (NLP), and natural language interaction may be the most natural and efficient way for humans to communicate with robots. To build speech-enabled natural language interfaces for robots, our research goal is to study the problems in this area and to develop technologies that can improve human-robot interaction. In particular, we present a learning method for building domain-specific language models (LMs) for natural language user interfaces. The method uses a small amount of domain-specific data as seeds to tap the domain-specific resources residing in a much larger amount of general-domain data, with the help of topic modeling. The proposed algorithm first performs topic decomposition (TD) on the combined domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives weighted domain-specific word n-gram counts from the PLSA mixture model. Finally, it constructs domain-specific LMs from these weighted n-gram counts using the traditional n-gram modeling approach. Experimental results show that this approach outperforms both state-of-the-art methods and traditional supervised learning. Moreover, the semi-supervised learning method achieves good performance even with a very small amount of domain-specific data.
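To make the three-step pipeline concrete, the following Python is a minimal, illustrative sketch, not the paper's implementation. All names (doc_term_matrix, plsa, weighted_bigram_counts, bigram_prob) and the toy corpora are hypothetical, and since the abstract does not specify the exact PLSA-based weighting scheme, the sketch adopts one plausible instantiation: each general-domain document's bigram counts are weighted by its posterior topic mass P(z|d) on topics that are overrepresented in the seed data, and an additively smoothed bigram model is then estimated from the weighted counts.

    import numpy as np
    from collections import Counter

    def doc_term_matrix(docs):
        """Build a document-word count matrix and vocabulary from whitespace tokens."""
        vocab = sorted({w for d in docs for w in d.split()})
        idx = {w: i for i, w in enumerate(vocab)}
        X = np.zeros((len(docs), len(vocab)))
        for i, d in enumerate(docs):
            for w in d.split():
                X[i, idx[w]] += 1.0
        return X, vocab

    def plsa(counts, n_topics, n_iter=100, seed=0):
        """Fit PLSA by EM. counts is (n_docs, n_words) holding n(d, w).
        Returns P(w|z) with shape (n_words, n_topics) and P(z|d) with shape
        (n_topics, n_docs)."""
        rng = np.random.default_rng(seed)
        n_docs, n_words = counts.shape
        p_w_z = rng.random((n_words, n_topics))
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = rng.random((n_topics, n_docs))
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
        for _ in range(n_iter):
            # E-step: posterior P(z|d,w) proportional to P(w|z) P(z|d)
            post = p_w_z[None, :, :] * p_z_d.T[:, None, :]
            post /= post.sum(axis=2, keepdims=True) + 1e-12
            # M-step: re-estimate both distributions from expected counts n(d,w) P(z|d,w)
            expected = counts[:, :, None] * post
            p_w_z = expected.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
            p_z_d = expected.sum(axis=1).T
            p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
        return p_w_z, p_z_d

    def weighted_bigram_counts(docs, doc_weights):
        """Accumulate bigram counts, scaling each document's counts by its weight."""
        counts = Counter()
        for doc, w in zip(docs, doc_weights):
            toks = ["<s>"] + doc.split() + ["</s>"]
            for a, b in zip(toks, toks[1:]):
                counts[(a, b)] += w
        return counts

    def bigram_prob(counts, a, b, vocab_size, alpha=0.1):
        """Additively smoothed bigram probability P(b|a) from weighted counts."""
        ctx_total = sum(c for (x, _), c in counts.items() if x == a)
        return (counts.get((a, b), 0.0) + alpha) / (ctx_total + alpha * vocab_size)

    # Hypothetical toy corpora: a few domain-specific seeds plus general-domain text.
    seed_docs = ["move the robot arm left", "turn the robot to the right and stop"]
    general_docs = ["the stock market fell again today",
                    "ask the robot to turn left at the corner",
                    "the weather today is sunny and warm"]
    docs = seed_docs + general_docs

    X, vocab = doc_term_matrix(docs)
    p_w_z, p_z_d = plsa(X, n_topics=2)

    # Assumption: topics carrying above-uniform mass on the seed documents are
    # treated as "domain" topics; the paper's own scheme may differ.
    avg_seed_mass = p_z_d[:, :len(seed_docs)].mean(axis=1)
    domain_topics = np.where(avg_seed_mass > 1.0 / p_z_d.shape[0])[0]

    # Seed documents keep full weight; general-domain documents are down-weighted
    # by their posterior mass on the domain topics.
    weights = [1.0] * len(seed_docs)
    weights += [p_z_d[domain_topics, len(seed_docs) + i].sum()
                for i in range(len(general_docs))]

    bigrams = weighted_bigram_counts(docs, weights)
    print(bigram_prob(bigrams, "robot", "to", len(vocab) + 2))  # +2 for <s>, </s>

In a realistic system the PLSA step would run at a much larger scale and the final LM would be built with a standard toolkit using stronger smoothing (e.g., Kneser-Ney) rather than add-alpha; the sketch only fixes the data flow the abstract describes: topic decomposition, count weighting, then conventional n-gram estimation.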