An empirical study on dimensionality optimization in text mining for linguistic knowledge acquisition

  • Authors:
  • Yu-Seop Kim;Jeong-Ho Chang;Byoung-Tak Zhang

  • Affiliations:
  • Division of Information and Telecommunication Engineering, Hallym University, Kang-Won, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea

  • Venue:
  • PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we try to find empirically the optimal dimensionality in data-driven models, Latent Semantic Analysis (LSA) model and Probabilistic Latent Semantic Analysis (PLSA) model. These models are used for building linguistic semantic knowledge which could be used in estimating contextual semantic similarity for the target word selection in English-Korean machine translation. We also facilitate k-Nearest Neighbor learning algorithm. We diversify our experiments by analyzing the covariance between the value of k in k-NN learning and accuracy of selection, in addition to that between the dimensionality and the accuracy. While we could not find regular tendency of relationship between the dimensionality and the accuracy, however, we could find the optimal dimensionality having the most sound distribution of data during experiments.