An empirical study on dimensionality optimization in text mining for linguistic knowledge acquisition

Authors:
Yu-Seop Kim;Jeong-Ho Chang;Byoung-Tak Zhang
Affiliations:
Division of Information and Telecommunication Engineering, Hallym University, Kang-Won, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea;School of Computer Science and Engineering, Seoul National University, Seoul, Korea
Venue:
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2003

Abstract

In this paper, we empirically search for the optimal dimensionality of two data-driven models, the Latent Semantic Analysis (LSA) model and the Probabilistic Latent Semantic Analysis (PLSA) model. These models are used to build linguistic semantic knowledge, which can then be used to estimate contextual semantic similarity for target word selection in English-Korean machine translation. We also employ the k-Nearest Neighbor (k-NN) learning algorithm. We broaden our experiments by analyzing the covariance between the value of k in k-NN learning and selection accuracy, in addition to the covariance between dimensionality and accuracy. Although we could not find a regular relationship between dimensionality and accuracy, we were able to identify the optimal dimensionality, namely the one that showed the most sound distribution of data in our experiments.
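The covariance analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the unbiased sample covariance and, for comparability across sweeps, the Pearson correlation; the abstract does not say which estimator the authors used. The function names and all numeric values below are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def sample_covariance(xs, ys):
    """Unbiased sample covariance between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Covariance rescaled to [-1, 1] so that different sweeps are comparable."""
    var_x = sample_covariance(xs, xs)
    var_y = sample_covariance(ys, ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sample_covariance(xs, ys) / sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT results from the paper:
# one selection accuracy per tested latent dimensionality.
dimensionalities = [50, 100, 150, 200, 250]
accuracies = [0.60, 0.64, 0.62, 0.66, 0.63]

cov = sample_covariance(dimensionalities, accuracies)
corr = pearson_correlation(dimensionalities, accuracies)
print(f"covariance={cov:.4f}, correlation={corr:.4f}")
```

The same two functions apply unchanged to a sweep over k in k-NN: replace the list of dimensionalities with the tested values of k and pair each with its measured accuracy.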