WordICA: Emergence of linguistic representations for words by independent component analysis

  • Authors:
  • Timo Honkela; Aapo Hyvärinen; Jaakko J. Väyrynen

  • Affiliations:
  • Adaptive Informatics Research Centre, Aalto University School of Science and Technology, P.O. Box 15400, FI-00076 Aalto, Finland, e-mail: timo.honkela@tkk.fi; Department of Mathematics and Statistics, Department of Computer Science, University of Helsinki, P.O. Box 68, FI-00014 University of Helsinki, Finland and Helsinki Institute for Information Techn ...; Adaptive Informatics Research Centre, Aalto University School of Science and Technology, P.O. Box 15400, FI-00076 Aalto, Finland

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2010

Abstract

We explore the use of independent component analysis (ICA) for the automatic extraction of linguistic roles or features of words. The extraction is based on the unsupervised analysis of text corpora. We contrast ICA with singular value decomposition (SVD), which is widely used in statistical text analysis in general and in latent semantic analysis (LSA) in particular. The representations found by SVD, however, cannot easily be interpreted by humans. In contrast, ICA applied to word context data gives distinct features that reflect linguistic categories. In this paper, we justify our approach, called WordICA; present the method in detail; compare the results with traditional linguistic categories and with those of an SVD-based method; and discuss the use of the method in practical natural language engineering solutions such as machine translation systems. Because the WordICA method is based on unsupervised learning and thus provides a general means for efficient knowledge acquisition, we foresee that the approach has clear potential for practical applications.
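To make the contrast between SVD and ICA on word context data concrete, the following is a minimal sketch and not the authors' implementation. It builds a small word-by-context count matrix, reduces it with SVD, and then rotates the whitened SVD features with a basic symmetric FastICA iteration. The toy corpus, the window size, the log scaling, the tanh nonlinearity, and all function names are assumptions made for illustration only.

```python
import numpy as np

def context_matrix(tokens, vocab, window=1):
    """Log-scaled counts of vocab words seen within `window` tokens of each word."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                counts[index[word], index[tokens[ctx]]] += 1
    return np.log1p(counts)  # log scaling is an assumption of this sketch

def svd_reduce(X, k):
    """Return (SVD features, whitened features), each of shape words x k."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], U[:, :k] * np.sqrt(X.shape[0])

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ W

def fast_ica(Z, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA with a tanh nonlinearity on whitened data Z (samples x k)."""
    n, k = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)                      # nonlinearity per component
        G_prime_mean = (1.0 - G ** 2).mean(axis=0)  # mean derivative per component
        W_new = sym_decorrelate((G.T @ Z) / n - np.diag(G_prime_mean) @ W)
        # Converged when each new row is (anti)parallel to the old one.
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    return Z @ W.T  # estimated independent features, one row per word

# Hypothetical toy corpus; real experiments would use a large text collection.
corpus = ("the cat sat on the mat . the dog sat on the rug . a cat saw a dog . "
          "a dog saw the cat . the cat ran . a dog ran .")
tokens = corpus.split()
vocab = sorted(set(tokens))
k = 3

X = context_matrix(tokens, vocab)
svd_feats, whitened = svd_reduce(X, k)
ica_feats = fast_ica(whitened)

# List the words with the strongest response on each ICA feature.
for comp in range(k):
    top = np.argsort(-np.abs(ica_feats[:, comp]))[:3]
    print(comp, [vocab[i] for i in top])
```

ICA here only rotates the whitened SVD subspace, so both feature sets span the same space. The difference lies in the choice of axes: the rotation seeks maximally non-Gaussian directions, which is the property the abstract associates with more interpretable features. A corpus this small is far too limited to show meaningful linguistic categories.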