Automatic word clustering in Russian texts

  • Authors:
  • Olga Mitrofanova; Anton Mukhin; Polina Panicheva; Vyacheslav Savitsky

  • Affiliations:
  • Department of Mathematical Linguistics, Faculty of Philology, St.-Petersburg State University, St.-Petersburg, Russia (all authors)

  • Venue:
  • TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
  • Year:
  • 2007

Abstract

The paper deals with the development and application of an automatic word clustering (AWC) tool for processing Russian texts of various types; the tool is designed to be flexible and compatible with other linguistic resources. Constructing the AWC tool requires a computer implementation of latent semantic analysis (LSA) combined with clustering algorithms. To meet this need, Python-based software has been developed. The major procedures performed by the AWC tool are segmentation of input texts and context analysis, co-occurrence matrix construction, and agglomerative and K-means clustering. Special attention is given to experimental results on clustering words in raw texts under varying parameters.
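
The pipeline named in the abstract (segmentation, context analysis, co-occurrence matrix construction, agglomerative clustering) can be illustrated with a minimal standard-library sketch. This is not the authors' AWC software: the function names (`tokenize`, `cooccurrence`, `cosine`, `agglomerative`), the symmetric context window, the cosine similarity measure, and average linkage are all assumptions made for illustration. The LSA dimensionality reduction step and K-means clustering are omitted to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Segment raw text into lowercase word tokens (Unicode-aware, so Cyrillic works)."""
    return re.findall(r"\w+", text.lower())


def cooccurrence(tokens, window=2):
    """Build sparse co-occurrence rows: for each word, count the words
    found within +/- `window` positions of its occurrences."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering: start with one cluster per
    word and repeatedly merge the most similar pair of clusters."""
    clusters = [[w] for w in sorted(vectors)]

    def sim(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy input, invented for this example only.
    text = "Кошка ест рыбу. Собака ест мясо. Кошка пьёт молоко. Собака пьёт воду."
    rows = cooccurrence(tokenize(text), window=1)
    for cluster in agglomerative(rows, n_clusters=3):
        print(cluster)
```

The window size and the target number of clusters are the kind of parameters the abstract describes as being varied in the experiments; in this sketch they are plain function arguments. A fuller version would lemmatize the Russian word forms and apply a truncated SVD to the co-occurrence matrix before clustering.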