Using correlation dimension for analysing text data

Authors:
Ilkka Kivimäki;Krista Lagus;Ilari T. Nieminen;Jaakko J. Väyrynen;Timo Honkela
Affiliations:
Adaptive Informatics Research Centre, Aalto University School of Science and Technology;Adaptive Informatics Research Centre, Aalto University School of Science and Technology;Adaptive Informatics Research Centre, Aalto University School of Science and Technology;Adaptive Informatics Research Centre, Aalto University School of Science and Technology;Adaptive Informatics Research Centre, Aalto University School of Science and Technology
Venue:
ICANN'10 Proceedings of the 20th international conference on Artificial neural networks: Part I
Year:
2010

Abstract

In this article, we study the scale-dependent dimensionality and overall structure of text data using a method that measures correlation dimension at different scales. We analyse text data sets drawn from the Reuters and Europarl corpora and compare them with artificially generated point sets and with speech data. The results reflect some typical properties of the data, and we discuss how the method could help improve various data analysis applications.
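
The abstract does not spell out the estimator, so the following is only a rough illustration of what measuring correlation dimension at different scales can look like. It assumes the standard correlation integral C(r), the fraction of point pairs lying within distance r of each other, and takes the slope of log C(r) against log r between consecutive radii as a scale-dependent dimension estimate. The function names, the Euclidean metric, and the choice of radii are assumptions made for this sketch and are not taken from the paper.

```python
import math
from bisect import bisect_right
from itertools import combinations


def pairwise_distances(points):
    """Return the sorted Euclidean distances between all distinct point pairs."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def correlation_integral(sorted_dists, r):
    """C(r): fraction of point pairs whose distance is at most r."""
    return bisect_right(sorted_dists, r) / len(sorted_dists)


def local_correlation_dimension(points, radii):
    """Estimate the slope of log C(r) vs. log r between consecutive radii.

    `radii` must be strictly increasing and positive. Returns a list of
    ((r_lo, r_hi), slope) tuples; intervals where C(r_lo) is zero are
    skipped because the logarithm is undefined there.
    """
    dists = pairwise_distances(points)
    estimates = []
    for r_lo, r_hi in zip(radii, radii[1:]):
        c_lo = correlation_integral(dists, r_lo)
        c_hi = correlation_integral(dists, r_hi)
        if c_lo > 0 and c_hi > 0:
            slope = (math.log(c_hi) - math.log(c_lo)) / (
                math.log(r_hi) - math.log(r_lo)
            )
            estimates.append(((r_lo, r_hi), slope))
    return estimates
```

Called on document vectors (for example, term-frequency vectors) with a geometric sequence of radii, such a function yields one slope per scale interval. Points sampled along a line give slopes near 1 and points sampled in a plane give slopes near 2, which is the kind of comparison with artificial point sets that the abstract mentions. The all-pairs distance computation is quadratic in the number of points, so this sketch is only practical for small samples.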