A clustering study of a 7000 EU document inventory using MDS and SOM

Authors:
Patrick A. De Mazière; Marc M. Van Hulle
Affiliations:
Laboratorium voor Neuro-en Psychofysiologie, K.U.Leuven, Leuven, Belgium; Laboratorium voor Neuro-en Psychofysiologie, K.U.Leuven, Leuven, Belgium
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

In this article, we discuss a number of methods and tools for clustering an inventory of 7000 documents in order to evaluate the impact of EU-funded research in the social sciences and humanities on EU policies. The inventory, which is not publicly available but was provided to us by the European Union (EU) in the framework of an EU project, could be divided into three main categories: research documents, influential policy documents, and policy documents. To present the results in a form that non-experts could use, we explored and compared two visualisation techniques, multi-dimensional scaling (MDS) and the self-organising map (SOM), together with one of the latter's derivatives, the U-matrix. Unlike most other approaches, which analyse only document titles and abstracts, we performed a full-text analysis of more than 300,000 pages in total. Because many software suites cannot handle text mining problems of this size, we developed our own analysis platform. We show that the combination of a U-matrix and an MDS map, which is rarely used in the domain of text mining, reveals information that would otherwise go unnoticed. Furthermore, we show that combining a database, which stores the data and the (intermediate) results, with a web server, which visualises the results, offers a powerful platform for analysing the data and sharing the results with all participants and collaborators involved in a data- and computation-intensive EU project, thereby guaranteeing the consistency of both data and results.
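
To make the SOM and U-matrix mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' analysis platform: the toy term-frequency vectors, the grid size, and the linearly decaying learning rate and neighbourhood width are all assumptions chosen for illustration, and the MDS step is left out. The U-matrix here is taken as the mean distance between each unit's weight vector and those of its four grid neighbours, so high values suggest borders between clusters.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def best_matching_unit(weights, v):
    """Return (row, col) of the unit whose weight vector is closest to v."""
    cells = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(cells, key=lambda rc: dist(weights[rc[0]][rc[1]], v))


def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a rectangular SOM with a Gaussian neighbourhood (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays to ~0
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # width shrinks towards 0.5
        for v in rng.sample(data, len(data)):        # visit documents in random order
            br, bc = best_matching_unit(weights, v)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (v[i] - w[i])
    return weights


def u_matrix(weights):
    """Mean distance from each unit to its 4-connected grid neighbours."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            u[r][c] = sum(dist(weights[r][c], weights[nr][nc])
                          for nr, nc in nbrs) / len(nbrs)
    return u


# Hypothetical term counts for four documents over a four-term vocabulary:
# the first two documents share vocabulary, as do the last two.
docs = [normalise(v) for v in ([3, 2, 0, 0], [2, 3, 0, 1],
                               [0, 0, 3, 2], [0, 1, 2, 3])]

som = train_som(docs, rows=4, cols=4)
for row in u_matrix(som):
    print(" ".join(f"{x:.2f}" for x in row))
print([best_matching_unit(som, d) for d in docs])
```

On a real corpus the input vectors would come from the full-text term statistics rather than hand-written counts, and the printed U-matrix would normally be rendered as a heat map alongside the MDS projection.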