Text Mining in the SOMLib Digital Library System: The Representation of Topics and Genres

Authors:
Andreas Rauber; Dieter Merkl
Affiliations:
Department of Software Technology, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Vienna, Austria. andi@ifs.tuwien.ac.at
Department of Software Technology, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Vienna, Austria. dieter@ifs.tuwien.ac.at
Venue:
Applied Intelligence
Year:
2003

Cited by:

Market segmentation based on hierarchical self-organizing map for markets of multimedia on demand
Expert Systems with Applications: An International Journal

Organizing the OCA: learning faceted subjects from a library of digital books
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Digital web library of a website with document clustering
IBERAMIA'10 Proceedings of the 12th Ibero-American conference on Advances in artificial intelligence

Adding SOMLib capabilities to the greenstone digital library system
ICADL'06 Proceedings of the 9th international conference on Asian Digital Libraries: achievements, Challenges and Opportunities

Double enhancement learning for explicit internal representations: unifying self-enhancement and information enhancement to incorporate information on input variables
Applied Intelligence


Abstract

As the amount of textual information available in electronic form grows, more powerful methods for exploring, searching, and organizing it are needed. This paper presents the SOMLib digital library system, which builds on neural networks to provide text mining capabilities. At its foundation, we use the Self-Organizing Map to provide content-based clustering of documents. By using an extended model, the Growing Hierarchical Self-Organizing Map, we can further detect subject hierarchies in a document collection: the neural network adapts its size and structure automatically during its unsupervised training process to reflect the topical hierarchy. By mining the weight vector structure of the trained maps, our system is able to select keywords describing the various topical clusters. Text mining, however, has to incorporate more than the mere analysis of content: structural and genre information is key to organizing and locating information. Using color-coding techniques, we integrate a structural analysis of documents based on Self-Organizing Maps into the subject-based clustering, relying on metaphor graphics for intuitive visualization. We demonstrate the capabilities of the SOMLib system using collections of articles from various newspapers and magazines.
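
The two content-based steps named in the abstract, clustering documents with a Self-Organizing Map and reading keywords off the trained weight vectors, can be illustrated with a minimal sketch. This is not the SOMLib implementation: the vocabulary, the four toy documents, the 1x2 map size, the learning-rate and neighborhood schedules, and every function name are assumptions made for illustration. Ranking terms by their weight value is a simplification that may differ from the keyword selection the paper uses, and the growing hierarchical model is not covered.

```python
import math
import random


def normalize(vec):
    """Scale a term-count vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def best_matching_unit(weights, x):
    """Return the map coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((w - xi) ** 2 for w, xi in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # assumed shrinking radius
        for x in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def top_terms(weights, unit, vocab, k=2):
    """Simplified labeling: the k terms with the largest weight in a unit."""
    w = weights[unit]
    ranked = sorted(range(len(vocab)), key=lambda i: w[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Hypothetical toy collection: term counts over a four-word vocabulary.
vocab = ["election", "parliament", "goal", "match"]
docs = {
    "politics_1": normalize([3, 2, 0, 0]),
    "politics_2": normalize([2, 3, 0, 0]),
    "sports_1": normalize([0, 0, 3, 2]),
    "sports_2": normalize([0, 0, 2, 3]),
}

weights = train_som(list(docs.values()), rows=1, cols=2)
assignment = {name: best_matching_unit(weights, vec) for name, vec in docs.items()}
for name, unit in assignment.items():
    print(name, "->", unit, top_terms(weights, unit, vocab))
```

In this toy setting, documents that share vocabulary land on the same map unit, and the strongest weight components of that unit name its topic. A real collection would need a much larger map and a proper term-weighting scheme.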