Stemming and lemmatization in the clustering of Finnish text documents

Authors:
Tuomo Korenius;Jorma Laurikkala;Kalervo Järvelin;Martti Juhola
Affiliations:
University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004



Abstract

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which includes splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated on a four-point relevance assessment scale, which was collapsed into a binary scale in two ways: by treating all relevant documents as relevant, and by treating only the highly relevant documents as relevant. Experiments with four hierarchical clustering methods supported the hypothesis. The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover the highly relevant documents in particular better than stemming did. Compared with stemming, lemmatization combined with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.