Classifying Amharic webnews

Authors:
Lars Asker;Atelach Alemu Argaw;Björn Gambäck;Samuel Eyassu Asfeha;Lemma Nigussie Habte
Affiliations:
Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden;Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden;Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway and SICS, Swedish Institute of Computer Science AB, Kista, Sweden;Department of Information Science, Addis Ababa University, Addis Ababa, Ethiopia;Department of Information Science, Addis Ababa University, Addis Ababa, Ethiopia
Venue:
Information Retrieval
Year:
2009

Abstract

We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between the categories is actually contained in the nouns, while stemming did not have much effect on the performance of the classifier.
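
To make the idea of bagging decision trees over different document representations concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the toy documents, their English placeholder tokens, the "N"/"V" tags, and all function names (`represent`, `build_tree`, `bagged_trees`, `predict`, `train`) are hypothetical, and the tree learner is a deliberately small Gini-based splitter on term presence.

```python
import random
from collections import Counter

# Hypothetical toy data: English placeholder tokens with made-up POS tags
# ("N" = noun, "V" = verb). This is NOT the Amharic corpus from the paper.
TAGGED_DOCS = [
    ([("match", "N"), ("goal", "N"), ("won", "V")], "sports"),
    ([("match", "N"), ("team", "N"), ("played", "V")], "sports"),
    ([("match", "N"), ("goal", "N"), ("team", "N")], "sports"),
    ([("vote", "N"), ("law", "N"), ("passed", "V")], "politics"),
    ([("vote", "N"), ("party", "N"), ("argued", "V")], "politics"),
    ([("vote", "N"), ("law", "N"), ("party", "N")], "politics"),
]

def represent(tagged_doc, nouns_only=False):
    """Turn a tagged document into a set of terms (full text or nouns only)."""
    return {tok for tok, tag in tagged_doc if tag == "N" or not nouns_only}

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(docs, labels, depth=0, max_depth=3):
    """Grow a small tree; each node tests presence of one term in a document."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: a class label
    best = None
    for term in sorted(set().union(*docs)):
        yes = [l for d, l in zip(docs, labels) if term in d]
        no = [l for d, l in zip(docs, labels) if term not in d]
        if not yes or not no:
            continue  # term does not split the data
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            best = (score, term)
    if best is None:
        return majority
    term = best[1]
    yes_idx = [i for i, d in enumerate(docs) if term in d]
    no_idx = [i for i, d in enumerate(docs) if term not in d]
    return (
        term,
        build_tree([docs[i] for i in yes_idx], [labels[i] for i in yes_idx],
                   depth + 1, max_depth),
        build_tree([docs[i] for i in no_idx], [labels[i] for i in no_idx],
                   depth + 1, max_depth),
    )

def predict_tree(tree, doc):
    """Walk from the root to a leaf; internal nodes are (term, yes, no) tuples."""
    while isinstance(tree, tuple):
        term, yes, no = tree
        tree = yes if term in doc else no
    return tree

def bagged_trees(docs, labels, n_trees=15, seed=0):
    """Train one tree per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(docs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(build_tree([docs[i] for i in idx], [labels[i] for i in idx]))
    return trees

def predict(trees, doc):
    """Classify a document by majority vote over all trees."""
    return Counter(predict_tree(t, doc) for t in trees).most_common(1)[0][0]

def train(nouns_only):
    """Train a bagged ensemble on the toy data under one representation."""
    docs = [represent(d, nouns_only) for d, _ in TAGGED_DOCS]
    labels = [label for _, label in TAGGED_DOCS]
    return bagged_trees(docs, labels)

if __name__ == "__main__":
    unseen = [("match", "N"), ("goal", "N"), ("team", "N"),
              ("won", "V"), ("played", "V")]
    for nouns_only in (False, True):
        trees = train(nouns_only)
        name = "nouns only" if nouns_only else "full text"
        print(name, "->", predict(trees, represent(unseen, nouns_only)))
```

The same ensemble code is reused for both representations; only `represent` changes, which mirrors the experimental idea of holding the classifier fixed while varying the document representation. A stemmed representation could be added the same way by mapping each token through a stemmer before building the term set, but no stemmer is sketched here because that would require language-specific rules not given in the abstract.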