Topic-based Amharic text summarization with probabilistic latent semantic analysis

Authors:
Eyob Delele Yirdaw;Dejene Ejigu
Affiliations:
Addis Ababa University, Addis Ababa, Ethiopia;Addis Ababa University, Addis Ababa, Ethiopia
Venue:
Proceedings of the International Conference on Management of Emergent Digital EcoSystems
Year:
2012


Abstract

This paper investigates the problem of building a concept-based, single-document text summarization system for Amharic. Because local languages such as Amharic lack extensive linguistic resources, we build the summarizer with topic modeling, a statistical approach, and specifically with probabilistic latent semantic analysis (PLSA). The proposed algorithms are language- and domain-independent and can therefore be applied to other local languages as well. We show that a principled use of the term-by-concept matrix produced by a PLSA model can help generate summaries that capture the main topics of a document. We propose and test six algorithms that explore the use of this matrix. All six share two steps: first, keywords of the document are selected using the term-by-concept matrix; second, the sentences that best contain those keywords are selected for inclusion in the summary. Because the texts in our experiments are news articles, the algorithms always include the first sentence of the document in the summary. Experiments on a corpus of news articles from different categories, at different extraction rates, yield encouraging results.
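The two shared steps can be illustrated with a minimal Python sketch. It is not one of the six algorithms from the paper: the abstract does not say how keywords are ranked or how sentences are scored, so the rules used here (top terms per topic, then counting distinct keywords per sentence) are assumptions. The names `select_keywords` and `summarize`, the toy matrix `term_given_topic`, and the English placeholder tokens are all hypothetical. The term-by-concept matrix is taken as given rather than fitted with PLSA.

```python
from typing import Dict, List


def select_keywords(term_given_topic: Dict[str, List[float]], per_topic: int) -> List[str]:
    """Step 1 (assumed rule): take the `per_topic` highest-weighted terms of each topic."""
    n_topics = len(next(iter(term_given_topic.values())))
    keywords: List[str] = []
    for z in range(n_topics):
        ranked = sorted(term_given_topic, key=lambda w: term_given_topic[w][z], reverse=True)
        for term in ranked[:per_topic]:
            if term not in keywords:
                keywords.append(term)
    return keywords


def summarize(sentences: List[List[str]], keywords: List[str], n_sentences: int) -> List[int]:
    """Step 2 (assumed rule): keep the lead sentence, then the sentences
    containing the most distinct keywords. Returns indices in document order."""
    if not sentences or n_sentences <= 0:
        return []
    keyword_set = set(keywords)
    # Score = number of distinct keywords; ties go to the earlier sentence.
    scores = [(len(keyword_set & set(tokens)), -i) for i, tokens in enumerate(sentences)]
    rest = sorted(range(1, len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = [0] + rest[: n_sentences - 1]  # the first sentence is always included
    return sorted(chosen)


# Toy term-by-concept matrix: each row is a term, each column a topic,
# and entries play the role of P(term | topic). Values are made up.
term_given_topic = {
    "election": [0.40, 0.05],
    "vote": [0.30, 0.05],
    "coffee": [0.05, 0.45],
    "export": [0.05, 0.35],
    "today": [0.20, 0.10],
}

# A tokenized toy document (placeholder English tokens, not Amharic).
sentences = [
    ["officials", "spoke", "today"],
    ["the", "election", "vote", "begins"],
    ["weather", "was", "mild"],
    ["coffee", "export", "rose", "after", "the", "election"],
]

keywords = select_keywords(term_given_topic, per_topic=2)
summary_indices = summarize(sentences, keywords, n_sentences=2)
print(keywords)
print(summary_indices)
```

In this sketch the extraction rate corresponds to `n_sentences` divided by the document length. A real system would first fit the PLSA model to obtain the matrix, and would need Amharic tokenization before the sentence-scoring step.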