Topic-based Amharic text summarization with probabilistic latent semantic analysis

  • Authors:
  • Eyob Delele Yirdaw;Dejene Ejigu

  • Affiliations:
  • Addis Ababa University, Addis Ababa, Ethiopia;Addis Ababa University, Addis Ababa, Ethiopia

  • Venue:
  • Proceedings of the International Conference on Management of Emergent Digital EcoSystems
  • Year:
  • 2012

Abstract

This paper investigates the problem of building a concept-based, single-document Amharic text summarization system. Because local languages like Amharic lack extensive linguistic resources, we propose using a statistical approach, topic modeling, to build our text summarizer. The proposed algorithms are language and domain independent and hence can also be used for other local languages. More specifically, we propose using the topic modeling approach of probabilistic latent semantic analysis (PLSA). We show that a principled use of the term-by-concept matrix that results from a PLSA model can help produce summaries that capture the main topics of a document. We propose and test six algorithms that explore the use of the term-by-concept matrix. All of the algorithms share two steps. In the first step, keywords of the document are selected using the term-by-concept matrix. In the second step, the sentences that best contain the keywords are selected for inclusion in the summary. To take advantage of the kind of text we experiment with (news articles), the algorithms always select the first sentence of the document for inclusion in the summary. Experiments on a corpus of news articles from different categories, at different extraction rates, gave encouraging results.
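
The two common steps described in the abstract can be illustrated with a minimal Python sketch. This is not one of the paper's six algorithms, whose details the abstract does not give. It assumes that a term-by-concept matrix of P(term | topic) values is already available from a fitted PLSA model, that keywords are the top-ranked terms of each topic, and that a sentence is scored by how many distinct keywords it contains. The function names, the `per_topic` and `extraction_rate` parameters, the whitespace tokenization, and the toy English data are all hypothetical.

```python
def select_keywords(term_topic, vocab, per_topic=2):
    """Step 1 (assumed rule): take the top-weighted terms of every topic.

    term_topic[i][k] is P(vocab[i] | topic k), assumed to come from a
    PLSA model fitted elsewhere.
    """
    keywords = set()
    n_topics = len(term_topic[0])
    for k in range(n_topics):
        ranked = sorted(range(len(vocab)),
                        key=lambda i: term_topic[i][k],
                        reverse=True)
        keywords.update(vocab[i] for i in ranked[:per_topic])
    return keywords


def summarize(sentences, keywords, extraction_rate=0.5):
    """Step 2 (assumed rule): keep the sentences containing the most keywords.

    The first sentence is always kept, as the abstract states for news text.
    """
    n_keep = max(1, int(len(sentences) * extraction_rate))
    chosen = {0}  # the first sentence is always included

    def score(i):
        # Number of distinct keywords in sentence i (whitespace tokens).
        return len(keywords & set(sentences[i].lower().split()))

    # Highest score first; ties are broken by document order.
    candidates = sorted(range(1, len(sentences)), key=lambda i: (-score(i), i))
    for i in candidates:
        if len(chosen) >= n_keep:
            break
        chosen.add(i)
    return [sentences[i] for i in sorted(chosen)]


# Toy data with two topics (hypothetical values, English stand-in tokens).
vocab = ["election", "vote", "rain", "flood", "city"]
term_topic = [
    [0.40, 0.05],  # election
    [0.30, 0.05],  # vote
    [0.05, 0.40],  # rain
    [0.05, 0.30],  # flood
    [0.20, 0.20],  # city
]
sentences = [
    "The city council met on Monday",
    "The election vote was delayed",
    "Officials spoke to reporters",
    "Heavy rain caused a flood",
]

keywords = select_keywords(term_topic, vocab, per_topic=2)
summary = summarize(sentences, keywords, extraction_rate=0.5)
```

With an extraction rate of 0.5, the sketch keeps two of the four sentences: the first sentence plus the highest-scoring remaining one. Real Amharic input would need a suitable tokenizer in place of the whitespace split assumed here.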