Dirichlet distribution with centroid model (DDCM) based summarization technique for web document classification

  • Authors:
  • Setu K. Chaturvedi; D. K. Swami; Gulab Singh

  • Affiliations:
  • Technocrats Institute of Technology, Bhopal, MP, India; VNS Institute of Technology, Bhopal, MP, India; Technocrats Institute of Technology, Bhopal, MP, India

  • Venue:
  • COMPUTE '11 Proceedings of the Fourth Annual ACM Bangalore Conference
  • Year:
  • 2011

Abstract

Web document summarization computes a summary for a set of related articles that gives the user a general view of the events they describe. One objective of summarization is that the selected sentences cover the different events in the documents using as few sentences as possible. A Dirichlet Distribution Model (DDM) can break these documents down into different sentences or events. However, to reduce the information shared between them, the sentences of the summary need to be orthogonal to each other, since orthogonal vectors have the lowest possible similarity and correlation. Centroid Value Decomposition is used to obtain orthogonal representations of vectors; by representing sentences as vectors, we can obtain mutually orthogonal sentences in our proposed DDCM. Thus, using DDM we identify the different sentences in the document, and using the Centroid Model we find the words that best represent these sentences. The goal of this paper is to find a minimal number of high-quality features by generating the best summary for web document classification. We conducted experiments with a number of centroid-based summarization approaches and obtained effective classification results. Experimental results show that our proposed DDCM summarization-based classification approach achieves more accurate results than full-text-based classification.
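The idea of preferring sentences that are close to orthogonal can be illustrated with a minimal sketch. This is not the authors' DDCM: it does no Dirichlet modelling and no decomposition, and it assumes plain bag-of-words sentence vectors with a greedy selection rule. The function names (`tokenize`, `cosine`, `summarize`) and the `max_overlap` threshold are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity of two word-count vectors; 0.0 means orthogonal."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(sentences, k=2, max_overlap=0.3):
    """Greedily pick up to k sentences that are close to the centroid of
    the document set but nearly orthogonal to those already chosen.
    (Illustrative assumption, not the method described in the paper.)"""
    vectors = [Counter(tokenize(s)) for s in sentences]

    # The centroid is taken here as the sum of all sentence vectors;
    # scaling it would not change any cosine similarity.
    centroid = Counter()
    for v in vectors:
        centroid.update(v)

    # Rank sentences by how well they represent the whole set.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)

    chosen = []
    for i in ranked:
        # Keep a sentence only if it overlaps little with every chosen one.
        if all(cosine(vectors[i], vectors[j]) <= max_overlap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break

    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


docs = [
    "The flood damaged roads in the city.",
    "The flood damaged many roads in the city centre.",
    "Voters elected a new mayor on Sunday.",
]
summary = summarize(docs, k=2)
```

With this input, the two flood sentences share most of their words, so only one of them can pass the overlap check. The election sentence shares no words with either, which makes it orthogonal to both, so it is always kept.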