Extracting shared topics of multiple documents

  • Authors: Xiang Ji; Hongyuan Zha

  • Affiliations: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA (both authors)

  • Venue: PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year: 2003

Abstract

In this paper, we present a weighted-graph-based method to simultaneously compare the textual content of two or more documents and extract their shared (sub)topics, if any exist. A set of documents is modelled as a set of pairwise weighted bipartite graphs. A generalized mutual reinforcement principle is applied to these graphs to calculate the saliency scores of the sentences in each document. Sentences with high saliency scores are selected, and together they convey the dominant shared topic. If there is more than one shared subtopic among the documents, a spectral min-max cut algorithm can be used to partition a derived sentence similarity graph into several subgraphs. If every document contributes some sentences (nodes) to a subgraph, then the sentences in that subgraph may convey a shared subtopic. The generalized mutual reinforcement principle is then applied to them to verify and extract the shared subtopic.
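
The mutual reinforcement idea can be illustrated with a minimal sketch for the simplest case of two documents. This is not the paper's generalized formulation: it assumes that edge weights are cosine similarities between bag-of-words sentence vectors and that saliency scores are found by alternately propagating scores across the bipartite graph and normalizing, which converges towards the dominant singular vectors of the weight matrix. All function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bipartite_weights(doc_a, doc_b):
    """Weight matrix w[i][j]: similarity of sentence i in A and sentence j in B.

    Assumption: plain whitespace tokenization and raw term counts.
    """
    bags_a = [Counter(s.lower().split()) for s in doc_a]
    bags_b = [Counter(s.lower().split()) for s in doc_b]
    return [[cosine(x, y) for y in bags_b] for x in bags_a]


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v


def mutual_reinforcement(w, iterations=50):
    """Alternately update saliency scores u (doc A) and v (doc B).

    A sentence in A is salient if it is strongly linked to salient
    sentences in B, and vice versa.
    """
    rows, cols = len(w), len(w[0])
    u = normalize([1.0] * rows)
    v = normalize([1.0] * cols)
    for _ in range(iterations):
        u = normalize([sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)])
        v = normalize([sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)])
    return u, v


doc_a = [
    "graph partitioning finds clusters of sentences",
    "the weather was sunny all weekend",
    "spectral methods cut a graph into clusters",
]
doc_b = [
    "we cut the similarity graph into clusters",
    "lunch was served at noon",
]

w = bipartite_weights(doc_a, doc_b)
u, v = mutual_reinforcement(w)
top_a = max(range(len(doc_a)), key=lambda i: u[i])
top_b = max(range(len(doc_b)), key=lambda j: v[j])
print(doc_a[top_a])
print(doc_b[top_b])
```

In this toy input, the highest-scoring sentence in each document is the one about cutting a graph into clusters, which is the topic the two documents share. A real system would likely need stop-word removal and term weighting before the similarities are meaningful.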