Discovery of novel term associations in a document collection

  • Authors:
  • Teemu Hynönen;Sébastien Mahler;Hannu Toivonen

  • Affiliations:
  • Department of Computer Science and HIIT, University of Helsinki, Finland;Department of Computer Science and HIIT, University of Helsinki, Finland;Department of Computer Science and HIIT, University of Helsinki, Finland

  • Venue:
  • Bisociative Knowledge Discovery
  • Year:
  • 2012

Abstract

We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies for organizing and summarizing information. Our goal is to discover term relationships that can be used to construct conceptual maps or so-called BisoNets. The model we propose, tpf-idf-tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf-idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that the user is unlikely to consider novel or interesting. We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf-idf-tpu method can discover novel associations, that these differ from simply taking pairs of tf-idf keywords, and that they better match the subjective associations of a reader.
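
The abstract names the three components but does not give their formulas, so the following is only a minimal sketch of how such a product score could be computed, not the authors' definitions. Everything specific in it is an assumption made for illustration: the function name `score_term_pairs`, sentence-level co-occurrence for tpf, a logarithmic idf over documents that contain both terms, and a crude tpu that compares the pair's document frequency with the value expected if the two terms were independent.

```python
import math
from collections import Counter
from itertools import combinations


def score_term_pairs(docs):
    """Score term pairs per document with a tpf * idf * tpu product.

    docs: list of documents; each document is a list of sentences;
    each sentence is a list of term strings.
    Returns a list with one {(term_a, term_b): score} dict per document.
    All three component formulas below are illustrative assumptions.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing a term
    pair_df = Counter()  # number of documents containing both terms
    for doc in docs:
        doc_terms = sorted({t for sentence in doc for t in sentence})
        term_df.update(doc_terms)
        pair_df.update(combinations(doc_terms, 2))

    all_scores = []
    for doc in docs:
        # Count the sentences of this document in which both terms occur.
        sentence_counts = Counter()
        for sentence in doc:
            sentence_counts.update(combinations(sorted(set(sentence)), 2))

        doc_scores = {}
        for (a, b), count in sentence_counts.items():
            # tpf (assumed): share of the document's sentences with both terms.
            tpf = count / len(doc)
            # idf (assumed): rarity of the pair across the collection.
            idf = math.log(n_docs / pair_df[(a, b)])
            # tpu (assumed): 1.0 if the terms co-occur no more often than
            # independence predicts, smaller if they are positively dependent.
            expected = term_df[a] * term_df[b] / n_docs
            tpu = min(1.0, expected / pair_df[(a, b)])
            doc_scores[(a, b)] = tpf * idf * tpu
        all_scores.append(doc_scores)
    return all_scores
```

With this stand-in, a pair that co-occurs in nearly every document of the collection receives a low idf and a reduced tpu, so it is ranked below a pair that is frequent in one document but otherwise unrelated across the collection. This matches the filtering role the abstract describes for the third component, although the actual measure used in the paper may differ.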