Athena: Mining-Based Interactive Management of Text Databases

  • Authors:
  • Rakesh Agrawal;Roberto J. Bayardo, Jr.;Ramakrishnan Srikant

  • Venue:
  • EDBT '00 Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology
  • Year:
  • 2000

Abstract

We describe Athena, a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines that are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
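
To illustrate the smoothing change mentioned in the abstract, the sketch below shows a small multinomial Naive Bayes text classifier whose word probabilities use Lidstone's law, P(w | c) = (n + λ) / (N + λ|V|), where n is the count of word w in class c, N is the total word count in class c, and |V| is the vocabulary size. Laplace's law is the special case λ = 1. This is not the authors' implementation: the function names (`train`, `word_prob`, `classify`), the default λ = 0.1, and the choice to skip unseen words are assumptions made for illustration, and the abstract's other enhancements (under-weighting long documents, over-weighting author and subject) are not modeled.

```python
import math
from collections import Counter, defaultdict


def train(docs, lam=0.1):
    """Build a model from (label, tokens) pairs.

    lam is the Lidstone smoothing parameter; lam=1.0 gives Laplace's law.
    The default of 0.1 is an illustrative assumption, not a value from the paper.
    """
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "lam": lam}


def word_prob(model, label, word):
    """Lidstone estimate of P(word | label): (n + lam) / (N + lam * |V|)."""
    counts = model["word_counts"][label]
    total = sum(counts.values())
    lam = model["lam"]
    return (counts[word] + lam) / (total + lam * len(model["vocab"]))


def classify(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    n_docs = sum(model["class_docs"].values())
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        score = math.log(n / n_docs)  # log prior
        for w in tokens:
            if w in model["vocab"]:   # assumption: ignore words never seen in training
                score += math.log(word_prob(model, label, w))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With a small λ, a word that never occurs in a class receives far less probability mass than under Laplace's law, so observed counts dominate the estimate. Each pass over the training or test tokens is linear in their number, in keeping with the linear-time requirement stated in the abstract.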