Organizing the OCA: learning faceted subjects from a library of digital books

  • Authors:
  • David Mimno; Andrew McCallum

  • Affiliations:
  • University of Massachusetts Amherst, Amherst, MA; University of Massachusetts Amherst, Amherst, MA

  • Venue:
  • Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2007

Abstract

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections. We train individual topic models for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics on a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
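
The abstract outlines a two-stage pipeline: fit a topic model per book, treating pages as documents, then cluster the resulting topics across books into subject facets. The sketch below is a minimal illustration of that pipeline, not the paper's method: it substitutes standard LDA (scikit-learn) for the paper's DCM-LDA, and the book data, topic counts, and facet counts are hypothetical placeholders.

```python
# Illustrative two-stage pipeline: per-book topic models over pages,
# then clustering of topic-word vectors across books into "facets".
# Standard LDA stands in for DCM-LDA; all parameters are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans


def facet_topics(books, n_topics_per_book=10, n_facets=100):
    """books: list of books, each a list of page strings.
    Returns a facet label for every per-book topic plus the shared vocabulary."""
    # Build one shared vocabulary so topic-word vectors from different
    # books live in the same space and can be clustered together.
    vectorizer = CountVectorizer(stop_words="english", max_features=50000)
    vectorizer.fit(page for book in books for page in book)

    topic_vectors = []
    for pages in books:
        counts = vectorizer.transform(pages)  # each page is a "document"
        lda = LatentDirichletAllocation(
            n_components=n_topics_per_book, random_state=0
        )
        lda.fit(counts)
        # Row-normalize to obtain topic-word probability vectors.
        topic_vectors.append(
            lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        )
    topic_vectors = np.vstack(topic_vectors)

    # Cluster topics across books; each cluster plays the role of a subject facet.
    facets = KMeans(n_clusters=n_facets, random_state=0).fit_predict(topic_vectors)
    return facets, vectorizer.get_feature_names_out()
```

Because each book's model is fit independently, the per-book stage parallelizes trivially across a corpus, which is consistent with the abstract's emphasis on scaling to tens of thousands of books.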