The utility of information extraction in the classification of books

  • Authors:
  • Tom Betts; Maria Milosavljevic; Jon Oberlander

  • Affiliations:
  • School of Informatics, University of Edinburgh, Edinburgh, UK; School of Informatics, University of Edinburgh, Edinburgh, UK; School of Informatics, University of Edinburgh, Edinburgh, UK

  • Venue:
  • ECIR'07: Proceedings of the 29th European Conference on IR Research
  • Year:
  • 2007

Abstract

We describe work on automatically assigning classification labels to books using the Library of Congress Classification scheme. This task is non-trivial due to the volume and variety of books that exist. We explore the utility of Information Extraction (IE) techniques within this text categorisation (TC) task, automatically extracting structured information from the full text of books. Experimental evaluation of performance involves a corpus of books from Project Gutenberg. Results indicate that a classifier which combines methods and tools from IE and TC significantly improves over a state-of-the-art text classifier, achieving a classification performance of F_{β=1} = 0.8099.
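
For readers unfamiliar with the reported metric, the F-measure with β = 1 is the harmonic mean of precision and recall. The following is a minimal sketch of the standard F-beta formula, not the authors' evaluation code. The function name `f_beta` and the input values are illustrative assumptions; the abstract gives no precision or recall figures, and whether the authors averaged the score across classes is not stated here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    With beta = 1 the two quantities are weighted equally (the F1 score).
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero; the score is defined as 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Hypothetical inputs, chosen only to illustrate the calculation.
example_score = f_beta(precision=0.8, recall=0.6)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a high F score such as the one reported requires both precision and recall to be high.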