The utility of information extraction in the classification of books

Authors:
Tom Betts; Maria Milosavljevic; Jon Oberlander
Affiliations:
School of Informatics, University of Edinburgh, Edinburgh, UK; School of Informatics, University of Edinburgh, Edinburgh, UK; School of Informatics, University of Edinburgh, Edinburgh, UK
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Abstract

We describe work on automatically assigning classification labels to books using the Library of Congress Classification scheme. This task is non-trivial due to the volume and variety of books that exist. We explore the utility of Information Extraction (IE) techniques within this text categorisation (TC) task, automatically extracting structured information from the full text of books. Experimental evaluation of performance involves a corpus of books from Project Gutenberg. Results indicate that a classifier which combines methods and tools from IE and TC significantly improves over a state-of-the-art text classifier, achieving a classification performance of F_{β=1} = 0.8099.
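
The reported score is an F-measure with β = 1, the harmonic mean of precision and recall. The sketch below shows only the standard definition of that measure. The function names, the toy labels, and the per-label counting are illustrative assumptions; the abstract does not say how scores were averaged across classes, so this is not a reconstruction of the authors' evaluation.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str], predicted: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Precision and recall for one class label (hypothetical helper)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Made-up toy data: Library of Congress top-level class letters
    # for five books (not taken from the paper's corpus).
    gold = ["P", "P", "Q", "D", "P"]
    predicted = ["P", "Q", "Q", "P", "P"]
    p, r = precision_recall(gold, predicted, "P")
    print(f"precision={p:.3f} recall={r:.3f} F1={f_beta(p, r):.3f}")
```

With β = 1 precision and recall are weighted equally; a larger β would weight recall more heavily.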