The Role of Different Thesauri Terms and Captions in Automated Subject Classification

Authors:
Koraljka Golub
Affiliations:
Lund University, Sweden
Venue:
WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2006

Abstract

The paper explores the degree to which different types of terms in the Engineering Information (Ei) thesaurus and classification scheme influence the performance of automated subject classification. Preferred terms, their synonyms, broader terms, narrower terms, related terms, and captions are examined in combination with a stemmer and a stop-word list. The algorithm performs string-to-string matching between words in the documents to be classified and words in term lists derived from the Ei thesaurus and classification scheme. The data collection used for evaluation consists of some 35,000 abstracts of scientific papers from the Compendex database. A subset of the Ei thesaurus and classification scheme is used, comprising 92 classes at up to five hierarchical levels from General Engineering. The results show that preferred terms perform best, whereas captions perform worst. Stemming improves performance in most cases, whereas the stop-word list has no significant impact.
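
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the scoring or weighting scheme, so the sketch simply counts term occurrences per class. The class labels, the term lists, the stop-word set, and the `naive_stem` suffix stripper are all hypothetical stand-ins chosen for illustration.

```python
import re

# Illustrative stop-word set; NOT the list used in the paper.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to", "is"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, use_stemming=True, use_stop_words=True):
    """Lowercase, split into words, then optionally filter and stem."""
    words = re.findall(r"[a-z]+", text.lower())
    if use_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [naive_stem(w) for w in words]
    return words


def count_occurrences(doc_words, term_words):
    """Count positions where the term's word sequence appears in the document."""
    n = len(term_words)
    if n == 0:
        return 0
    return sum(
        1
        for i in range(len(doc_words) - n + 1)
        if doc_words[i : i + n] == term_words
    )


def classify(document, term_lists, use_stemming=True, use_stop_words=True):
    """Return {class_label: match_count} for classes with at least one match.

    Raw match counts are an assumed scoring rule, used here only to
    make the string-matching idea concrete.
    """
    doc_words = tokenize(document, use_stemming, use_stop_words)
    scores = {}
    for label, terms in term_lists.items():
        total = sum(
            count_occurrences(
                doc_words, tokenize(term, use_stemming, use_stop_words)
            )
            for term in terms
        )
        if total > 0:
            scores[label] = total
    return scores


# Hypothetical term lists; real ones would be derived from the Ei thesaurus.
term_lists = {
    "class_x": ["heat exchangers"],
    "class_y": ["welding"],
}
abstract_text = "Welding of heat exchanger tubes."

with_stemming = classify(abstract_text, term_lists)
without_stemming = classify(abstract_text, term_lists, use_stemming=False)
```

In this toy case the plural term "heat exchangers" matches the singular "heat exchanger" only when stemming is enabled, which shows one way a stemmer can change the result. A similar comparison could be run with term lists restricted to a single term type (for example, only preferred terms or only captions).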