Exploring a new space of features for document classification: figure clustering

Authors:
Nawei Chen;Hagit Shatkay;Dorothea Blostein
Affiliations:
Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada
Venue:
CASCON '06 Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research
Year:
2006

Abstract

Automatic document classification is an important step in organizing and mining documents. Information in documents is often conveyed through both text and images that complement each other, yet typically only the text content forms the basis for the features used in document classification. In this paper, we explore the use of information from figure images to assist in this task. Specifically, we investigate image clustering as a basis for constructing visual words for representing documents. Once such visual words are formed, the standard bag-of-words representation, along with commonly used classifiers such as naïve Bayes, can be used to classify a document. We report results from classifying biomedical documents previously used in the TREC Genomics track, employing the image-based representation. Efforts are ongoing to improve image-based classification and to analyze the relationships between text and images. The goal is to develop a new set of features to supplement current text-based features.
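
The pipeline outlined in the abstract (cluster figure images into visual words, represent each document as a bag of those words, then apply naïve Bayes) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or image features the authors used, so everything concrete here is an assumption made for illustration: plain k-means with a fixed initialization, toy 2-D vectors standing in for figure features, the function names, and the "relevant"/"irrelevant" labels.

```python
import math
from collections import Counter

def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

def kmeans(points, k, iters=20):
    """Plain k-means (an assumed choice). Starts from evenly spaced input points."""
    step = max(1, len(points) // k)
    centroids = [list(p) for p in points[::step][:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def to_visual_words(doc_figures, centroids):
    """Bag of visual words: how many of a document's figures fall in each cluster."""
    return Counter(nearest(f, centroids) for f in doc_figures)

def train_nb(bags, labels, vocab_size):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        docs = [b for b, y in zip(bags, labels) if y == c]
        log_prior[c] = math.log(len(docs) / len(bags))
        counts = Counter()
        for b in docs:
            counts.update(b)
        total = sum(counts.values()) + vocab_size
        log_like[c] = [math.log((counts[w] + 1) / total) for w in range(vocab_size)]
    return log_prior, log_like

def predict_nb(model, bag):
    """Return the class with the highest posterior log-score for one bag."""
    log_prior, log_like = model
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(n * log_like[c][w] for w, n in bag.items()))

# Toy data (hypothetical): each document is a list of 2-D figure feature vectors.
train_docs = [
    [[0, 0], [1, 0], [0, 1]],
    [[1, 1], [0, 0], [10, 10]],
    [[10, 10], [11, 10], [10, 11]],
    [[11, 11], [10, 10], [1, 0]],
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

K = 2  # number of visual words (assumed)
all_figures = [fig for doc in train_docs for fig in doc]
centroids = kmeans(all_figures, K)
model = train_nb([to_visual_words(d, centroids) for d in train_docs], labels, K)

new_doc = [[0.5, 0.5], [0, 1], [10.5, 10]]
print(predict_nb(model, to_visual_words(new_doc, centroids)))
```

In practice the figure vectors would come from an image feature extractor and the number of clusters would be tuned. Because the resulting bags have the same form as word counts, they could in principle sit alongside text-based features, which is the stated goal of supplementing them.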