An Evaluation of Passage-Based Text Categorization

Authors:
Jinsuk Kim; Myoung Ho Kim
Affiliations:
Center for Computational Biology & Bioinformatics, Korea Institute of Science and Technology Information, P.O. Box 122, Yuseong-gu, Daejeon, Republic of Korea 305-600. jinsuk@kisti.re. ...;Department of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea 305-701. mhkim@ ...
Venue:
Journal of Intelligent Information Systems
Year:
2004

Abstract

Research in text categorization has been confined to whole-document-level classification, probably owing to the lack of full-text test collections. However, the full-length documents now available in large quantities have renewed interest in text classification. A document is usually written with an organized structure that presents its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. To reflect the subtopic structure of a document, we propose a new passage-level (passage-based) text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, this model requires two additional steps: passage splitting and category merging. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents range from tens to hundreds of kilobytes, we evaluate the proposed model, in particular the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows perform best on all the test collections used in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.
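
The three steps named in the abstract (passage splitting, per-passage categorization, category merging) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the fixed-size word window as the passage type, the toy keyword scorer standing in for a real trained classifier, and the weighted-average merging rule with a hypothetical position-weight function.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

Scores = Dict[str, float]


def split_into_windows(text: str, size: int = 50) -> List[str]:
    """Passage splitting: cut the text into non-overlapping windows of `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def keyword_scores(passage: str, keywords: Dict[str, Set[str]]) -> Scores:
    """Toy stand-in for a passage classifier (assumption, not the paper's method):
    the score of a category is the fraction of passage words in its keyword set."""
    words = passage.lower().split()
    if not words:
        return {}
    return {cat: sum(w in kws for w in words) / len(words)
            for cat, kws in keywords.items()}


def merge_categories(passage_scores: List[Scores],
                     weights: List[float],
                     threshold: float) -> List[str]:
    """Category merging: weighted average of passage scores per category,
    keeping the categories whose average reaches `threshold`."""
    totals: Dict[str, float] = defaultdict(float)
    for scores, weight in zip(passage_scores, weights):
        for cat, score in scores.items():
            totals[cat] += weight * score
    norm = sum(weights) or 1.0
    return sorted(cat for cat, total in totals.items() if total / norm >= threshold)


def categorize(text: str,
               classify: Callable[[str], Scores],
               size: int = 50,
               threshold: float = 0.4,
               position_weight: Callable[[int, int], float] = lambda i, n: 1.0) -> List[str]:
    """Run the full pipeline: split, classify each passage, merge.
    `position_weight(i, n)` is a hypothetical hook giving the weight of
    passage i out of n, so that passage location can influence merging."""
    passages = split_into_windows(text, size)
    scores = [classify(p) for p in passages]
    weights = [position_weight(i, len(passages)) for i in range(len(passages))]
    return merge_categories(scores, weights, threshold)
```

With uniform weights every passage counts equally. Passing a decreasing `position_weight`, for example `lambda i, n: 1.0 / (i + 1)`, favors early passages, which is one simple way to experiment with the abstract's observation that a passage's contribution depends on its location.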