Preferred Document Classification for a Highly Inflectional/Derivational Language

Authors:
Kyongho Min; William H. Wilson; Yoo-Jin Moon
Venue:
AI '02 Proceedings of the 15th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Year:
2002


Abstract

This paper describes document classification methods for highly inflectional/derivational languages that form monolithic compound noun terms, such as Dutch and Korean. The system has three phases: (1) morphological analysis with the Korean morphological analyzer HAM (Kang, 1993); (2) compound noun phrase analysis of the HAM output, followed by extraction of terms whose syntactic category is noun, name (proper noun), verb, or adjective; and (3) an effective document classification algorithm based on preferred class score heuristics. The paper focuses on comparing document classification methods: a simple heuristic method, and preferred class score heuristics that employ two factors, ICF (inverted class frequency) and IDF (inverted document frequency), with or without term frequency weighting. In addition, the paper describes a simple classification approach that needs no learning algorithm, in contrast to a vector space model with complex training and classification algorithms such as cosine similarity measurement. The experimental results show 95.7% correct classification on the 720 training documents and 63.8% to 71.3% correct classification on 80 randomly chosen test documents, across the various methods.
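The abstract does not give the scoring formulas, so the following is only a minimal sketch of how a preferred class score with ICF or IDF weighting might look. It assumes that terms have already been extracted (the morphological and compound noun analysis phases are not modelled), that ICF is log(number of classes / number of classes containing the term), that IDF is log(number of documents / number of documents containing the term), and that a class score is the sum of these weights over matching terms, optionally multiplied by the term's frequency in the class. The function names `train` and `classify`, the parameters `factor` and `use_tf`, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Collect term statistics from (class_label, list_of_terms) pairs."""
    class_tf = defaultdict(Counter)  # term counts per class
    doc_freq = Counter()             # number of documents containing each term
    for label, terms in labelled_docs:
        class_tf[label].update(terms)
        doc_freq.update(set(terms))
    class_freq = Counter()           # number of classes containing each term
    for counts in class_tf.values():
        class_freq.update(counts.keys())
    return {
        "class_tf": dict(class_tf),
        "doc_freq": doc_freq,
        "class_freq": class_freq,
        "n_docs": len(labelled_docs),
        "n_classes": len(class_tf),
    }


def classify(model, terms, factor="icf", use_tf=True):
    """Return (preferred_class, scores) for a list of extracted terms.

    Assumed weighting, not the paper's exact formula:
      icf(t) = log(n_classes / class_freq[t])
      idf(t) = log(n_docs / doc_freq[t])
    """
    query_terms = set(terms)
    scores = {}
    for label, counts in model["class_tf"].items():
        score = 0.0
        for term in query_terms:
            if term not in counts:
                continue  # term gives no evidence for this class
            if factor == "icf":
                weight = math.log(model["n_classes"] / model["class_freq"][term])
            else:
                weight = math.log(model["n_docs"] / model["doc_freq"][term])
            tf = counts[term] if use_tf else 1
            score += tf * weight
        scores[label] = score
    # The preferred class is the one with the highest score.
    return max(scores, key=scores.get), scores


# Toy data for illustration only (hypothetical, not from the paper).
TRAINING = [
    ("sports", ["football", "team", "goal", "team"]),
    ("sports", ["team", "coach", "match"]),
    ("economy", ["market", "stock", "price"]),
    ("economy", ["stock", "bank", "team"]),
    ("politics", ["election", "vote", "party"]),
]

model = train(TRAINING)
preferred, scores = classify(model, ["stock", "market", "bank"], factor="icf")
```

Under these assumptions, calling `classify` with `factor="idf"` or `use_tf=False` gives the other variants named in the abstract, and no iterative training is involved: `train` only counts term occurrences.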