The nature of statistical learning theory
The nature of statistical learning theory
Distributional clustering of words for text classification
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Document clustering with committees
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Automating the Construction of Internet Portals with Machine Learning
Information Retrieval
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
WWW '03 Proceedings of the 12th international conference on World Wide Web
Automatic document metadata extraction using support vector machines
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Bibliographic attribute extraction from erroneous references based on a statistical model
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Improving Category Specific Web Search by Learning Query Modifications
SAINT '01 Proceedings of the 2001 Symposium on Applications and the Internet (SAINT 2001)
A divisive information theoretic feature clustering algorithm for text classification
The Journal of Machine Learning Research
Nymble: a high-performance learning name-finder
ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Distributional clustering of English words
ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Knowledge-free induction of inflectional morphologies
NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Probabilistic latent semantic analysis
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
Automatic metadata extraction from museum specimen labels
DCMI '08 Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications
Automated template-based metadata extraction architecture
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
WebApps'10 Proceedings of the 2010 USENIX conference on Web application development
A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management
Proceedings of the 27th Annual ACM Symposium on Applied Computing
A comparison of layout based bibliographic metadata extraction techniques
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Hi-index | 0.00 |
Text classification is still an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such representation usually comes with high dimensionality and feature sparseness. Word clustering is an effective approach to reduce feature dimensionality and feature sparseness, and improve text classification performance. This paper introduces a domain Rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and the word orthographic properties. Besides significant dimensionality reduction, such cluster feature representations show a 6.6% absolute improvement on average on classification performance of document header lines and a 8.4% absolute improvement on the overall accuracy of bibliographic fields extraction, in contrast to feature representation just based on the original text words. Our word clustering even outperforms the distributional word clustering in the context of document metadata extraction.