Rule-based word clustering for document metadata extraction

Authors:
Hui Han;Eren Manavoglu;Hongyuan Zha;Kostas Tsioutsiouliklis;C. Lee Giles;Xiangmin Zhang
Affiliations:
Yahoo Inc., Sunnyvale, CA;The Pennsylvania State University, PA;The Pennsylvania State University, PA;Yahoo Inc., Sunnyvale, CA;The Pennsylvania State University, PA;Rutgers University, New Brunswick, NJ
Venue:
Proceedings of the 2005 ACM symposium on Applied computing
Year:
2005

Abstract

Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification typically uses the words of a document's original text as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective way to reduce feature dimensionality and sparseness and to improve text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the orthographic properties of words. Besides significant dimensionality reduction, such cluster feature representations yield a 6.6% absolute improvement on average in the classification performance of document header lines and an 8.4% absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.
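
The abstract does not list the actual rules or databases, so the following is only a minimal sketch of the general idea: each word is replaced by a cluster label taken either from a dictionary lookup or from its orthographic form. The word lists, the label names, and the functions `word_to_cluster` and `line_features` are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical stand-ins for the "domain databases" mentioned in the abstract.
FIRST_NAMES = {"hui", "eren", "kostas"}
US_STATES = {"pa", "ca", "nj"}
AFFILIATION_WORDS = {"university", "inc", "institute"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def word_to_cluster(word: str) -> str:
    """Map one word to a cluster label (assumed rules, for illustration only)."""
    token = word.strip(".,;:()")  # drop surrounding punctuation
    lower = token.lower()

    # Orthographic rules that do not need a dictionary.
    if EMAIL_RE.fullmatch(token):
        return ":email:"
    if re.fullmatch(r"\d{4}", token):
        return ":four_digit:"
    if token.isdigit():
        return ":digits:"

    # Dictionary (domain database) rules.
    if lower in AFFILIATION_WORDS:
        return ":affiliation:"
    if lower in US_STATES:
        return ":state:"
    if lower in FIRST_NAMES:
        return ":first_name:"

    # Fallback orthographic rules for words not found in any dictionary.
    if re.fullmatch(r"[A-Z]", token):
        return ":single_cap:"
    if token[:1].isupper():
        return ":capitalized:"
    return lower  # keep the plain lowercased word as its own feature


def line_features(line: str) -> list[str]:
    """Replace every whitespace-separated word in a line by its cluster label."""
    return [word_to_cluster(w) for w in line.split()]


if __name__ == "__main__":
    print(line_features("Hui Han, Pennsylvania State University, PA 2005"))
```

In this sketch many distinct surface words collapse into a handful of labels, which is the dimensionality reduction the abstract refers to; a classifier would then be trained on these labels instead of the raw words.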