Journal of the American Society for Information Science and Technology
Most research on automated document categorization has concentrated on assigning one or more categories to a whole text. However, new applications, e.g., in the area of the Semantic Web, require richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. We therefore investigate the automatic categorization of text segments of scientific articles with XML markup into 16 topic types from a text type structure schema. A corpus of 47 linguistic articles was provided with XML markup on different annotation layers representing text type structure, logical document structure, and grammatical categories. Six different feature extraction strategies were applied to this corpus and combined in various parametrizations in different classifiers. The aim was to explore the contribution of each type of information, in particular the logical structure features, to classification accuracy. The results suggest that some of the topic types of our hierarchy can be learned successfully, while the features from the logical structure layer had no particular impact on the results.