Text type structure and logical document structure

Authors:
Hagen Langer;Harald Lüngen;Petra Saskia Bayerl
Affiliations:
Justus-Liebig-Universität/Universität Osnabrück;Justus-Liebig-Universität, Gießen, Germany;Justus-Liebig-Universität, Gießen, Germany
Venue:
DiscAnnotation '04 Proceedings of the 2004 ACL Workshop on Discourse Annotation
Year:
2004

Abstract

Most research on automated categorization of documents has concentrated on the assignment of one or more categories to a whole text. However, new applications, e.g., in the area of the Semantic Web, require a richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. Hence, we investigate the automatic categorization of text segments of scientific articles with XML markup into 16 topic types from a text type structure schema. A corpus of 47 linguistic articles was provided with XML markup on different annotation layers representing text type structure, logical document structure, and grammatical categories. Six different feature extraction strategies were applied to this corpus and combined in various parametrizations in different classifiers. The aim was to explore the contribution of each type of information, in particular the logical structure features, to the classification accuracy. The results suggest that some of the topic types of our hierarchy are successfully learnable, while the features from the logical structure layer had no particular impact on the results.
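
To make the setup described in the abstract more concrete, the following is a minimal sketch of segment-level classification over XML-annotated text. It is not the authors' implementation. Several things are assumptions made for illustration: the DocBook-style element names (sect1, title, para), the hypothetical topic attribute used as the gold label, the three invented topic labels (the paper's 16 topic types are not listed here), and the choice of a small Naive Bayes classifier with add-one smoothing. The use_structure flag mimics the idea of switching the logical structure features on and off to examine their contribution.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Toy corpus: element names follow DocBook; the "topic" attribute and its
# values are hypothetical stand-ins for a text type annotation layer.
TRAIN_XML = """
<article>
  <sect1><title>Introduction</title>
    <para topic="background">Previous research on categorization has focused on whole texts.</para>
    <para topic="researchTopic">We investigate the categorization of text segments.</para>
  </sect1>
  <sect1><title>Results</title>
    <para topic="results">The results show that some topic types are learnable.</para>
  </sect1>
</article>
"""

def extract_segments(xml_text, use_structure=True):
    """Return a list of (features, label) pairs, one per <para> segment."""
    root = ET.fromstring(xml_text)
    segments = []
    for sect in root.iter("sect1"):
        title = (sect.findtext("title") or "").strip().lower()
        for position, para in enumerate(sect.findall("para")):
            # Lexical layer: bag of lowercased word tokens.
            words = re.findall(r"[a-z]+", (para.text or "").lower())
            feats = Counter("w=" + w for w in words)
            if use_structure:
                # Logical structure layer: section title and paragraph position.
                feats["s:title=" + title] += 1
                feats["s:pos=" + str(position)] += 1
            segments.append((feats, para.get("topic")))
    return segments

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in examples:
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())

        def score(label):
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.label_counts[label] / total)
            for feat, n in feats.items():
                log_prob += n * math.log((counts[feat] + 1) / denom)
            return log_prob

        return max(self.label_counts, key=score)

if __name__ == "__main__":
    model = NaiveBayes().fit(extract_segments(TRAIN_XML))
    unseen = """<article><sect1><title>Results</title>
      <para>The results are learnable.</para></sect1></article>"""
    feats, _ = extract_segments(unseen)[0]
    print(model.predict(feats))
```

Running the same pipeline with use_structure=False on both the training and the test side gives a lexical-only baseline, which is one simple way to gauge how much the structure features add on a given corpus.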