Logical Structure Recovery in Scholarly Articles with Rich Document Features

Authors:
Min-Yen Kan; Minh-Thang Luong; Thuy Dung Nguyen
Affiliations:
National University of Singapore, Singapore; National University of Singapore, Singapore; National University of Singapore, Singapore
Venue:
International Journal of Digital Library Systems
Year:
2010

Citing 10

Foundations of statistical natural language processing

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Automatic document metadata extraction using support vector machines
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Bibliographic attribute extraction from erroneous references based on a statistical model
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

INFTY: an integrated OCR system for mathematical documents
Proceedings of the 2003 ACM symposium on Document engineering

ArnetMiner: extraction and mining of academic social networks
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Syntactic Detection and Correction of Misrecognitions in Mathematical OCR
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition

Keyphrase extraction in scientific publications
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers

Towards next generation citeseer: a flexible architecture for digital library deployment
ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries

Document logical structure analysis based on perceptive cycles
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems

Cited 4

Linking citations to their bibliographic references
ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Extracting and matching authors and affiliations in scholarly documents
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Recognising document components in XML-based academic articles
Proceedings of the 2013 ACM symposium on Document engineering

Towards machine-actionable modules of a digital mathematics library: the example of DML-CZ
CICM'13 Proceedings of the 2013 international conference on Intelligent Computer Mathematics

Abstract

Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation, and summarization. In this paper, the authors describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text representation of the document, a key aspect of this work is to integrate the use of a richer representation of the document that includes features from optical character recognition (OCR), such as font size and text position. Experiments reveal that using such rich features improves logical structure detection by a significant 9 F1 points over a suitable baseline, motivating the use of richer document representations in other digital library applications.
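
To illustrate what combining raw-text features with layout features might look like, here is a minimal Python sketch. It is not SectLabel's actual feature set: the `Line` record, the feature names, and the use of the median font size as the "body" size are all assumptions made for illustration. The sketch only builds one feature dictionary per line, the kind of per-item input a linear-chain conditional random field toolkit typically consumes; it does not train or run a model.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One text line as a hypothetical OCR/layout step might report it."""
    text: str
    font_size: float  # in points (assumed unit)
    x: float          # left edge of the line on the page
    y: float          # top edge of the line on the page


def line_features(lines, i):
    """Build a feature dict for line i from text cues and layout cues."""
    line = lines[i]
    # Assumption: the median font size approximates the body-text size.
    body_size = median(l.font_size for l in lines)
    left_margin = min(l.x for l in lines)
    tokens = line.text.split()
    feats = {
        # Raw-text features (available without layout information).
        "first_token_lower": tokens[0].lower() if tokens else "",
        "n_tokens": len(tokens),
        "starts_with_digit": bool(tokens) and tokens[0][0].isdigit(),
        # Rich layout features (font size and text position).
        "font_larger_than_body": line.font_size > body_size,
        "font_ratio": round(line.font_size / body_size, 2),
        "indented": line.x > left_margin,
    }
    if i > 0:
        # Context feature: a font change often marks a structural boundary.
        feats["font_changed_from_prev"] = line.font_size != lines[i - 1].font_size
    else:
        feats["is_first_line"] = True
    return feats


def document_features(lines):
    """Feature sequence for a whole document, one dict per line."""
    return [line_features(lines, i) for i in range(len(lines))]


if __name__ == "__main__":
    sample = [
        Line("Logical Structure Recovery", 18.0, 100.0, 50.0),
        Line("1 Introduction", 12.0, 72.0, 200.0),
        Line("Scholarly digital libraries provide analytics.", 10.0, 72.0, 220.0),
        Line("More body text here.", 10.0, 72.0, 232.0),
    ]
    for feats in document_features(sample):
        print(feats)
```

A sequence of such dictionaries, paired with per-line labels such as title, section header, or body text, is the sort of training input a CRF learner would take. Dropping the three layout features gives a rough analogue of the raw-text-only setting the abstract compares against.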