Scholarly digital libraries increasingly provide analytics on the information within documents themselves. This includes information about a document's logical structure, which is of use to downstream components such as search, navigation, and summarization. In this paper, the authors describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text representation of the document, a key aspect of this work is the use of a richer representation of the document that includes features from optical character recognition (OCR), such as font size and text position. Experiments reveal that using such rich features improves logical structure detection by a significant 9 F1 points over a suitable baseline, motivating the use of richer document representations in other digital library applications.
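
To make the idea of a "richer representation" concrete, the following is a minimal sketch of how per-line features might combine raw text cues with layout cues such as font size and text position before being passed to a sequence labeler such as a conditional random field. It is not SectLabel's actual feature set: the `Line` type, the feature names, and the 50-unit position bucket are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical OCR output for one line of a page (an assumption)."""
    text: str
    font_size: float
    x: float  # horizontal position of the line's left edge
    y: float  # vertical position of the line on the page


def line_features(line, body_font_size):
    """Build a feature dict for one line from text and layout cues."""
    tokens = line.text.split()
    first = tokens[0] if tokens else ""
    return {
        # Features available from the raw text alone.
        "first_token_lower": first.lower(),
        "num_tokens": len(tokens),
        "starts_with_digit": first[:1].isdigit(),
        "all_caps": line.text.isupper(),
        # Layout features that need the richer representation.
        "font_larger_than_body": line.font_size > body_font_size,
        "font_size_ratio": round(line.font_size / body_font_size, 2),
        "x_bucket": int(line.x // 50),  # coarse horizontal position
    }


def document_features(lines):
    """Return one feature dict per line, in document order."""
    if not lines:
        return []
    # Use the median font size as a rough stand-in for the body text size.
    sizes = sorted(l.font_size for l in lines)
    body_font_size = sizes[len(sizes) // 2]
    return [line_features(l, body_font_size) for l in lines]


lines = [
    Line("1 Introduction", 14.0, 72.0, 100.0),
    Line("Scholarly digital libraries provide analytics.", 10.0, 72.0, 120.0),
    Line("They also support search.", 10.0, 72.0, 134.0),
]
features = document_features(lines)
```

In this sketch, a heading-like line stands out through `font_larger_than_body` even when its text alone is ambiguous, which is the kind of signal a text-only representation cannot supply. The resulting sequence of dicts could then be handed to a CRF toolkit of choice, with one label per line.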