Automatic extraction of titles from general documents using machine learning

Authors:
Yunhua Hu;Hang Li;Yunbo Cao;Dmitriy Meyerzon;Qinghua Zheng
Affiliations:
Xi'an Jiaotong University, Xi'an, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Corporation, Redmond, WA;Xi'an Jiaotong University, Xi'an, China
Venue:
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Year:
2005

Abstract

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods have mainly addressed title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is feasible. As a case study, we consider extraction from Office documents, including Word and PowerPoint files. In our approach, we annotate titles in sample documents (for Word and PowerPoint separately) and use them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that it mainly uses formatting information, such as font size, as features in the models. It turns out that formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction are 0.810 and 0.837 for Word, and 0.875 and 0.895 for PowerPoint. Other important new findings of this work are that models trained in one domain can be applied to another domain and, more surprisingly, that models trained in one language can be applied to another language. Moreover, using the extracted titles can significantly improve search ranking results in document retrieval.
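The pipeline described in the abstract (annotate titles, turn formatting cues into features, train a model, extract with the trained model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TextLine` record, the three formatting features, the toy documents, and the choice of a plain perceptron as the classifier. The abstract does not state which features beyond font size or which learning algorithm the authors used, so this is not their implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """Hypothetical record for one line of a document with its formatting."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    is_title: bool = False  # annotation, only needed for training data


def features(line: TextLine, max_font: float) -> List[float]:
    """Formatting-only features (assumed set): relative font size, bold, centered, bias."""
    return [
        line.font_size / max_font,
        1.0 if line.bold else 0.0,
        1.0 if line.centered else 0.0,
        1.0,
    ]


def score(weights: List[float], x: List[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(docs: List[List[TextLine]], epochs: int = 20) -> List[float]:
    """Plain perceptron: update the weights whenever a line is misclassified."""
    weights = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        for doc in docs:
            max_font = max(l.font_size for l in doc)
            for line in doc:
                x = features(line, max_font)
                y = 1 if line.is_title else -1
                if y * score(weights, x) <= 0:
                    weights = [w + y * xi for w, xi in zip(weights, x)]
    return weights


def extract_title(doc: List[TextLine], weights: List[float]) -> str:
    """Return the text of the highest-scoring line in the document."""
    max_font = max(l.font_size for l in doc)
    best = max(doc, key=lambda l: score(weights, features(l, max_font)))
    return best.text


# Toy annotated training documents (invented for this sketch).
training_docs = [
    [
        TextLine("Quarterly Sales Review", 24, True, True, is_title=True),
        TextLine("Prepared by the finance team", 12, False, True),
        TextLine("Revenue grew in all regions.", 11, False, False),
    ],
    [
        TextLine("Internal memo", 10, False, False),
        TextLine("Migration Plan for Mail Servers", 18, True, False, is_title=True),
        TextLine("This memo outlines the steps.", 11, False, False),
    ],
]

weights = train_perceptron(training_docs)

# An unseen document without annotations.
new_doc = [
    TextLine("Confidential", 10, False, False),
    TextLine("Onboarding Guide for New Employees", 20, True, True),
    TextLine("Welcome to the company.", 11, False, False),
]
print(extract_title(new_doc, weights))
```

Because the features describe only formatting and never the words themselves, a model of this kind has no direct dependence on vocabulary, which is one plausible reading of why the abstract reports that models transfer across domains and languages.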