In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods have mainly addressed title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is feasible. As a case study, we consider extraction from Office documents, including Word and PowerPoint files. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively), use them as training data to train machine learning models, and perform title extraction with the trained models. Our method is unique in that we mainly use formatting information, such as font size, as features in the models. The use of formatting information turns out to yield quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. Other important new findings in this work are that models trained in one domain can be applied to other domains and, more surprisingly, that models trained in one language can be applied to other languages. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
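The annotate, train, and extract pipeline described above can be illustrated with a minimal sketch. It is not the models or feature set used in this work: the abstract names only font size as an example feature and does not specify the learner. The `Line` record, the bold and centered features, the plain perceptron, and the toy documents are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One text line of a document with hypothetical formatting attributes."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    is_title: bool = False  # annotation label, used only for training


def features(line, max_font_size):
    """Formatting-only features; the exact set is an assumption."""
    return [
        line.font_size / max_font_size,  # font size relative to the largest in the document
        1.0 if line.bold else 0.0,
        1.0 if line.centered else 0.0,
        1.0,                             # bias term
    ]


def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))


def train_perceptron(docs, epochs=20):
    """Plain perceptron over annotated lines (a stand-in learner, not the paper's)."""
    weights = [0.0] * 4
    for _ in range(epochs):
        for doc in docs:
            max_size = max(l.font_size for l in doc)
            for line in doc:
                x = features(line, max_size)
                predicted = score(weights, x) > 0
                if predicted != line.is_title:
                    sign = 1.0 if line.is_title else -1.0
                    weights = [w + sign * v for w, v in zip(weights, x)]
    return weights


def extract_title(doc, weights):
    """Return the text of the highest-scoring line in the document."""
    max_size = max(l.font_size for l in doc)
    best = max(doc, key=lambda l: score(weights, features(l, max_size)))
    return best.text


def precision_recall(predicted, gold):
    """Precision and recall over sets of extracted vs. annotated titles."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotated training documents (invented for illustration).
training_docs = [
    [
        Line("Quarterly Sales Report", 24, True, True, is_title=True),
        Line("Prepared by the finance team", 12, False, True),
        Line("Revenue grew in all regions.", 11, False, False),
    ],
    [
        Line("Project Kickoff", 28, True, False, is_title=True),
        Line("Agenda", 14, True, False),
        Line("Introductions and goals", 12, False, False),
    ],
]

weights = train_perceptron(training_docs)

# An unseen document without labels.
new_doc = [
    Line("Annual Budget Overview", 26, True, True),
    Line("Draft for internal review", 12, False, True),
    Line("This document summarizes planned spending.", 11, False, False),
]
print(extract_title(new_doc, weights))
```

Because the features here describe only formatting and not the words themselves, nothing in the sketch depends on the language of the text, which is consistent with the cross-language finding reported above.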