Automatic document metadata extraction using support vector machines

Authors:
Hui Han;C. Lee Giles;Eren Manavoglu;Hongyuan Zha;Zhenyue Zhang;Edward A. Fox
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;Zhejiang University, Yu-Quan Campus, Hangzhou 310027, P.R. China;Virginia Polytechnic Institute and State University, Blacksburg, VA
Venue:
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Year:
2003

Citing 22
Cited 71

Interoperability for digital libraries worldwide

Communications of the ACM
Making metadata: a study of metadata creation for a mixed physical-digital collection

Proceedings of the third ACM conference on Digital libraries
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
Distributional clustering of words for text classification

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Making large-scale support vector machine learning practical

Advances in kernel methods
An introduction to support Vector Machines: and other kernel-based learning methods

An introduction to support Vector Machines: and other kernel-based learning methods
A statistical learning model of text classification for support vector machines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
CITIDEL: making resources available

Proceedings of the 7th annual conference on Innovation and technology in computer science education
Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Automating the Construction of Internet Portals with Machine Learning

Information Retrieval
Digital Libraries and Autonomous Citation Indexing

Computer
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Maximum Entropy Markov Models for Information Extraction and Segmentation

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
A maximum entropy approach to information extraction from semi-structured and free text

Eighteenth national conference on Artificial intelligence
eBizSearch: an OAI-compliant digital library for eBusiness

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Federating heterogeneous digital libraries by metadata harvesting

Federating heterogeneous digital libraries by metadata harvesting
A divisive information theoretic feature clustering algorithm for text classification

The Journal of Machine Learning Research
Knowledge-free induction of inflectional morphologies

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Use of support vector learning for chunk identification

ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7
Entity extraction without language-specific resources

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Use of support vector machines in extended named entity recognition

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20

eBizSearch: an OAI-compliant digital library for eBusiness

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Panorama: extending digital libraries with topical crawlers

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Generating fuzzy semantic metadata describing spatial relations from images using the R-histogram

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Metaextract: an NLP system to automatically assign metadata

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Two supervised learning approaches for name disambiguation in author citations

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Automatic extraction of titles from general documents using machine learning

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Finding a catalog: generating analytical catalog records from well-structured digital texts

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Developing practical automatic metadata assignment and evaluation tools for internet resources

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Name disambiguation in author citations using a K-way spectral clustering method

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Rule-based word clustering for document metadata extraction

Proceedings of the 2005 ACM symposium on Applied computing
As we may perceive: inferring logical documents from hypertext

Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
Database-inspired search

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Automatic acknowledgement indexing: expanding the semantics of contribution in the CiteSeer digital library

Proceedings of the 3rd international conference on Knowledge capture
A new approach to intranet search based on information extraction

Proceedings of the 14th ACM international conference on Information and knowledge management
Automatic categorization of figures in scientific documents

Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Information extraction from research papers using conditional random fields

Information Processing and Management: an International Journal
Automatic extraction of titles from general documents using machine learning

Information Processing and Management: an International Journal
Efficient optimization of support vector machine learning parameters for unbalanced datasets

Journal of Computational and Applied Mathematics
Reference metadata extraction using a hierarchical knowledge representation framework

Decision Support Systems
Web page title extraction and its application

Information Processing and Management: an International Journal
Integrating data and text mining processes for digital library applications

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
FLUX-CIM: flexible unsupervised extraction of citation metadata

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Towards automatic conceptual personalization tools

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Collaboration over time: characterizing and modeling network evolution

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions

International Journal of Metadata, Semantics and Ontologies
Enabling ontology-based document classification and management in ebXML registries

Proceedings of the 2008 ACM symposium on Applied computing
A metadata generation system for scanned scientific volumes

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Using Data Mining Methods to Predict Personally Identifiable Information in Emails

ADMA '08 Proceedings of the 4th international conference on Advanced Data Mining and Applications
Private Data Discovery for Privacy Compliance in Collaborative Environments

CDVE '08 Proceedings of the 5th international conference on Cooperative Design, Visualization, and Engineering
Automatic Extraction of Pedagogic Metadata from Learning Content

International Journal of Artificial Intelligence in Education
Automatic metadata generation using associative networks

ACM Transactions on Information Systems (TOIS)
Automatic metadata extraction from museum specimen labels

DCMI '08 Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications
CEBBIP: a parser of bibliographic information in chinese electronic books

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Paper Annotation with Learner Models

Proceedings of the 2005 conference on Artificial Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology
A General Learning Method for Automatic Title Extraction from HTML Pages

MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Automated document metadata extraction

Journal of Information Science
Bridging the Gap between Linked Data and the Semantic Desktop

ISWC '09 Proceedings of the 8th International Semantic Web Conference
Automated template-based metadata extraction architecture

ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Using automatic metadata extraction to build a structured syllabus repository

ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Searching for ground truth: a stepping stone in automating genre classification

DELOS'07 Proceedings of the 1st international conference on Digital libraries: research and development
oreChem ChemXSeer: a semantic digital library for chemistry

Proceedings of the 10th annual joint conference on Digital libraries
SeerSuite: developing a scalable and reliable application framework for building digital libraries by crawling the web

WebApps'10 Proceedings of the 2010 USENIX conference on Web application development
Extracting formulaic and free text clinical research articles metadata using conditional random fields

Louhi '10 Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents
Keyphrases extraction from scientific documents: improving machine learning approaches with natural language processing

ICADL'10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries
SciPlore Xtract: extracting titles from scientific PDF documents by analyzing style information

ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Automatic mining of cognitive metadata using fuzzy inference

Proceedings of the 22nd ACM conference on Hypertext and hypermedia
On identifying academic homepages for digital libraries

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
PDFMeat: managing publications on the semantic desktop

Proceedings of the 20th ACM international conference on Information and knowledge management
Efficient name disambiguation for large-scale databases

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
Semantic search in the World News domain using automatically extracted metadata files

Knowledge-Based Systems
Genre classification in automated ingest and appraisal metadata

ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries
Towards next generation citeseer: a flexible architecture for digital library deployment

ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries
iASA: learning to annotate the semantic web

Journal on Data Semantics IV
Header metadata extraction from semi-structured documents using template matching

OTM'06 Proceedings of the 2006 international conference on On the Move to Meaningful Internet Systems: AWeSOMe, CAMS, COMINF, IS, KSinBIT, MIOS-CIAO, MONET - Volume Part II
Automatic metadata mining from multilingual enterprise content

Web Semantics: Science, Services and Agents on the World Wide Web
Data mining with parallel support vector machines for classification

ADVIS'06 Proceedings of the 4th international conference on Advances in Information Systems
On line course organization

ICWL'07 Proceedings of the 6th international conference on Advances in web based learning
Digital Preservation in Grids and Clouds: A Middleware Approach

Journal of Grid Computing
Building a document genre corpus: a profile of the KRYS I corpus

IRSG'08 Proceedings of the 2008 BCS-IRSG conference on Corpus Profiling
AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
A hybrid two-stage approach for discipline-independent canonical representation extraction from references

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Web-based citation parsing, correction and augmentation

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management

Proceedings of the 27th Annual ACM Symposium on Applied Computing
A comparison of layout based bibliographic metadata extraction techniques

Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Self-supervised learning approach for extracting citation information on the web

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Content independent metadata production as a machine learning problem

MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition
Assessing quality dynamics in unsupervised metadata extraction for digital libraries

ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Logical Structure Recovery in Scholarly Articles with Rich Document Features

International Journal of Digital Library Systems
Extracting and matching authors and affiliations in scholarly documents

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Docear's PDF inspector: title extraction from PDF files

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Searching online book documents and analyzing book citations

Proceedings of the 2013 ACM symposium on Document engineering

Quantified Score

Hi-index	0.00

Abstract

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for extracting metadata from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure then refines each line's classification using the class labels predicted for its neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries within each line. We found that discovering and using the structural patterns of the data, together with domain-based word clustering, can improve metadata extraction performance. Appropriate feature normalization also greatly improves classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [17] and EbizSearch [24]. We believe it can be generalized to other digital libraries.
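
The iterative step in the abstract (relabeling each header line using its neighbors' labels from the previous round) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The feature names, the four labels, the hand-set weights, and the sample header lines are all hypothetical. A tiny linear scorer stands in for the trained SVM, and each line gets exactly one label here, whereas the paper allows one or more of 15 classes.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


def line_features(line: str) -> Features:
    """Toy features for one header line (hypothetical, not the paper's feature set)."""
    tokens = line.split()
    n = max(len(tokens), 1)
    return {
        "has_at": 1.0 if "@" in line else 0.0,
        # Ratio in [0, 1], so line length does not dominate the score.
        "cap_ratio": sum(t[:1].isupper() for t in tokens) / n,
        "has_univ": 1.0 if "university" in line.lower() else 0.0,
    }


def add_context(feats: Features, labels: List[str], i: int) -> Features:
    """Add the previous round's labels of the neighboring lines as features."""
    out = dict(feats)
    prev_label = labels[i - 1] if i > 0 else "<start>"
    next_label = labels[i + 1] if i + 1 < len(labels) else "<end>"
    out["prev=" + prev_label] = 1.0
    out["next=" + next_label] = 1.0
    return out


def iterate_labels(lines: List[str],
                   classify: Callable[[Features], str],
                   max_rounds: int = 10) -> List[str]:
    """Classify lines independently, then relabel with neighbor context until stable."""
    base = [line_features(line) for line in lines]
    labels = [classify(f) for f in base]  # round 0: no context features
    for _ in range(max_rounds):
        new = [classify(add_context(f, labels, i)) for i, f in enumerate(base)]
        if new == labels:  # converged: labels unchanged since the previous round
            break
        labels = new
    return labels


# Hand-set weights standing in for a trained classifier (assumption for illustration).
WEIGHTS: Dict[str, Features] = {
    "title": {"cap_ratio": 1.0, "prev=<start>": 2.0},
    "author": {"cap_ratio": 1.0, "prev=title": 2.0},
    "affiliation": {"has_univ": 4.0, "prev=author": 1.0},
    "email": {"has_at": 5.0},
}


def toy_classify(feats: Features) -> str:
    """Return the label whose weighted feature sum is highest (ties go to the first label)."""
    def score(label: str) -> float:
        return sum(w * feats.get(name, 0.0) for name, w in WEIGHTS[label].items())
    return max(WEIGHTS, key=score)


header = [
    "Automatic Document Metadata Extraction",
    "Hui Han",
    "The Pennsylvania State University",
    "author@example.edu",
]
print(iterate_labels(header, toy_classify))
```

In this toy setup the context-free first pass cannot tell the author line from the title line, because both are fully capitalized; the neighbor-label features resolve the ambiguity in the next round, and the loop stops once a round changes no labels.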