Knowledge-based metadata extraction from PostScript files

Authors:
Giovanni Giuffrida;Eddie C. Shek;Jihoon Yang
Affiliations:
HRL Laboratories, LLC, 3011 Malibu Canyon Road, Malibu, CA;HRL Laboratories, LLC, 3011 Malibu Canyon Road, Malibu, CA;HRL Laboratories, LLC, 3011 Malibu Canyon Road, Malibu, CA
Venue:
DL '00 Proceedings of the fifth ACM conference on Digital libraries
Year:
2000



Abstract

Automatic extraction of document metadata is an important task in a world where thousands of documents are just one "click" away, so powerful indices are necessary to support effective retrieval. The upcoming XML standard represents an important step in this direction, as its semistructured representation conveys document metadata together with the text of the document. For example, retrieval of scientific papers by author or affiliation would be a straightforward task if papers were stored in XML. Unfortunately, the vast majority of documents on the web today are available in forms that do not carry additional semantics. Converting existing documents to a semistructured representation is time consuming, and no automatic process can be easily applied. In this paper we discuss a system, based on a novel spatial/visual knowledge principle, for extracting metadata from scientific papers stored as PostScript files. Our system embeds general knowledge about the graphical layout of a scientific paper to guide the metadata extraction process, and it can effectively assist automatic index creation for digital libraries.
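
To make the idea of layout-guided extraction concrete, here is a minimal sketch of one spatial/visual rule: treat the largest-font text in the upper part of the first page as the title. This is an illustration only, not the system described in the paper. The `TextBlock` type, the `guess_title` function, the 40% cutoff, and the 792-point page height are all hypothetical assumptions, and the sketch presumes that positioned text blocks have already been recovered from the PostScript file by some other tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:
    """A run of text with layout attributes (hypothetical structure)."""
    text: str
    x: float          # horizontal offset from the left edge, in points
    y: float          # vertical offset from the TOP of the page, in points
    font_size: float  # font size, in points


def guess_title(blocks: List[TextBlock],
                page_height: float = 792.0,
                top_fraction: float = 0.4) -> Optional[str]:
    """Return the largest-font text found in the upper part of the page.

    Assumed layout rule: a paper's title is set in the largest font
    within the top `top_fraction` of the first page.
    """
    # Keep only blocks that lie in the upper region of the page.
    candidates = [b for b in blocks if b.y <= top_fraction * page_height]
    if not candidates:
        return None
    largest = max(b.font_size for b in candidates)
    # A title may span several lines; join them in reading order.
    title_lines = sorted(
        (b for b in candidates if b.font_size == largest),
        key=lambda b: (b.y, b.x),
    )
    return " ".join(b.text for b in title_lines)


# Made-up first-page layout used only to exercise the rule.
sample_blocks = [
    TextBlock("from PostScript files", 150.0, 102.0, 18.0),
    TextBlock("Knowledge-based metadata extraction", 100.0, 80.0, 18.0),
    TextBlock("Giovanni Giuffrida", 80.0, 140.0, 12.0),
    TextBlock("Abstract", 72.0, 200.0, 10.0),
    TextBlock("1 INTRODUCTION", 72.0, 400.0, 20.0),  # below the cutoff
]

print(guess_title(sample_blocks))
```

A single rule like this is fragile on its own. The abstract suggests that the system combines general knowledge about how scientific papers are laid out, so a realistic implementation would presumably apply several such rules (for authors, affiliations, and so on) rather than rely on font size alone.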