Web documents are available in various forms, most of which carry no additional semantic information. This paper presents a model for general document metadata extraction. The model, which combines segmentation by keywords with pattern matching techniques, was implemented using PHP, MySQL, JavaScript and HTML. The system was tested on 40 randomly selected PDF documents (mainly theses) and evaluated using the standard measures of precision, recall, accuracy and F-measure. The results show that the model is relatively effective for the task of metadata extraction, especially for theses and dissertations. Combining machine learning with these rule-based methods will be explored in future work to improve the results.
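To make the approach concrete, the following is a minimal sketch of keyword-based segmentation followed by pattern matching, with a field-level precision, recall and F-measure check. It is an illustration only: the abstract does not give the actual rules, and the original system was written in PHP, not Python. The keyword list, the regular expressions, the sample text and the function names (`segment_by_keywords`, `extract_metadata`, `evaluate`) are all assumptions introduced here.

```python
import re

# Hypothetical heading keywords used to cut the text into segments.
KEYWORDS = ["abstract", "introduction"]

# Hypothetical plain-text front matter of a thesis, as it might look after PDF-to-text conversion.
SAMPLE = """A Study of Metadata Extraction
by Jane Doe
University of Example, 2009

Abstract
This thesis studies metadata.

Introduction
Web documents are everywhere.
"""


def segment_by_keywords(text, keywords=KEYWORDS):
    """Split text into a 'front' segment plus one segment per keyword heading line."""
    heading = re.compile(
        r"^[ \t]*(%s)[ \t]*$" % "|".join(map(re.escape, keywords)),
        re.IGNORECASE | re.MULTILINE,
    )
    segments = {}
    label, start = "front", 0
    for match in heading.finditer(text):
        # Close the current segment at the heading, then open the next one.
        segments[label] = text[start:match.start()].strip()
        label, start = match.group(1).lower(), match.end()
    segments[label] = text[start:].strip()
    return segments


def extract_metadata(text):
    """Apply simple patterns to the segments (assumed rules, not the paper's)."""
    segments = segment_by_keywords(text)
    front = segments["front"]
    lines = [ln.strip() for ln in front.splitlines() if ln.strip()]
    author = re.search(r"^by\s+(.+)$", front, re.IGNORECASE | re.MULTILINE)
    year = re.search(r"\b(?:19|20)\d{2}\b", front)
    return {
        "title": lines[0] if lines else None,  # assume the first non-empty line is the title
        "author": author.group(1).strip() if author else None,
        "year": year.group(0) if year else None,
        "abstract": segments.get("abstract"),
    }


def evaluate(extracted, gold):
    """Field-level precision, recall and F-measure against a hand-made gold record."""
    tp = sum(1 for k, v in extracted.items() if v is not None and gold.get(k) == v)
    fp = sum(1 for k, v in extracted.items() if v is not None and gold.get(k) != v)
    fn = sum(1 for k, v in gold.items() if v is not None and extracted.get(k) != v)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    record = extract_metadata(SAMPLE)
    gold = dict(record, year="2010")  # pretend the true year differs from the extracted one
    print(record)
    print(evaluate(record, gold))
```

In this sketch a wrongly extracted field counts both as a false positive and as a false negative, which is one common convention; the paper may define its measures differently, and accuracy is left out because it additionally needs a count of correctly empty fields.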