Gathering Metadata from Web-Based Repositories of Historical Publications

Authors:
Ismael Sanz;Rafael Berlanga;María José Aramburu
Affiliations:
-;-;-
Venue:
DEXA '98 Proceedings of the 9th International Workshop on Database and Expert Systems Applications
Year:
1998

Citing 0
Cited 4

Efficient Retrieval of Structured Documents From Object-Relational Databases
DEXA '99 Proceedings of the 10th International Conference on Database and Expert Systems Applications

Intelligent knowledge extraction from the web
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems - Intelligent information systems

Identifying ontology components from digital archives for the semantic web
ACST'06 Proceedings of the 2nd IASTED international conference on Advances in computer science and technology

Contextualizing data warehouses with documents
Decision Support Systems

Abstract

Building digital libraries from Internet-accessible document repositories is a challenging task because of the mismatch between the DBMS-like capabilities expected of a digital library and the schemaless HTML files stored on web sites. To address this problem, we propose a distributed architecture for extracting metadata from WWW documents, especially suited to repositories of historical publications such as newspapers. In this paper we present an information extraction system based on semi-structured data analysis. Starting from several combinations of the HTML styles that abstract the visual characteristics of documents, the proposed system infers the logical structure and attributes of HTML texts. Additionally, by using context-free grammars, the system extracts the overall web structure of the repositories. The system output is a metadata object that contains a concise representation of the corresponding publication and its components.
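
The idea of inferring logical structure from combinations of HTML styles can be illustrated with a minimal sketch. It is not the system described in the paper: the rule table `STYLE_RULES`, the role names, and the helper `extract_metadata` are hypothetical, chosen only to show how a set of active style tags might be mapped to a logical attribute. The context-free grammar step for the overall site structure is not covered here.

```python
from html.parser import HTMLParser

# Hypothetical rule table (an assumption for illustration): each combination
# of active HTML style tags is mapped to a logical role.
STYLE_RULES = {
    ("b", "h1"): "headline",
    ("h1",): "headline",
    ("i",): "byline",
    ("p",): "body",
}

# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class StyleExtractor(HTMLParser):
    """Collects (role, text) pairs by looking at the tags open around each text run."""

    def __init__(self):
        super().__init__()
        self.active = []   # tags currently open, in nesting order
        self.fields = []   # (role, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.active.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; ignore stray end tags.
        for i in range(len(self.active) - 1, -1, -1):
            if self.active[i] == tag:
                del self.active[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # The style signature is the sorted set of tags active around the text.
        signature = tuple(sorted(set(self.active)))
        role = STYLE_RULES.get(signature, "unknown")
        self.fields.append((role, text))


def extract_metadata(html):
    """Return a dict mapping each logical role to the list of texts found for it."""
    parser = StyleExtractor()
    parser.feed(html)
    parser.close()
    metadata = {}
    for role, text in parser.fields:
        metadata.setdefault(role, []).append(text)
    return metadata


if __name__ == "__main__":
    page = (
        "<h1><b>Storm hits coast</b></h1>"
        "<i>By A. Writer</i>"
        "<p>Heavy rain fell overnight.</p>"
    )
    print(extract_metadata(page))
```

In this sketch the returned dictionary plays the role of the metadata object: a compact record of the publication's components keyed by logical role. Text whose style combination has no rule is kept under "unknown" rather than discarded, so missing rules are easy to spot.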