Knowledge management and XML: derivation of synthetic views over semi-structured data

Authors:
Mario Cannataro;Antonella Guzzo;Andrea Pugliese
Affiliations:
ISI Institute, National Research Council, Via P.Bucci, 41/C, Rende, Italy;University of Calabria, Via P.Bucci, Rende, Italy;University of Calabria, Via P.Bucci, Rende, Italy
Venue:
ACM SIGAPP Applied Computing Review
Year:
2002

Citing 9

Principles of database and knowledge-base systems, Vol. I
Data compression. ACM Computing Surveys (CSUR)
Database compression. ACM SIGMOD Record
Semistructured data. PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Data on the Web: from relations to semistructured data and XML
XMill: an efficient compressor for XML data. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data mining: concepts and techniques
Querying Semi-Structured Data. ICDT '97 Proceedings of the 6th International Conference on Database Theory
Visual Web Information Extraction with Lixto. Proceedings of the 27th International Conference on Very Large Data Bases

Cited 1

A framework for abstracting data sources having heterogeneous representation formats. Data & Knowledge Engineering

Abstract

One effect of the expansion of the World Wide Web is the production of a huge amount of data of differing types, available to a large number of different users. Furthermore, the constant progress of computer hardware technology over the past three decades has led to the availability of powerful computers, data collection equipment, and storage media; this technology provides a great boost to the database and information industry by allowing transaction management, information retrieval, and data analysis over massive amounts of heterogeneous data. Moreover, the explosion of the Internet increases the availability of data in different formats: structured (e.g. relational), semistructured (e.g. HTML, XML) and unstructured (e.g. plain text, audio/video) data [2]. Thus, new data management systems able to take advantage of these heterogeneous data, i.e. heterogeneous database systems, are emerging and will play a vital role in the information industry.

Knowledge Management is concerned with the technological, economic and organizational aspects related to (i) the creation, distribution, diversification and sharing of knowledge in complex organizations and (ii) the management of information flows, processes and interactions with external knowledge [8].

Figure 1 summarizes the steps (each represented on a different level of the pyramid) through which knowledge is typically extracted from basic data. The first three levels regard the management of explicit knowledge (i.e. codified, structured or semistructured, and completely available). In particular, starting from the bottom, the first level is concerned with storing and exchanging "factual" knowledge, essentially corresponding to basic data. Technologies used here comprise Databases [17], Data Repositories, Archive Sharing tools and the emerging Extensible Markup Language (XML) [18].

The second level regards "conceptual knowledge" modeling, i.e. the definition of concepts and the relationships among them. Such knowledge is typically represented by means of diagram-based formalisms for both information and related processes [9]. The Unified Modeling Language (UML) is currently one of the most promising modeling languages, oriented towards the specification, implementation and documentation of complex software systems, but also used for modeling company processes not strictly related to software.

The third level is concerned with the organization and integration of information represented according to heterogeneous formalisms. Techniques used here are essentially those concerning Data Warehousing (DW) [10]. Data warehouses are integrated repositories of data extracted from multiple heterogeneous sources, organized under a unified schema and at a single site, in order to facilitate management and decision making. Data Warehousing technologies include data cleaning, data integration, and Online Analytical Processing (OLAP), i.e. analysis techniques based on aggregation and summarization.

The highest level regards Knowledge Discovery, i.e. the uncovering of new, implicit and potentially useful knowledge from large amounts of data. The core phase of knowledge discovery is Data Mining [10], an interactive, iterative, multi-step process, comprising in particular pattern searching and possible refinements on the basis of domain experts' knowledge.

In the context of explicit knowledge management, the Extensible Markup Language finds a natural place. XML is a World Wide Web Consortium (W3C) [13] language for semistructured data [1, 5], designed to allow marking up, transferring and reusing information by means of a standard method for defining document structure and format. Its metalanguage features have typically been used in knowledge management for (i) the semi-automatic production of documents, (ii) the reuse of semistructured information and its integration in heterogeneous systems, and (iii) the creation of knowledge maps for the organization and sharing of information.

The increasing quantity of available semistructured data, and the use of XML for their description and exchange, opens up new research themes related to management and knowledge extraction over XML data. In this scenario, our proposal consists of a system for the synthesization of XML documents that attempts to extract their semantics and to derive synthetic versions of them by means of a multidimensional interpretation [10]. In the context of Knowledge Management, data synthesization can be regarded as a new way of extracting knowledge, by discovering and aggregating (useful) core information and by neglecting (useless) details.
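
As a rough illustration of what aggregating core information and neglecting details might look like on an XML document, the following Python sketch rolls up detailed records along chosen dimensions, in the spirit of OLAP summarization. It is not the system proposed in the paper: the element and attribute names (`sales`, `sale`, `region`, `product`, `amount`), the `synthesize` function and the output element `group` are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input: detailed sales records. Element and attribute names
# are invented for this sketch and do not come from the paper.
DETAILED = """
<sales>
  <sale region="North" product="book" amount="10"/>
  <sale region="North" product="book" amount="5"/>
  <sale region="North" product="pen" amount="2"/>
  <sale region="South" product="book" amount="7"/>
</sales>
"""


def synthesize(xml_text, dimensions, measure):
    """Roll up the root's child elements over the given dimension attributes.

    Returns a new element with one <group> child per distinct combination
    of dimension values, carrying the summed measure. Every attribute that
    is neither a chosen dimension nor the measure is dropped as detail.
    """
    root = ET.fromstring(xml_text)
    totals = defaultdict(float)
    for record in root:
        # Missing dimension values are grouped under the empty string.
        key = tuple(record.get(d, "") for d in dimensions)
        totals[key] += float(record.get(measure, "0"))

    view = ET.Element(root.tag, {"view": "synthetic"})
    for key in sorted(totals):
        attrs = dict(zip(dimensions, key))
        attrs[measure] = str(totals[key])
        ET.SubElement(view, "group", attrs)
    return view


# Synthetic view by region only: product-level detail is neglected.
summary = synthesize(DETAILED, dimensions=["region"], measure="amount")
print(ET.tostring(summary, encoding="unicode"))
```

Passing `dimensions=["region", "product"]` instead yields a finer-grained view; choosing fewer dimensions gives a more synthetic document at the cost of detail.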