The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques

Authors:
Henner Graubitz;Myra Spiliopoulou;Karsten Winkler
Venue:
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Year:
2001

Abstract

Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them and deriving a flat XML DTD for the archive. DIAsDEM focuses on archives characterized by a peculiar terminology and by an implicit structure such as court filings and company reports. In the knowledge discovery phase, text units are iteratively clustered by similarity of their content. Each iteration outputs clusters satisfying a set of quality criteria. Text units contained in these clusters are tagged with semi-automatically determined cluster labels and XML tags respectively. Additionally, extracted named entities (e.g., persons) serve as attributes of XML tags. We apply the framework in a case study on the German Commercial Register.
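
The tagging and DTD-derivation steps named in the abstract can be illustrated with a minimal sketch. It is not the DIAsDEM implementation. It assumes the clustering has already been done and that each text unit arrives with a cluster label (or `None` if it fell outside every acceptable cluster) and a dictionary of extracted named entities. The function names, the `Entry` root element, the labels, and the sample sentences are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def tag_text_unit(text, label, entities=None):
    """Wrap one text unit in an XML element named after its cluster label.

    Named entities (hypothetical name -> value pairs) become attributes.
    """
    element = ET.Element(label, attrib=dict(entities or {}))
    element.text = text
    return element


def build_document(root_name, tagged_units):
    """Assemble tagged and untagged text units into one flat XML document."""
    root = ET.Element(root_name)
    previous = None
    for text, label, entities in tagged_units:
        if label is None:
            # Units without an acceptable cluster stay as untagged text.
            if previous is None:
                root.text = (root.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
        else:
            previous = tag_text_unit(text, label, entities)
            root.append(previous)
    return root


def derive_flat_dtd(root_name, tagged_units):
    """Derive a flat DTD: the root has mixed content over all observed tags."""
    attributes = {}
    for _text, label, entities in tagged_units:
        if label is not None:
            attributes.setdefault(label, set()).update(entities or {})
    tags = sorted(attributes)
    if tags:
        lines = [f"<!ELEMENT {root_name} (#PCDATA | {' | '.join(tags)})*>"]
    else:
        lines = [f"<!ELEMENT {root_name} (#PCDATA)>"]
    for tag in tags:
        lines.append(f"<!ELEMENT {tag} (#PCDATA)>")
        for attr in sorted(attributes[tag]):
            lines.append(f"<!ATTLIST {tag} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


# Invented sample data: (text unit, cluster label or None, named entities).
units = [
    ("Founded by Hans Mustermann.", "Founding", {"Person": "Hans Mustermann"}),
    (" Other remarks. ", None, None),
    ("Share capital: 50000 DM.", "ShareCapital", {"Amount": "50000"}),
]

print(ET.tostring(build_document("Entry", units), encoding="unicode"))
print(derive_flat_dtd("Entry", units))
```

The DTD is flat in the sense that every tag is a direct child of the root and holds only text. All attributes are declared optional here because, in this sketch, an entity may be missing from any given text unit.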