The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques

  • Authors:
  • Henner Graubitz;Myra Spiliopoulou;Karsten Winkler

  • Venue:
  • ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
  • Year:
  • 2001

Abstract

Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them, and deriving a flat XML DTD for the archive. DIAsDEM focuses on archives characterized by a peculiar terminology and by an implicit structure, such as court filings and company reports. In the knowledge discovery phase, text units are iteratively clustered by similarity of their content. Each iteration outputs clusters satisfying a set of quality criteria. Text units contained in these clusters are tagged with semi-automatically determined cluster labels and XML tags, respectively. Additionally, extracted named entities (e.g., persons) serve as attributes of XML tags. We apply the framework in a case study on the German Commercial Register.
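
The pipeline outlined in the abstract (iterative clustering of text units, a quality check per iteration, tagging, named entities as attributes, and a flat DTD) can be illustrated with a small sketch. This is not the DIAsDEM implementation: the Jaccard similarity, the greedy clustering, the minimum-cluster-size criterion, the fully automatic labels, the regex-based entity extraction, the sample sentences, and all function names are assumptions chosen only to keep the example self-contained. Attribute declarations (ATTLIST) are left out of the DTD for brevity.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

def tokens(unit):
    """Set of lowercased words with 4+ letters (a stand-in for real preprocessing)."""
    return frozenset(re.findall(r"[a-z]{4,}", unit.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(units, threshold):
    """Greedy clustering: a unit joins the first cluster whose seed is similar enough."""
    clusters = []
    for u in units:
        for c in clusters:
            if jaccard(tokens(u), tokens(c[0])) >= threshold:
                c.append(u)
                break
        else:
            clusters.append([u])
    return clusters

def label(cluster_units):
    """Assumed automatic label: most frequent token, ties broken alphabetically."""
    counts = Counter(t for u in cluster_units for t in tokens(u))
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
    return best.capitalize()

def entities(unit):
    """Toy named-entity extraction with two hypothetical patterns."""
    found = {}
    m = re.search(r"\b(?:Mr|Ms)\. ([A-Z][a-z]+)", unit)
    if m:
        found["Person"] = m.group(1)
    m = re.search(r"\b(\d+) EUR\b", unit)
    if m:
        found["Amount"] = m.group(1)
    return found

def tag_archive(units, thresholds=(0.5, 0.3), min_size=2):
    """Each iteration keeps clusters that pass the quality check; the rest are retried."""
    tagged, remaining = {}, list(units)
    for th in thresholds:
        leftover = []
        for c in cluster(remaining, th):
            if len(c) >= min_size:  # assumed quality criterion
                tag = label(c)
                for u in c:
                    tagged[u] = tag
            else:
                leftover.extend(c)
        remaining = leftover
    return tagged, remaining

def to_xml(unit, tag):
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in entities(unit).items())
    return f"<{tag}{attrs}>{escape(unit)}</{tag}>"

def flat_dtd(tags, root="Archive"):
    """Flat DTD: the root mixes text with the discovered tags; no nesting."""
    names = sorted(set(tags))
    content = "(#PCDATA | " + " | ".join(names) + ")*" if names else "(#PCDATA)"
    lines = [f"<!ELEMENT {root} {content}>"]
    lines += [f"<!ELEMENT {n} (#PCDATA)>" for n in names]
    return "\n".join(lines)

units = [
    "The share capital totals 50000 EUR.",
    "The share capital totals 25000 EUR.",
    "Mr. Schmidt is appointed managing director.",
    "Ms. Weber is appointed managing director.",
    "The company has moved.",
]
tagged, untagged = tag_archive(units)
for u in units:
    print(to_xml(u, tagged[u]) if u in tagged else u)
print(flat_dtd(tagged.values()))
```

In this sketch, text units that never end up in an acceptable cluster stay untagged, which is why the root element of the DTD allows plain text alongside the tagged elements.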