Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Building and Exploiting Ad Hoc Concept Hierarchies for Web Log Analysis
DaWaK 2000 Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery
Expanding the taxonomies of bibliographic archives with persistent long-term themes
Proceedings of the 2006 ACM symposium on Applied computing
RELFIN – topic discovery for ontology enhancement and annotation
ESWC'05 Proceedings of the Second European conference on The Semantic Web: research and Applications
Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them, and deriving a flat XML DTD for the archive. DIAsDEM focuses on archives characterized by a peculiar terminology and by an implicit structure, such as court filings and company reports. In the knowledge discovery phase, text units are iteratively clustered by similarity of their content. Each iteration outputs clusters satisfying a set of quality criteria. Text units contained in these clusters are tagged with semi-automatically determined cluster labels and XML tags, respectively. Additionally, extracted named entities (e.g., persons) serve as attributes of XML tags. We apply the framework in a case study on the German Commercial Register.
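The tagging and DTD-derivation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the DIAsDEM implementation: it assumes clustering and named-entity extraction have already been done, and the function names (`tag_unit`, `derive_flat_dtd`), the root element `Entry`, the label `Representation`, the attribute `Person`, and the sample sentences are all hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def tag_unit(text, label=None, entities=None):
    """Wrap one text unit in an XML tag named after its cluster label.

    Units without a label (assumed here to belong to no acceptable
    cluster) are returned as escaped plain text.
    """
    if label is None:
        return escape(text)
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in sorted((entities or {}).items())
    )
    return f"<{label}{attrs}>{escape(text)}</{label}>"


def derive_flat_dtd(root, units):
    """Derive a flat DTD: the root mixes text with every observed tag."""
    tags = {}
    for _text, label, entities in units:
        if label is not None:
            tags.setdefault(label, set()).update(entities or {})
    names = sorted(tags)
    if names:
        lines = [f"<!ELEMENT {root} (#PCDATA | {' | '.join(names)})*>"]
    else:
        lines = [f"<!ELEMENT {root} (#PCDATA)>"]
    for name in names:
        lines.append(f"<!ELEMENT {name} (#PCDATA)>")
        for attr in sorted(tags[name]):
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


# Hypothetical text units: (sentence, cluster label, named entities).
units = [
    ("The company is represented by Jane Doe.", "Representation",
     {"Person": "Jane Doe"}),
    ("Further remarks follow.", None, None),
]

document = "<Entry>" + " ".join(tag_unit(*u) for u in units) + "</Entry>"
dtd = derive_flat_dtd("Entry", units)
```

The DTD is "flat" in the sense that every derived tag is a direct child of the root and contains only text, so untagged units can remain as plain character data between tagged ones.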