Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD

Authors:
Karsten Winkler; Myra Spiliopoulou
Affiliations:
-;-
Venue:
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2002

Abstract

Domain-specific documents often share an inherent, though undocumented, structure. This structure should be made explicit to facilitate efficient, structure-based search in archives as well as information integration. Inferring a semantically structured XML DTD for an archive and subsequently transforming its texts into XML documents is a promising method to reach these objectives. Based on the KDD-driven DIAsDEM framework, we propose a new method to derive an archive-specific structured XML document type definition (DTD). Our approach utilizes association rule discovery and sequence mining techniques to structure a previously derived flat, i.e., unstructured, DTD. We introduce the notion of a probabilistic DTD, which is derived by discovering associations among XML tags and frequent sequences of XML tags.
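
To make the idea of tag-level statistics more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes each text has already been annotated with a flat sequence of XML tags, and it computes three simple quantities that a probabilistic DTD could plausibly record: how often a tag occurs, how often one tag co-occurs with another, and how often one tag is followed by another. The tag names, the toy archive, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy archive: one tag sequence per annotated text.
docs = [
    ["BusinessPurpose", "ShareCapital", "ManagingDirector"],
    ["BusinessPurpose", "ManagingDirector"],
    ["ShareCapital", "BusinessPurpose"],
    ["BusinessPurpose", "ShareCapital"],
]


def tag_support(documents):
    """Fraction of documents in which each tag occurs at least once."""
    n = len(documents)
    counts = Counter(tag for tags in documents for tag in set(tags))
    return {tag: count / n for tag, count in counts.items()}


def association_confidence(documents, a, b):
    """Among documents containing tag a, the fraction that also contain tag b."""
    with_a = [tags for tags in documents if a in tags]
    if not with_a:
        return 0.0
    return sum(b in tags for tags in with_a) / len(with_a)


def sequence_confidence(documents, a, b):
    """Among documents containing tag a, the fraction in which tag b
    appears somewhere after the first occurrence of a."""
    with_a = [tags for tags in documents if a in tags]
    if not with_a:
        return 0.0
    follows = sum(b in tags[tags.index(a) + 1:] for tags in with_a)
    return follows / len(with_a)


support = tag_support(docs)
print(support["ShareCapital"])
print(association_confidence(docs, "ShareCapital", "BusinessPurpose"))
print(sequence_confidence(docs, "BusinessPurpose", "ShareCapital"))
```

In this sketch the order-independent measure corresponds to association rule confidence, while the order-dependent one is a simplified stand-in for frequent sequence mining. How the paper itself defines and combines these statistics is not stated in the abstract.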