An information-theoretic based model for large-scale contextual text processing

  • Authors:
  • Patrick Perrin; Fred Petry

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 1999

Abstract

We propose a supervised, systematic approach for knowledge extraction from very large sets of full texts, such as those retrieved by search engines over the World Wide Web. The proposed model is entirely data-driven and relies only on the actual content of the texts in the selected corpus. We suggest that the relative distance between terms within a text provides sufficient information about its structure and relevant content, and therefore serves as a good basis for representing text content in knowledge-elicitation tasks. We demonstrate qualitatively that useful text structure and content can be extracted systematically by collocational lexical analysis, without the need to encode any supplemental sources of knowledge. We present an algorithm that systematically recognizes the most relevant facts in the texts and labels them with their overall theme, as dictated by local contextual information. The algorithm exploits domain-independent lexical frequencies and mutual information measures to find the relevant contextual units in the texts. We report results from experiments on a real-world textual database of psychiatric evaluation reports.
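
The abstract does not give the authors' exact formulation, but the combination of lexical frequencies and mutual information it describes is commonly realized as windowed pointwise mutual information (PMI) over co-occurring term pairs. The sketch below is illustrative only; the function name `windowed_pmi`, the window size, and the frequency cutoff are assumptions, not details from the paper.

```python
import math
from collections import Counter

def windowed_pmi(tokens, window=5, min_count=2):
    """Score co-occurring term pairs by pointwise mutual information.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where p(x, y) is
    estimated from co-occurrence within a fixed-size sliding window.
    High-PMI pairs are candidate collocations / contextual units.
    """
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Count each pair of terms that fall within the same window.
        for v in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((w, v)))] += 1

    n = len(tokens)
    scores = {}
    for (x, y), c in pairs.items():
        if c < min_count:
            continue  # skip unreliable low-frequency pairs
        p_xy = c / n  # rough normalization by corpus length
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy usage: highest-PMI pairs in a short token stream.
text = "patient reports severe anxiety patient reports mild anxiety".split()
for pair, s in sorted(windowed_pmi(text).items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 3))
```

Because the measure is purely distributional, it needs no external lexicon or encoded domain knowledge, which is consistent with the data-driven character the abstract claims for the model.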