We propose an unsupervised, systematic approach for knowledge extraction from very large sets of full texts, such as those retrieved by search engines over the World Wide Web. The proposed model is entirely data-driven and relies only on the actual content of the texts in the selected corpus. We suggest that the relative distance of terms within a text gives sufficient information about its structure and its relevant content, and is therefore a good candidate for representing the text content effectively in knowledge elicitation tasks. We demonstrate qualitatively that useful text structure and content can be extracted systematically by collocational lexical analysis, without the need to encode any supplemental sources of knowledge. We present an algorithm that systematically recognizes the most relevant facts in the texts and labels them by their overall theme, as dictated by local contextual information. It exploits domain-independent lexical frequencies and mutual information measures to find the relevant contextual units in the texts. We report results from experiments on a real-world textual database of psychiatric evaluation reports.
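To make the idea of scoring collocations by lexical frequency and mutual information more concrete, the following is a minimal sketch and not the algorithm of the paper itself. It assumes pointwise mutual information computed over ordered term pairs that co-occur within a fixed token window. The function name `windowed_pmi`, the `window` parameter, and the toy token list are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def windowed_pmi(tokens, window=3):
    """Score ordered term pairs by pointwise mutual information (PMI).

    Assumption: two terms "co-occur" when the second appears at most
    `window` tokens after the first. The source text does not specify
    the exact measure, so this is only one plausible reading.
    """
    n = len(tokens)
    unigram = Counter(tokens)

    # Count ordered pairs (left, right) within the distance window.
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1

    total_pairs = sum(pairs.values())
    scores = {}
    for (left, right), count in pairs.items():
        p_pair = count / total_pairs
        p_left = unigram[left] / n
        p_right = unigram[right] / n
        # PMI = log2( P(left, right) / (P(left) * P(right)) )
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# Invented toy input, not taken from the corpus described above.
toy_tokens = "patient reports severe anxiety patient denies severe anxiety".split()
toy_scores = windowed_pmi(toy_tokens, window=1)
for pair, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A small window keeps only near-adjacent collocations, while a larger one captures looser contextual associations. On small samples PMI tends to overrate rare pairs, so in practice a minimum frequency threshold would probably be applied before ranking.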