A knowledge-driven approach to biomedical document conceptualization

Authors:
Hai-Tao Zheng; Charles Borchert; Yong Jiang
Affiliations:
Tsinghua-Southampton Web Science Laboratory at Shenzhen, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China; Biomedical Knowledge Engineering Laboratory, College of Dentistry, Seoul National University, Seoul, Republic of Korea; Tsinghua-Southampton Web Science Laboratory at Shenzhen, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
Venue:
Artificial Intelligence in Medicine
Year:
2010

Abstract

Objective: Biomedical document conceptualization is the process of clustering biomedical documents based on domain knowledge represented as ontologies. The result of this process is a representation of the documents as a set of key concepts and the relationships among them. Most clustering methods cluster documents based on invariant (fixed) domain knowledge. The objective of this work is to develop an effective method for clustering biomedical documents based on various user-specified ontologies, so that users can exploit the concept structures of documents more effectively.

Methods: We develop a flexible framework that allows users to specify the knowledge bases in the form of ontologies. Based on the user-specified ontologies, we develop a key concept induction algorithm, which uses latent semantic analysis to identify key concepts and cluster documents. We also develop a corpus-related ontology generation algorithm to generate the concept structures of documents.

Results: We evaluate the proposed method and five other clustering algorithms on two biomedical datasets. The proposed method outperforms the five other algorithms in terms of key concept identification. On the first biomedical dataset, our method achieves F-measure values of 0.7294 and 0.5294 based on the MeSH ontology and the Gene Ontology (GO), respectively. On the second biomedical dataset, it achieves F-measure values of 0.6751 and 0.6746 based on the MeSH ontology and GO, respectively. On both datasets, these results outperform the five other algorithms in terms of F-measure. The corpus-related ontologies generated from the MeSH ontology and GO show informative conceptual structures.

Conclusions: The proposed method enables users to specify the domain knowledge used to exploit the conceptual structures of biomedical document collections. In addition, the proposed method is able to extract key concepts and cluster documents with relatively high precision.
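The results above are reported as F-measure values for key concept identification. The abstract does not define how the F-measure is computed, so the following is only a minimal sketch of the standard set-based definition (the harmonic mean of precision and recall). The function name `f_measure`, the `beta` parameter, and the concept labels are hypothetical and chosen for illustration; they are not taken from the paper or its datasets.

```python
def f_measure(predicted, gold, beta=1.0):
    """Return (precision, recall, F) for two collections of concept labels.

    Assumption: a predicted key concept counts as correct only if it
    matches a gold-standard concept exactly. The paper may use a
    different matching rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # Weighted harmonic mean; beta = 1 gives the balanced F1 score.
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical example: four induced key concepts, three gold concepts.
predicted_concepts = {"Neoplasms", "Apoptosis", "Insulin", "Mutation"}
gold_concepts = {"Neoplasms", "Apoptosis", "Diabetes Mellitus"}

precision, recall, f1 = f_measure(predicted_concepts, gold_concepts)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
# prints: precision=0.5000 recall=0.6667 F1=0.5714
```

In this toy case two of the four predicted concepts are correct, so precision is 0.5 and recall is 2/3. A score such as the reported 0.7294 would be read the same way: as a single number that balances how many identified concepts are correct against how many of the expected concepts were found.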