Manual annotation of disease descriptions with UMLS Metathesaurus concepts yields structured information that can serve as a high-quality, reliable data source for the research community. Although such annotation has progressed, only a limited range of diseases has been covered, largely because of the human resources required. Because annotating text is time-consuming and the variation in disease descriptions makes the task difficult, systems that automatically map biomedical sentences to an ontology are useful. Our goal is to automatically map biomedical sentences to UMLS disease concepts. Previous approaches, including statistical methods, still perform worse than simple dictionary-based matching. As an alternative to both, we show how the mapping problem can be treated as a document retrieval problem: from this perspective, the mapping integrates information from a language model, document frequency, and distance measures. Our approach is a three-step method using information retrieval and clustering. In the first step, we retrieve the 10 top-ranked relevant UMLS concept entries using an integrated information retrieval model. In the second step, we cluster the retrieved concept entries according to shared words. In the final step, we select one answer for each cluster using a threshold. Our experiments are promising: on typical data, the method achieves a precision of 73.28%, a recall of 77.51%, and an F-measure of 75.34%, significantly outperforming previous methods based on statistics, dictionaries, and MetaMap by 6.95 to 9.95 percent.
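The three steps (retrieve, cluster, select) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring function is a simple IDF-weighted word overlap standing in for the integrated retrieval model, the clustering is a greedy single pass, and the concept identifiers, the toy concept table, and the threshold value of 0.5 are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy concept table; the IDs are placeholders, not real UMLS CUIs.
TOY_CONCEPTS = {
    "D001": "diabetes mellitus",
    "D002": "type 2 diabetes mellitus",
    "D003": "hypertension",
    "D004": "pulmonary hypertension",
    "D005": "breast cancer",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def retrieve(sentence, concepts, k=10):
    """Step 1: rank concept entries against the sentence and keep the top k.

    The score is the IDF-weighted fraction of an entry's words found in the
    sentence. It is a stand-in for the integrated model, not a reproduction.
    """
    n = len(concepts)
    doc_freq = Counter(w for name in concepts.values() for w in set(tokenize(name)))
    query = set(tokenize(sentence))
    scored = []
    for cui, name in concepts.items():
        words = sorted(set(tokenize(name)))
        weights = [(w, math.log(1 + n / doc_freq[w])) for w in words]
        matched = sum(x for w, x in weights if w in query)
        if matched > 0:
            total = sum(x for _, x in weights)
            # Ties are broken in favour of longer entries, then by ID.
            scored.append((-matched / total, -len(words), cui))
    scored.sort()
    return [(-neg_score, cui) for neg_score, _, cui in scored[:k]]


def cluster(ranked, concepts):
    """Step 2: greedily group retrieved entries that share at least one word."""
    clusters = []
    for score, cui in ranked:
        words = set(tokenize(concepts[cui]))
        for c in clusters:
            if c["words"] & words:
                c["words"] |= words
                c["members"].append((score, cui))
                break
        else:
            clusters.append({"words": words, "members": [(score, cui)]})
    return clusters


def select(clusters, threshold=0.5):
    """Step 3: keep the best entry of each cluster if it clears the threshold."""
    answers = []
    for c in clusters:
        score, cui = c["members"][0]  # members arrive in ranked order
        if score >= threshold:
            answers.append(cui)
    return answers


def map_sentence(sentence, concepts, k=10, threshold=0.5):
    """Run the three steps in sequence and return the selected concept IDs."""
    return select(cluster(retrieve(sentence, concepts, k), concepts), threshold)


if __name__ == "__main__":
    text = "patients with type 2 diabetes mellitus often develop hypertension"
    print(map_sentence(text, TOY_CONCEPTS))
```

On the toy table, the example sentence yields two clusters (one around diabetes, one around hypertension), and the sketch returns one concept for each. The greedy pass never merges two existing clusters, so a real system would likely need a more careful grouping rule and a threshold tuned on annotated data.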