Mining of Protein Subcellular Localizations based on a Syntactic Dependency Tree and WordNet

  • Authors:
  • Mi-Young Kim

  • Affiliations:
  • School of Computer Science and Engineering, Sungshin Women's University, Dongseon-dong, Seongbuk-gu Seoul, 136-742, South Korea, miykim@sungshin.ac.kr

  • Venue:
  • Proceedings of the 2008 conference on Knowledge-Based Software Engineering: Proceedings of the Eighth Joint Conference on Knowledge-Based Software Engineering
  • Year:
  • 2008

Abstract

Detection of protein subcellular localization is essential in information extraction from biomolecular texts. There has been a great deal of research on text mining to detect protein subcellular localization information in documents. Previous studies have argued that linguistic information is useful for identifying the subcellular localizations of proteins. However, previous systems for detecting protein subcellular localizations have used only shallow syntactic parsers and have shown poor recall. Thus, there remains a need to apply deeper linguistic knowledge to the analysis of text. To improve performance in detecting protein subcellular localization information, this paper proposes a method based on a syntactic dependency tree and WordNet. From the syntactic dependency tree, we construct syntactic paths from a protein to each of its location candidates. Then, we retrieve syntactic and semantic information from the root, the protein subtree, and the location subtree of each syntactic path. We extract the syntactic category and syntactic direction as syntactic information, and the synset offset from the WordNet thesaurus as semantic information. Based on this information, we extract (protein, localization) pairs. Even without biomolecular knowledge, our method shows reasonable performance in experiments on Medline abstract data. The experimental results show that our method outperforms previous methods and that the extracted syntactic and semantic information contributes to the improvement in performance.
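The path construction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The sentence, the protein name `ProtX`, the category tags, and the function names are all hypothetical. The sketch assumes that the root of a syntactic path is the lowest common ancestor of the protein and the location candidate, and that "syntactic direction" means whether a node lies to the left or right of that root. The abstract does not spell out either definition. The WordNet synset-offset lookup is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    idx: int              # position of the word in the sentence
    word: str
    category: str         # syntactic category, hand-assigned for this toy example
    head: Optional[int]   # index of the governing word; None for the sentence root


def ancestors(nodes: Dict[int, Node], i: int) -> List[int]:
    """Return the chain of indices from node i up to the sentence root."""
    chain = [i]
    while nodes[i].head is not None:
        i = nodes[i].head
        chain.append(i)
    return chain


def syntactic_path(nodes: Dict[int, Node], protein: int, location: int):
    """Split the protein-to-location path at the lowest common ancestor."""
    up_p = ancestors(nodes, protein)
    up_l = ancestors(nodes, location)
    on_location_chain = set(up_l)
    # First ancestor of the protein that also dominates the location.
    root = next(i for i in up_p if i in on_location_chain)
    protein_side = up_p[:up_p.index(root)]
    location_side = up_l[:up_l.index(root)]
    return root, protein_side, location_side


def path_features(nodes: Dict[int, Node], protein: int, location: int) -> dict:
    """Collect (category, direction) pairs for both sides of the path."""
    root, p_side, l_side = syntactic_path(nodes, protein, location)

    def describe(side: List[int]) -> List[Tuple[str, str]]:
        # Direction is assumed to be the node's position relative to the root.
        return [(nodes[i].category, "left" if i < root else "right") for i in side]

    return {
        "root": nodes[root].word,
        "root_category": nodes[root].category,
        "protein_subtree": describe(p_side),
        "location_subtree": describe(l_side),
    }


# Hypothetical sentence: "ProtX is localized in the nucleus"
SENTENCE = [
    Node(0, "ProtX", "NOUN", 2),
    Node(1, "is", "AUX", 2),
    Node(2, "localized", "VERB", None),
    Node(3, "in", "ADP", 2),
    Node(4, "the", "DET", 5),
    Node(5, "nucleus", "NOUN", 3),
]
NODES = {n.idx: n for n in SENTENCE}

FEATURES = path_features(NODES, protein=0, location=5)
```

In this toy tree the path root is the verb "localized", the protein side holds a single noun to its left, and the location side runs from "nucleus" through the preposition "in" to the right. A classifier or rule set could then decide from such features whether to emit the (protein, localization) pair.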