Enhancing phylogeography by improving geographical information from GenBank

  • Authors:
  • Matthew Scotch;Indra Neil Sarkar;Changjiang Mei;Robert Leaman;Kei-Hoi Cheung;Pierina Ortiz;Ashutosh Singraur;Graciela Gonzalez

  • Affiliations:
  • Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA;Center for Clinical and Translational Science, University of Vermont, Burlington, VT, USA and Department of Microbiology & Molecular Genetics, University of Vermont, Burlington, VT, USA and Depart ...;Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA;Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA;Yale Center for Medical Informatics, Yale University, New Haven, CT, USA;Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA;Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA;Department of Biomedical Informatics, Arizona State University, Tempe, AZ, USA

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2011

Abstract

Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as the location of a species or viral host, is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the species over time within a geographical area, the impact of geography on its evolutionary history, or the expected population of the species within the area. Molecular sequence data from NCBI, specifically GenBank, provide an abundance of sequence data for phylogeography. However, geographical data is sparse and inconsistently represented across GenBank entries. This can impede analysis and, in situations where the geographical information is inferred, potentially lead to erroneous results. In this paper, we describe the current state of geographical data in GenBank and illustrate how automated processing techniques, such as named entity recognition, can enhance the geographical data available for phylogeographic studies.
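As a rough illustration of the idea in the abstract, the sketch below reads the location of a GenBank flat-file record from its `/country` qualifier and, when that qualifier is absent, falls back to a dictionary lookup of place names in the `/strain` or `/isolate` text. This is not the authors' method: the gazetteer, the function name `extract_location`, and the fallback order are assumptions made for the example, and dictionary matching is only the simplest form of named entity recognition.

```python
import re

# Hypothetical toy gazetteer (assumption); a real study would use a full
# geographic resource rather than three hand-picked entries.
GAZETTEER = {
    "hong kong": "Hong Kong",
    "viet nam": "Viet Nam",
    "arizona": "USA: Arizona",
}

# Matches source-feature qualifiers such as /country="USA: Arizona".
QUALIFIER = re.compile(r'/(country|strain|isolate)="([^"]*)"')


def extract_location(record_text):
    """Return (location, source_field) for one record, or (None, None)."""
    qualifiers = {}
    for key, value in QUALIFIER.findall(record_text):
        # Keep the first value per qualifier and collapse wrapped whitespace.
        qualifiers.setdefault(key, " ".join(value.split()))

    # Prefer the explicit geographic qualifier when the submitter provided one.
    if qualifiers.get("country"):
        return qualifiers["country"], "country"

    # Otherwise look for known place names in free-text qualifiers.
    for key in ("strain", "isolate"):
        text = qualifiers.get(key, "").lower()
        for name, canonical in GAZETTEER.items():
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                return canonical, key

    return None, None
```

Returning the source field alongside the location lets a downstream analysis separate locations stated by the submitter from locations that were inferred, which matters because the abstract notes that inferred geography can lead to erroneous results.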