Database support for species extraction from the biosystematics literature: a feasibility demonstration

Authors:
Ralf Duckstein; Klemens Böhm
Affiliations:
Otto-von-Guericke-University Magdeburg, Magdeburg, Germany; Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Abstract

Part of the biosystematics literature is currently being digitized and manually marked up with XML, with the goal of supporting fast search on these documents. Manual markup is costly, however, and biologists would like to know the value of this effort in advance. Deploying standard XML database technology in a straightforward way is not feasible because of two characteristics of biosystematics documents. First, descriptions of taxa are related: a more specific taxon should inherit from a more general one. Combining inheritance with information-retrieval mechanisms gives rise to difficulties that this article addresses. Second, such documents frequently contain very specific technical terms, such as geographical information or biological terms. To investigate the characteristics of search in the presence of these difficulties, we have designed and implemented a system based on relational database technology. We use a collection of XML documents that mimics the characteristics of biosystematics documents, as we will explain. We propose two query-evaluation alternatives and compare them by means of performance experiments. It turns out that our techniques can manage the envisioned corpus of documents efficiently while coping with both problems.
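To make the inheritance requirement concrete, the following is a minimal sketch of how a keyword search over taxon descriptions might take inherited text into account on a relational store. It is an illustration only and not the system described in the paper: the `taxon` table, its columns, the sample rows, and the functions `build_demo_db` and `search` are all hypothetical, and SQLite with a recursive query is an assumption chosen for brevity. A taxon matches a term if the term occurs in its own description or in the description of any of its ancestors.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, invented taxon hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical schema: each taxon points to its more general parent.
        CREATE TABLE taxon (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            parent_id   INTEGER REFERENCES taxon(id),
            description TEXT NOT NULL
        );
        INSERT INTO taxon VALUES
            (1, 'Genus A',    NULL, 'wings transparent'),
            (2, 'Species A1', 1,    'found in alpine meadows'),
            (3, 'Species A2', 1,    'found in coastal dunes'),
            (4, 'Genus B',    NULL, 'wings opaque');
        """
    )
    return conn


def search(conn, term):
    """Return names of taxa whose own or inherited description contains term."""
    rows = conn.execute(
        """
        WITH RECURSIVE lineage(taxon_id, ancestor_id) AS (
            -- Every taxon counts as its own ancestor.
            SELECT id, id FROM taxon
            UNION
            -- Walk one step up the hierarchy per iteration.
            SELECT l.taxon_id, t.parent_id
            FROM lineage AS l
            JOIN taxon AS t ON t.id = l.ancestor_id
            WHERE t.parent_id IS NOT NULL
        )
        SELECT DISTINCT leaf.name
        FROM lineage AS l
        JOIN taxon AS leaf ON leaf.id = l.taxon_id
        JOIN taxon AS anc  ON anc.id  = l.ancestor_id
        WHERE anc.description LIKE '%' || ? || '%'
        ORDER BY leaf.name
        """,
        (term,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    # The genus-level term is inherited by both species.
    print(search(conn, "transparent"))
    # A species-level term matches only that species.
    print(search(conn, "alpine"))
```

In this sketch the inherited text is resolved at query time. Precomputing the inherited descriptions at load time would be another possible design; the sketch takes no position on which choice the paper's two query-evaluation alternatives correspond to.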