High-performance information extraction with AliBaba

Authors:
Peter Palaga;Long Nguyen;Ulf Leser;Jörg Hakenberg
Affiliations:
Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Arizona State University, Tempe, Arizona
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

Citing 9
Cited 5

Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
How to build a WebFountain: An architecture for very large-scale text analytics

IBM Systems Journal
Discovering patterns to extract protein–protein interactions from the literature: Part II

Bioinformatics
Managing information extraction: state of the art and research directions

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Extraction of regulatory gene/protein networks from Medline

Bioinformatics
AliBaba: PubMed as a graph

Bioinformatics
EntityRank: searching entities directly and holistically

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient Information Extraction over Evolving Text Data

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
A framework for schema-driven relationship discovery from unstructured text

ISWC'06 Proceedings of the 5th international conference on The Semantic Web

From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Enabling information extraction by inference of regular expressions from sample entities

Proceedings of the 20th ACM international conference on Information and knowledge management
Regular path queries on large graphs

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management

Quantified Score

Hi-index	0.00

Visualization

Abstract

A wealth of information is available only in web pages, patents, publications etc. Extracting information from such sources is challenging, both due to the typically complex language processing steps required and to the potentially large number of texts that need to be analyzed. Furthermore, integrating extracted data with other sources of knowledge often is mandatory for subsequent analysis. In this demo, we present the AliBaba system for scalable information extraction from biomedical documents. Unlike many other systems, AliBaba performs both entity extraction and relationship extraction and graphically visualizes the resulting network of inter-connected objects. It leverages the PubMed search engine for selection of relevant documents. The technical novelty of AliBaba is twofold: (a) its ability to automatically learn language patterns for relationship extraction without an annotated corpus, and (b) its high performance pattern matching algorithm. We show that a simple yet effective pattern filtering technique improves the runtime of the system drastically without harming its extraction effectiveness. Although AliBaba has been implemented for biomedical texts, its underlying principles should also be applicable in any other domain.