Identifying, Indexing, and Ranking Chemical Formulae and Chemical Names in Digital Documents

Authors:
Bingjun Sun;Prasenjit Mitra;C. Lee Giles;Karl T. Mueller
Affiliations:
The Pennsylvania State University;The Pennsylvania State University;The Pennsylvania State University;The Pennsylvania State University
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2011

Abstract

End-users rely on chemical search engines to search for chemical formulae and chemical names. To support efficient search and retrieval, these engines identify and index the chemical formulae and chemical names that appear in text documents. Identifying chemical formulae and chemical names in text automatically is a hard problem, and past approaches have met with varying degrees of success. We propose algorithms for chemical formula and chemical name tagging using Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) that achieve higher accuracy than existing published methods. After chemical entities have been identified in text documents, they must be indexed. To support user queries that require only a partial match, where the keyword is a segment of a chemical name or a partial chemical formula, all possible (or a significant number of) subformulae of the formulae that appear in any document and all possible subterms (e.g., “methyl”) of chemical names (e.g., “methylethyl ketone”) must be indexed. Indexing all possible subformulae and subterms results in an exponential increase in storage and memory requirements as well as in the time taken to process the indices. We propose techniques that prune the indices substantially without significantly reducing the quality of the returned results. Finally, we propose multiple query semantics that allow users to pose different types of partial search queries for chemical entities. We demonstrate empirically that our search engines improve the relevance of the returned results for search queries involving chemical entities.
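
The growth in index size described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm. It assumes, purely for illustration, that a subformula is any non-empty subset of a formula's element-count tokens, and it handles only simple formulae without parentheses. The names `parse_formula`, `subformulae`, and `build_index` are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: one token is an element symbol followed by an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Split a simple formula such as 'C2H6O' into (element, count) tokens."""
    return TOKEN.findall(formula)


def subformulae(formula):
    """Return every non-empty subset of tokens, joined in original order.

    A formula with n tokens yields up to 2**n - 1 entries, which is the
    exponential blow-up that motivates index pruning.
    """
    tokens = parse_formula(formula)
    subs = set()
    for size in range(1, len(tokens) + 1):
        for combo in combinations(tokens, size):
            subs.add("".join(elem + count for elem, count in combo))
    return subs


def build_index(docs):
    """Map each subformula to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, formulae in docs.items():
        for formula in formulae:
            for sub in subformulae(formula):
                index[sub].add(doc_id)
    return index


if __name__ == "__main__":
    index = build_index({"d1": ["C2H6O"], "d2": ["H2O"]})
    print(sorted(subformulae("C2H6O")))
    print(sorted(index["O"]))
```

Under these assumptions, a three-token formula already produces seven index entries, and each additional token roughly doubles that number. This is why an unpruned index quickly becomes impractical.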