Mining, indexing, and searching for textual chemical molecule information on the web

Authors:
Bingjun Sun; Prasenjit Mitra; C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Abstract

Current search engines do not support searches for chemical entities (chemical names and formulae) beyond simple keyword matching. A chemical molecule can usually be represented in multiple textual ways, but a simple keyword search retrieves only exact matches of the query and misses the other representations. We show how to build a search engine that enables searches for chemical entities and demonstrate empirically that it improves the relevance of returned documents. Our search engine first extracts chemical entities from text, then performs novel indexing suitable for chemical names and formulae, and supports the different query models that a scientist may require. We propose a model of hierarchical conditional random fields for chemical formula tagging that considers long-term dependencies at the sentence level. To support substring searches of chemical names, a search engine must index substrings of chemical names, but indexing all possible subsequences is not feasible in practice. We propose an algorithm for independent frequent subsequence mining to discover sub-terms of chemical names with their probabilities. We then propose an unsupervised hierarchical text segmentation (HTS) method that represents a sequence as a tree structure based on the discovered independent frequent subsequences, so that the sub-terms on the HTS tree are the ones to be indexed. Query models with corresponding ranking functions are introduced for chemical name searches. Experiments show that our approaches to chemical entity tagging perform well. Furthermore, we show that index pruning can reduce the index size and query time without significantly changing the returned ranked results. Finally, experiments show that our approaches outperform traditional methods for document search with ambiguous chemical terms.
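
The indexing trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the sub-term vocabulary SUB_TERMS is hand-picked and hypothetical (the paper mines sub-terms from data), and the greedy longest-match segmentation is a simple stand-in for the HTS method. The function names all_substrings, segment, and build_index are likewise assumptions made for this example. The sketch only shows why indexing a small set of selected sub-terms is cheaper than indexing every substring while still answering sub-term queries.

```python
from collections import defaultdict

# Hypothetical sub-term vocabulary, hand-picked for illustration only;
# the paper discovers sub-terms by mining rather than listing them by hand.
SUB_TERMS = {"methyl", "propyl", "benzene", "amine", "chloro"}


def all_substrings(name):
    """Return every distinct non-empty substring of name."""
    return {name[i:j] for i in range(len(name)) for j in range(i + 1, len(name) + 1)}


def segment(name, vocab):
    """Greedy longest-match segmentation (a stand-in, not the HTS method).

    Characters that are not covered by any sub-term are skipped.
    """
    terms, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(name):
        for length in range(min(max_len, len(name) - i), 0, -1):
            if name[i:i + length] in vocab:
                terms.append(name[i:i + length])
                i += length
                break
        else:
            i += 1  # no sub-term starts here; move on one character
    return terms


def build_index(names, vocab):
    """Map each sub-term to the set of names whose segmentation contains it."""
    index = defaultdict(set)
    for name in names:
        for term in segment(name, vocab):
            index[term].add(name)
    return index


names = ["methylbenzene", "chlorobenzene", "propylamine", "methylamine"]
index = build_index(names, SUB_TERMS)

print(len(all_substrings("methylbenzene")))  # entries a naive substring index would store
print(segment("methylbenzene", SUB_TERMS))   # entries the selective index stores
print(sorted(index["benzene"]))              # names returned for the query "benzene"
```

A name of length n has up to n(n+1)/2 distinct substrings, whereas the selective index stores only a handful of sub-terms per name. The cost is that queries must align with the chosen sub-terms, which is why the quality of the mined vocabulary matters.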