Extraction and search of chemical formulae in text documents on the web

Authors:
Bingjun Sun;Qingzhao Tan;Prasenjit Mitra;C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

Scientists often search the Web for articles related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Exact keyword matching causes two problems in this domain: a) if the user searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return all documents where Helium is mentioned as well as documents where the pronoun "he" occurs. To remedy these deficiencies, we propose a chemical formula search engine. To build a chemical formula search engine, we must solve the following problems: 1) extract chemical formulae from text documents, 2) index chemical formulae, and 3) design ranking functions for the chemical formulae. Furthermore, query models are introduced for formula search, and for each a scoring scheme based on features of partial formulae is proposed to measure the relevance of chemical formulae and queries. We evaluate algorithms for identifying chemical formulae in documents using classification methods based on Support Vector Machines (SVM), and a probabilistic model based on conditional random fields (CRF). Different methods for SVM and CRF to tune the trade-off between recall and precision for imbalanced data are proposed to improve the overall performance. A feature selection method based on frequency and discrimination is used to remove uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing ranked query results much.
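
To make the CH4 versus H4C problem concrete, the following minimal Python sketch normalizes a flat formula into element counts, so that two strings listing the same atoms in a different order compare equal. It is an illustration only and not the extraction or indexing method of the paper: the names `parse_formula` and `same_formula` are hypothetical, the pattern does not check symbols against the periodic table, and parentheses, charges, and partial formulae are not handled.

```python
import re
from collections import Counter

# Assumed token shape (not from the paper): an element symbol, i.e. one
# capital letter plus an optional lowercase letter, then an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return element counts for a flat formula such as 'CH4'.

    Returns None when the string is not made up entirely of
    symbol-and-count tokens (for example the lowercase word 'he').
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # A gap between tokens means some characters did not parse.
            return None
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        pos = match.end()
    if pos != len(formula) or not counts:
        return None
    return counts


def same_formula(a, b):
    """True when both strings parse and contain the same atoms."""
    ca, cb = parse_formula(a), parse_formula(b)
    return ca is not None and ca == cb


if __name__ == "__main__":
    print(same_formula("CH4", "H4C"))
    print(parse_formula("he"))
```

Note that this sketch still treats a capitalized "He" at the start of a sentence as Helium, which is the kind of ambiguity that motivates the context-aware SVM and CRF classifiers evaluated in the paper.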