Motivation: Chemical compounds such as small signaling molecules or other biologically active substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, such as SMILES, InChI, IUPAC, and trivial names. Only SMILES and InChI allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and thereby mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, together with its evaluation and the conversion rate achieved with available name-to-structure tools.

Results: We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure of 81.5%). The remaining recognition problems concern detecting the correct boundaries of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.

Availability: We plan to publish the corpora and the annotation guideline, as well as the conditional random field model as a UIMA component.

Contact: roman.klinger@scai.fraunhofer.de
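
For readers unfamiliar with the F1 measure reported above, the following minimal sketch shows how it is commonly computed for entity recognition, and why boundary errors on long names are costly. It assumes exact-match scoring of (start, end) character spans, which the abstract does not state; the function name `span_f1` and all offsets are hypothetical illustrations, not data or code from this work.

```python
def span_f1(gold, predicted):
    """Return (precision, recall, F1) over exact-match (start, end) spans.

    Assumption: a predicted span counts as correct only if both of its
    boundaries agree with a gold span. The abstract does not say which
    matching criterion was actually used.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical character offsets of three chemical-name mentions.
gold = [(0, 12), (30, 58), (70, 95)]
# The second prediction ends too early, e.g. at an inner parenthesis.
predicted = [(0, 12), (30, 50), (70, 95)]

precision, recall, f1 = span_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under exact matching, the single truncated span counts as both a false positive and a false negative, so one boundary error lowers precision and recall at the same time.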