Exploring contextual models in chemical patent search

Authors:
Jay Urbain;Ophir Frieder
Affiliations:
Electrical Engineering & Computer Science Department, Milwaukee School of Engineering, Milwaukee, WI;Department of Computer Science, Georgetown University, Washington, DC
Venue:
IRFC'10 Proceedings of the First International Information Retrieval Facility Conference on Advances in Multidisciplinary Retrieval
Year:
2010

Abstract

We explore probabilistic retrieval models that integrate term statistics with entity search across multiple levels of document context to improve chemical patent search. A distributed indexing model was developed to enable efficient named entity search and aggregation of term statistics at multiple levels of patent structure, including individual words, sentences, claims, descriptions, abstracts, and titles. The system can be scaled to an arbitrary number of compute instances in a cloud computing environment to support concurrent indexing and query processing on large patent collections. The query processing algorithm for patent prior art search uses information extraction techniques to identify candidate entities and distinctive terms in the query patent’s title, abstract, description, and claim sections. Structured queries integrating terms and entities in context are then generated automatically to test the validity of each section of potentially relevant patents. The system was deployed across 15 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances to support efficient indexing and query processing of the relatively large (100 GB+) collection of chemical patent documents. We evaluated several retrieval models for integrating statistics of candidate entities with term statistics at multiple levels of patent structure to identify relevant patents for prior art search. Our top-performing retrieval model, which integrates contextual evidence from multiple levels of patent structure, achieved a bpref of 0.8929 on the prior art search task, exceeding the top results reported for the 2009 TREC Chemistry track.
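
For readers unfamiliar with the bpref measure reported above, the following is a minimal sketch of how it can be computed for a single query. It assumes the variant commonly used in TREC evaluation tooling, in which the penalty for each relevant document is normalized by min(R, N), where R and N are the numbers of judged relevant and judged non-relevant documents. The function name, argument names, and document identifiers are hypothetical, and this is not taken from the authors' evaluation code.

```python
def bpref(ranking, relevant, nonrelevant):
    """Sketch of bpref for one query (min(R, N)-normalized variant, assumed).

    ranking     -- list of document ids, best first
    relevant    -- set of ids judged relevant
    nonrelevant -- set of ids judged non-relevant
    Documents in neither set are unjudged and are ignored.
    """
    num_rel = len(relevant)
    num_nonrel = len(nonrelevant)
    if num_rel == 0:
        return 0.0

    denom = min(num_rel, num_nonrel)
    nonrel_seen = 0  # judged non-relevant documents ranked so far
    total = 0.0

    for doc in ranking:
        if doc in relevant:
            if denom == 0:
                # No judged non-relevant documents: no penalty is possible.
                total += 1.0
            else:
                # Penalize by the fraction of judged non-relevant documents
                # ranked above this relevant one (count capped at R).
                total += 1.0 - min(nonrel_seen, num_rel) / denom
        elif doc in nonrelevant:
            nonrel_seen += 1

    # Relevant documents that were never retrieved contribute zero.
    return total / num_rel


# Hypothetical example: three judged relevant patents, two judged non-relevant.
example_score = bpref(
    ranking=["p1", "u1", "n1", "p2", "n2"],
    relevant={"p1", "p2", "p3"},
    nonrelevant={"n1", "n2"},
)
print(example_score)
```

Because unjudged documents are skipped, the measure depends only on the relative order of judged documents, which is why it is often preferred when relevance judgments are incomplete.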