An Approach to Mathematical Search Through Query Formulation and Data Normalization

Authors:
Robert Miner; Rajesh Munavalli
Affiliations:
Design Science, Inc., St. Paul, MN 55101, USA; Design Science, Inc., St. Paul, MN 55101, USA
Venue:
Calculemus '07 / MKM '07 Proceedings of the 14th symposium on Towards Mechanized Mathematical Assistants: 6th International Conference
Year:
2007

Abstract

This article describes an approach to searching for mathematical notation. The approach aims at a search system that can be deployed effectively and economically, and that produces good results with a large portion of the mathematical content freely available on the World Wide Web today. The basic concept is to linearize mathematical notation as a sequence of text tokens, which are then indexed by a traditional text search engine. However, naive generalization of the "phrase query" of text search to mathematical expressions performs poorly. For adequate precision and recall in the mathematical context, more complex combinations of atomic queries are required. Our approach is to query for a weighted collection of significant subexpressions, where weights depend on expression complexity, nesting depth, expression length, and special boosting of well-known expressions.

To make this approach perform well with the technical content that is readily obtainable on the World Wide Web, either directly or through conversion, it is necessary to extensively normalize mathematical expression data to eliminate accidental or irrelevant encoding differences. To do this, a multi-pass normalization process is applied. In successive stages, MathML and XML errors are corrected, character data is canonicalized, white space and other insignificant data are removed, and heuristics are applied to disambiguate expressions. Following these preliminary stages, the MathML tree structure is canonicalized via an augmented precedence parsing step. Finally, mathematical synonyms and some variable names are canonicalized.
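
The linearization and weighted-subexpression ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token format, the function names `linearize` and `weighted_subexpressions`, the `decay` parameter, and the weight formula (token count scaled down by nesting depth) are all assumptions chosen for illustration. The abstract also mentions expression complexity and boosting of well-known expressions, which this sketch omits.

```python
import xml.etree.ElementTree as ET

# Presentation MathML for x^2 + 1 (namespace omitted to keep the sketch short).
SAMPLE = (
    "<math><mrow>"
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn>"
    "</mrow></math>"
)


def local_name(tag):
    """Drop an XML namespace prefix of the form '{uri}name', if present."""
    return tag.rsplit("}", 1)[-1]


def linearize(node):
    """Flatten a MathML subtree into a list of plain-text tokens.

    Leaf elements become 'name:text'; container elements are wrapped in
    '<name>' ... '</name>' marker tokens (a hypothetical token format).
    """
    name = local_name(node.tag)
    if len(node) == 0:
        text = (node.text or "").strip()  # discard insignificant white space
        return [f"{name}:{text}"] if text else [name]
    tokens = [f"<{name}>"]
    for child in node:
        tokens.extend(linearize(child))
    tokens.append(f"</{name}>")
    return tokens


def weighted_subexpressions(root, decay=0.5):
    """Return (phrase, weight) pairs for every non-leaf subexpression.

    Assumed weight: number of tokens times decay**depth, so longer and
    shallower subexpressions contribute more to the combined query.
    """
    results = []

    def visit(node, depth):
        if len(node) > 0:
            tokens = linearize(node)
            results.append((" ".join(tokens), len(tokens) * decay ** depth))
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return results


root = ET.fromstring(SAMPLE)
for phrase, weight in weighted_subexpressions(root):
    print(f"{weight:5.2f}  {phrase}")
```

Each phrase could then be submitted to a conventional text search engine as a boosted phrase query, with the weights serving as boost factors. How the individual scores are combined is left open here, since the abstract does not specify it.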