Indexing and searching mathematics in digital libraries: architecture, design and scalability issues

  • Authors:
  • Petr Sojka; Martin Líška

  • Affiliations:
  • Masaryk University, Faculty of Informatics, Brno, Czech Republic; Masaryk University, Faculty of Informatics, Brno, Czech Republic

  • Venue:
  • MKM'11 Proceedings of the 18th Calculemus and 10th international conference on Intelligent computer mathematics
  • Year:
  • 2011

Abstract

This paper surveys approaches and systems for searching mathematical formulae in mathematical corpora and on the web. We present the design and architecture of our MIaS (Math Indexer and Searcher) system and discuss our design decisions in detail. We propose an approach based on Presentation MathML that uses the similarity of math subformulae, and verify it by implementing a math-aware search engine on top of the state-of-the-art system Apache Lucene. Scalability was tested on 324,000 real scientific documents from the arXiv archive containing 112 million mathematical formulae. More than two billion MathML subformulae were indexed using our Solr-compatible Lucene extension.
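
To make the idea of indexing subformulae concrete, the following is a minimal sketch of how a Presentation MathML expression could be broken into subformulae before indexing. It is an illustration only and not the MIaS implementation: the function names (`canonical`, `subformulae`, `index_terms`), the compact string serialization, and the use of tree depth as an attribute of each term are all assumptions made for this example, and the sample formula is hypothetical.

```python
import xml.etree.ElementTree as ET


def canonical(node):
    """Serialize a MathML node compactly: tag(children) or tag:text for leaves.

    This serialization is an assumption for illustration, not the MIaS format.
    """
    tag = node.tag.split("}")[-1]  # drop an XML namespace prefix if present
    children = list(node)
    if not children:
        return f"{tag}:{(node.text or '').strip()}"
    return f"{tag}(" + ",".join(canonical(c) for c in children) + ")"


def subformulae(node, depth=0):
    """Yield (depth, serialized subformula) for the node and every descendant."""
    yield depth, canonical(node)
    for child in node:
        yield from subformulae(child, depth + 1)


def index_terms(mathml):
    """Parse a MathML string and return all of its subformulae as index terms."""
    root = ET.fromstring(mathml)
    return list(subformulae(root))


# Hypothetical sample input: a^2 + b in Presentation MathML.
MATHML = (
    "<math><mrow><msup><mi>a</mi><mn>2</mn></msup>"
    "<mo>+</mo><mi>b</mi></mrow></math>"
)

if __name__ == "__main__":
    for depth, term in index_terms(MATHML):
        print(depth, term)
```

Each formula yields one term per node of its MathML tree, which suggests why the number of indexed subformulae in a corpus can be much larger than the number of formulae. A real system would presumably also normalize and weight these terms; that step is deliberately left out of this sketch.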