Exploiting semantic annotations in math information retrieval

Authors:
Petr Sojka
Affiliations:
Masaryk University, Brno, Czech Republic
Venue:
Proceedings of the fifth workshop on Exploiting semantic annotations in information retrieval
Year:
2012

Citing 7

INFTY: an integrated OCR system for mathematical documents
    Proceedings of the 2003 ACM symposium on Document engineering

Large linguistically-processed web corpora for multiple languages
    EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations

Cross lingual text classification by mining multilingual topics from wikipedia
    Proceedings of the fourth ACM international conference on Web search and data mining

Indexing and searching mathematics in digital libraries: architecture, design and scalability issues
    MKM'11 Proceedings of the 18th Calculemus and 10th international conference on Intelligent computer mathematics

Project EuDML: a first year demonstration
    MKM'11 Proceedings of the 18th Calculemus and 10th international conference on Intelligent computer mathematics

The art of mathematics retrieval
    Proceedings of the 11th ACM symposium on Document engineering

MaxTract: converting PDF to LaTeX, MathML and text
    CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics

Cited 2

Fifth workshop on exploiting semantic annotations in information retrieval: ESAIR'12
    Proceedings of the 21st ACM international conference on Information and knowledge management

Report on the fifth workshop on exploiting semantic annotations in information retrieval (ESAIR'12)
    ACM SIGIR Forum

Abstract

This paper describes how semantic annotations are exploited in the design and architecture of the MIaS (Math Indexer and Searcher) system for mathematics retrieval. Based on the claim that navigational and research search are the 'killer' applications for digital libraries such as the European Digital Mathematics Library (EuDML), we argue for an approach based on Natural Language Processing techniques, as used in corpus management systems such as the Sketch Engine, that will reach web scalability and avoid inference problems. The main ideas are 1) to augment surface texts (including math formulae) with additional linked representations bearing semantic information (expanded formulae as text, canonicalized text and subformulae) for indexing, including support for indexing structural information (expressed as Content MathML or other tree structures), and 2) to use semantic user preferences to order the documents found. The semantic enhancements of the MIaS system are being implemented as a math-aware search engine based on the state-of-the-art system Apache Lucene, with support for MathML tree indexing. Scalability has been tested on more than 400,000 arXiv documents.
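
To make the first idea in the abstract more concrete, the following is a minimal sketch in Python of how a formula might be expanded into several index terms: every subformula is serialized as text, and a second, canonicalized variant is produced in which identifiers are renamed in order of first appearance. This is an illustration only, not the MIaS implementation. The function names, the serialization format and the `depth_penalty` weighting are all assumptions, and Presentation MathML is used for brevity although the abstract speaks of Content MathML and other tree structures.

```python
import xml.etree.ElementTree as ET


def local_name(node):
    """Return the element name without any XML namespace prefix."""
    return node.tag.rsplit("}", 1)[-1]


def serialize(node, rename=None):
    """Serialize a MathML subtree as a compact text term.

    If `rename` is a dict, <mi> identifiers are replaced by id1, id2, ...
    in order of first appearance (a hypothetical canonicalization step).
    """
    name = local_name(node)
    text = (node.text or "").strip()
    if rename is not None and name == "mi":
        text = rename.setdefault(text, f"id{len(rename) + 1}")
    inner = "".join(serialize(child, rename) for child in node)
    return f"{name}({text}{inner})"


def index_terms(mathml, depth_penalty=0.5):
    """Map each (sub)formula term to an assumed weight.

    Deeper subformulae get lower weights; `depth_penalty` is an
    illustrative parameter, not a value taken from the paper.
    """
    root = ET.fromstring(mathml)
    terms = {}

    def visit(node, depth):
        weight = depth_penalty ** depth
        # Index both the literal form and the identifier-unified form.
        for term in (serialize(node), serialize(node, rename={})):
            terms[term] = max(terms.get(term, 0.0), weight)
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return terms


if __name__ == "__main__":
    formula = "<math><mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow></math>"
    for term, weight in sorted(index_terms(formula).items()):
        print(f"{weight:.2f}  {term}")
```

Under these assumptions, the formulae a + b and x + y share the canonicalized term `mrow(mi(id1)mo(+)mi(id2))`, so a query for one could match documents containing the other, while the literal terms still allow exact matches to be weighted higher. Such text terms could then be stored in an ordinary full-text index.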