Transforming the arXiv to XML

Authors:
Heinrich Stamerjohanns;Michael Kohlhase
Affiliations:
Computer Science, Jacobs University Bremen; Computer Science, Jacobs University Bremen
Venue:
Proceedings of the 9th AISC international conference, the 15th Calculemus symposium, and the 7th international MKM conference on Intelligent Computer Mathematics
Year:
2008

Citing 0
Cited 2

Abramowitz and Stegun: a resource for mathematical document analysis

CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics
MathWebSearch 0.5: scaling an open formula search engine

CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics

Abstract

We describe an experiment in transforming large collections of documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeXML LaTeX-to-XML converter, which is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection and collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This creates valuable feedback both for the developers of the LaTeXML package and for binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e., over 56% of the documents have been converted by LaTeXML without a detected error and are available as XHTML+MathML documents).
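The kind of statistics the abstract mentions (a success rate and a ranking of the most frequently missing bindings) can be illustrated with a minimal Python sketch. The record layout, the status strings, the function name `summarize`, and the package names below are all hypothetical and chosen only for illustration; they are not the arXMLiv build system's actual data format or code.

```python
from collections import Counter


def summarize(results):
    """Return (success_rate, missing bindings ranked by document count).

    `results` is a list of (doc_id, status, missing_bindings) tuples.
    This layout is an assumption made for this sketch.
    """
    total = len(results)
    successes = sum(1 for _, status, _ in results if status == "success")
    # Count each missing binding once per document, so a package that is
    # reported several times for one paper does not inflate its rank.
    missing = Counter(
        name for _, _, bindings in results for name in set(bindings)
    )
    success_rate = successes / total if total else 0.0
    return success_rate, missing.most_common()


# Hypothetical conversion results for four documents.
results = [
    ("doc-1", "success", []),
    ("doc-2", "error", ["pkg-a.sty"]),
    ("doc-3", "error", ["pkg-a.sty", "pkg-b.cls"]),
    ("doc-4", "success", []),
]

rate, ranked = summarize(results)
print(f"success rate: {rate:.0%}")
for name, count in ranked:
    print(f"{name}: missing in {count} document(s)")
```

Ranking by the number of affected documents, rather than by raw occurrence count, is one plausible way to decide which binding to implement next; the paper itself may use a different measure.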