Transforming the arXiv to XML

  • Authors:
  • Heinrich Stamerjohanns; Michael Kohlhase

  • Affiliations:
  • Computer Science, Jacobs University Bremen; Computer Science, Jacobs University Bremen

  • Venue:
  • Proceedings of the 9th AISC international conference, the 15th Calculemus symposium, and the 7th international MKM conference on Intelligent Computer Mathematics
  • Year:
  • 2008

Abstract

We describe an experiment in transforming large collections of documents into more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter that is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection and collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This provides valuable feedback both to the developers of the LaTeXML package and to binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e., over 56% of the documents have been converted by LaTeXML without noticing an error and are available as XHTML+MathML documents).
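
The statistics step described in the abstract can be illustrated with a minimal sketch. It is not the arXMLiv build system: the record layout (document id, status, list of missing bindings), the status labels, and the file names are all hypothetical, chosen only to show how a success rate and a ranking of the most frequently missing bindings might be derived from per-document conversion results.

```python
from collections import Counter


def summarize(results):
    """Aggregate per-document conversion results.

    `results` is an iterable of (doc_id, status, missing) tuples. This record
    layout is an assumption made for illustration, not LaTeXML's log format.
    """
    status_counts = Counter()
    missing_counts = Counter()
    for doc_id, status, missing in results:
        status_counts[status] += 1
        # Count each missing binding at most once per document.
        missing_counts.update(set(missing))
    total = sum(status_counts.values())
    success_rate = status_counts["success"] / total if total else 0.0
    # most_common() ranks bindings by how many documents lack them.
    return success_rate, missing_counts.most_common()


# Hypothetical sample data; names such as foo.sty are placeholders.
sample = [
    ("doc-1", "success", []),
    ("doc-2", "error", ["foo.sty"]),
    ("doc-3", "error", ["foo.sty", "bar.cls"]),
    ("doc-4", "success", []),
]

rate, ranked = summarize(sample)
print(f"success rate: {rate:.0%}")
for name, count in ranked:
    print(f"{name}: missing in {count} document(s)")
```

Ranking by the number of affected documents is one plausible way to decide which binding to write next, since a single frequently used package can block many conversions.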