Abramowitz and Stegun: a resource for mathematical document analysis
CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics
MathWebSearch 0.5: scaling an open formula search engine
CICM'12 Proceedings of the 11th international conference on Intelligent Computer Mathematics
We describe an experiment in transforming large collections of documents into more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX to XML converter that is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection, collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This creates valuable feedback for both the developers of the LaTeXML package and binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e., LaTeXML converted over 56% of the documents without noticing an error, and these are available as XHTML+MathML documents).
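The statistics step described above can be illustrated with a minimal sketch. It is not the arXMLiv build system: the record layout (document id, list of missing bindings, error flag), the function name `summarize`, and the sample data are all hypothetical, chosen only to show how missing bindings might be ranked and an error-free conversion rate computed.

```python
from collections import Counter


def summarize(results):
    """Rank missing bindings and compute the error-free conversion rate.

    `results` is a hypothetical list of (doc_id, missing_bindings, had_error)
    tuples; the real build system's record format is not given in the source.
    """
    missing = Counter()
    clean = 0
    for _doc_id, missing_bindings, had_error in results:
        # Count each binding at most once per document.
        missing.update(set(missing_bindings))
        if not had_error:
            clean += 1
    rate = clean / len(results) if results else 0.0
    return missing.most_common(), rate


# Made-up sample records, for illustration only.
sample = [
    ("doc-a", ["foo.sty"], True),
    ("doc-b", [], False),
    ("doc-c", ["foo.sty", "bar.cls"], True),
    ("doc-d", [], False),
]
ranking, rate = summarize(sample)
```

Ranking bindings by the number of documents that need them is one plausible way to decide which binding to implement next; the source does not say how the project actually prioritizes them.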