DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
An experimental workflow development platform for historical document digitisation and analysis
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
This paper presents a new semi-supervised clustering framework for the recognition of heavily degraded characters in historical typewritten documents, where off-the-shelf OCR typically fails. The constraints are generated from typographical (collection-independent) domain knowledge and are used to guide both sample (glyph set) partitioning and metric learning. Experimental results using simple features provide encouraging evidence that this approach can lead to significantly improved clustering results compared to simple K-Means clustering, as well as to clustering using a state-of-the-art OCR engine.
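To illustrate how pairwise constraints can guide partitioning, the following is a minimal sketch in the style of COP-KMeans (constrained K-Means with must-link and cannot-link pairs). It is not the framework described in the abstract: it omits metric learning entirely, and the function names, the toy 2-D "glyph features", and the hand-written constraints are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def violates(i, c, labels, must_link, cannot_link):
    """Return True if putting point i in cluster c breaks a constraint.

    Only partners that already have a label in this pass are checked.
    """
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] not in (None, c):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] == c:
            return True
    return False


def constrained_kmeans(points, k, must_link=(), cannot_link=(),
                       init=None, max_iter=50):
    """Hypothetical COP-KMeans-style clustering (illustrative only).

    points: list of equal-length numeric tuples.
    init:   optional list of k starting centres (defaults to first k points).
    """
    centers = [tuple(p) for p in (init if init is not None else points[:k])]
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            for c in sorted(range(k), key=lambda c: dist(p, centers[c])):
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Recompute each centre as the mean of its members.
        for c in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Toy data (assumed): two clean groups plus one ambiguous "degraded" sample.
glyphs = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.55, 0.5)]
start = [glyphs[0], glyphs[2]]

plain, _ = constrained_kmeans(glyphs, 2, init=start)
guided, _ = constrained_kmeans(glyphs, 2, must_link=[(4, 0)], init=start)
print(plain)   # [0, 0, 1, 1, 1]
print(guided)  # [0, 0, 1, 1, 0]
```

In this toy run the ambiguous sample (index 4) lies slightly closer to the second group, so plain K-Means places it there, while a single must-link constraint moves it to the first group. In the setting the abstract describes, such constraints would come from typographical domain knowledge rather than being written by hand.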