Injecting information into atomic units of text

Authors:
Yannis Haralambous;Gábor Bella
Affiliations:
Département Informatique, ENST Bretagne, France;Département Informatique, ENST Bretagne, France
Venue:
Proceedings of the 2005 ACM symposium on Document engineering
Year:
2005

Cited by:
WebKhoj: Indian language IR from multiple character encodings (Proceedings of the 15th international conference on World Wide Web)

Abstract

This paper presents a new approach to text processing, based on textemes. These are atomic text units that generalise the concepts of character and glyph by merging them into a common data structure, together with an arbitrary number of user-defined properties. In the first part, we survey the notions of character and glyph and their relation to Natural Language Processing models, some visual text representation issues, and the strategies adopted by file formats (SVG, PDF, DVI) and software (Uniscribe, Pango). In the second part, we show applications of textemes to various text processing issues: ligatures, variant glyphs and other OpenType-related properties, hyphenation, color and other presentation attributes, Arabic form and morphology, CJK spacing, metadata, etc. Finally, we describe how the Omega typesetting system implements texteme processing as an example of a generalised approach to input character stream parsing, internal representation of text, and modular typographic transformations. In the data flow from input to output, whether in memory or through serializations in auxiliary data files, textemes progressively accumulate information that is used by Omega's paragraph builder engine and included in the output DVI file. We show how this additional information increases the efficiency of conversions to other file formats such as PDF or SVG. We conclude by presenting potential applications of texteme methods in document engineering.
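
As a rough illustration of the idea in the abstract, a texteme can be pictured as a record holding a character, a glyph, and an open-ended set of properties, which successive transformations enrich on the way to output. The sketch below is only an assumption-laden toy in Python, not Omega's actual data structure or API: the names `Texteme`, `parse`, `tag_color`, `map_glyphs`, `ligate_fi`, and `run`, the `color` and `ligature` property keys, and the glyph name `f_i` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Texteme:
    """Hypothetical texteme: character, glyph, and user-defined properties."""
    char: Optional[str] = None    # underlying character(s), if any
    glyph: Optional[str] = None   # glyph identifier, filled in by a later stage
    props: Dict[str, Any] = field(default_factory=dict)


# A stage is any transformation from a texteme list to a texteme list.
Stage = Callable[[List[Texteme]], List[Texteme]]


def parse(text: str) -> List[Texteme]:
    """Turn an input character stream into textemes with no glyph yet."""
    return [Texteme(char=c) for c in text]


def tag_color(textemes: List[Texteme]) -> List[Texteme]:
    """Attach a presentation attribute as a property (illustrative value)."""
    for t in textemes:
        t.props["color"] = "red"
    return textemes


def map_glyphs(textemes: List[Texteme]) -> List[Texteme]:
    """Assign a default glyph named after the character (an assumption)."""
    for t in textemes:
        if t.glyph is None:
            t.glyph = t.char
    return textemes


def ligate_fi(textemes: List[Texteme]) -> List[Texteme]:
    """Merge 'f' + 'i' into one texteme that keeps both characters."""
    out: List[Texteme] = []
    i = 0
    while i < len(textemes):
        cur = textemes[i]
        nxt = textemes[i + 1] if i + 1 < len(textemes) else None
        if nxt is not None and cur.char == "f" and nxt.char == "i":
            # The merged unit carries the glyph and the original characters,
            # so the text remains recoverable after glyph substitution.
            out.append(Texteme(char="fi", glyph="f_i",
                               props={**cur.props, "ligature": True}))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


def run(text: str, stages: List[Stage]) -> List[Texteme]:
    """Apply modular transformations in order; textemes accumulate data."""
    textemes = parse(text)
    for stage in stages:
        textemes = stage(textemes)
    return textemes
```

Calling `run("fin", [tag_color, map_glyphs, ligate_fi])` in this toy yields two textemes: a ligature unit that still records the characters `fi` alongside its glyph and properties, followed by a plain `n`. Keeping the characters next to the glyph is the kind of retained information that, per the abstract, can make later conversion to formats such as PDF or SVG more efficient.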