We propose a comprehensive theory of code-mixed discourse, encompassing equivalence-point and insertional code-switching, palindromic constructions and lexical borrowing. The starting point is a production model of code-switching accounting for empirical observations about switch-point distribution (the equivalence constraint), well-formedness of monolingual fragments, conservation of constituent structure and lack of constraint between successive switch points, without invoking any "code-switching grammar". Code-switched sentence production makes alternate reference to two virtual monolingual sentences, one in each language, and is based on conservative conditions on language labeling of constituents, together with a constraint against real-time "look-ahead" from one code-switch to the next. Selective weakening of model conditions can produce (i) the type of palindromic (or portmanteau) construction occasionally occurring, e.g., in switches between prepositional and postpositional languages, (ii) the switching by "insertion" of very specific kinds of constituent reported, e.g., for French noun phrases in switching with Arabic and, most important, (iii) lexical borrowing. Borrowing can create ambiguity as to the language membership of sentence items, but the model predicts where this ambiguity can be resolved, and the confirmation of these predictions, based on empirical studies of inflectional morphology, validates key aspects of the model.