Social (distributed) language modeling, clustering and dialectometry
TextGraphs-4 Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing
Modern statistical natural language processing techniques require large amounts of human-annotated data to work well. For practical reasons, the required amount of data exists only for a few languages of major interest. In my work I show how a resource-rich language can be leveraged to produce the necessary resources and tools for related resource-poor languages. The work consists of two parts. The first part focuses on building a word-to-word translation model from parallel corpora, using a variety of methods, some well known and some new. The new methods exploit lexical and syntactic similarities between the languages. The second part uses the word-to-word model created in the first part to assign parts of speech and then parse text in several related resource-poor languages.