The effect of domain and text type on text prediction quality

Authors:
Suzan Verberne;Antal van den Bosch;Helmer Strik;Lou Boves
Affiliations:
Radboud University Nijmegen;Radboud University Nijmegen;Radboud University Nijmegen;Radboud University Nijmegen
Venue:
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2012

Abstract

Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes needed to type a text. In this paper, we address the influence of text type and domain differences on text prediction quality. By training and testing our text prediction algorithm on four text types (Wikipedia, Twitter, transcriptions of conversational speech, and FAQ) with equal corpus sizes, we found a clear effect of text type on text prediction quality: training and testing on the same text type saved between 27% and 34% of keystrokes, whereas training on a different text type caused the savings to drop to between 16% and 28%. In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found that both text type and topic domain play a role in text prediction quality. The best-performing training corpus was a set of medical pages from Wikipedia. The second-best result was obtained by leave-one-out experiments on the test questions, even though this training corpus was much smaller (2,672 words) than the other corpora (1.5 million words).
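
To make the "percentage of saved keystrokes" measure concrete, the sketch below simulates a toy prefix-based word completer and scores it on a short test text. It is an illustration only, not the prediction algorithm evaluated in the paper. The frequency-based predictor, the one-keystroke cost for accepting a suggestion, the omission of spaces, and all function names are assumptions made for this example.

```python
from collections import Counter


def train(corpus_words):
    """Build a toy model: word frequencies from the training corpus."""
    return Counter(corpus_words)


def predict(model, prefix):
    """Return the most frequent training word that extends `prefix`, or None.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [
        (count, word)
        for word, count in model.items()
        if word.startswith(prefix) and len(word) > len(prefix)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (-c[0], c[1]))[1]


def keystrokes_with_prediction(model, word):
    """Keystrokes needed to enter `word` when suggestions are available.

    Assumption: accepting a correct suggestion costs one keystroke.
    """
    for typed in range(len(word)):
        if predict(model, word[:typed]) == word:
            # Typed characters plus one keystroke to accept the suggestion,
            # never more than typing the word out in full.
            return min(typed + 1, len(word))
    return len(word)


def keystrokes_saved_percent(model, test_words):
    """Percentage of keystrokes saved relative to typing every character."""
    baseline = sum(len(w) for w in test_words)
    if baseline == 0:
        return 0.0
    predicted = sum(keystrokes_with_prediction(model, w) for w in test_words)
    return 100.0 * (baseline - predicted) / baseline


if __name__ == "__main__":
    model = train("the cat sat on the mat the end".split())
    print(keystrokes_saved_percent(model, ["the", "cat", "dog"]))
```

Swapping the training list for text of a different type or domain than the test list gives a rough feel for the train/test mismatch the abstract describes, although a toy model of this size says nothing about the reported percentages.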