Abstraction is harmful in language learning

Authors:
Walter Daelemans
Affiliations:
Tilburg University, The Netherlands and University of Antwerp, Belgium
Venue:
NeMLaP3/CoNLL '98 Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
Year:
1998

Citing 2

IGTree: Using Trees for Compression and Classification in Lazy Learning Algorithms

Artificial Intelligence Review - Special issue on lazy learning
Similarity-based methods for word sense disambiguation

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics

Abstract

The usual approach to learning language processing tasks such as tagging, parsing, grapheme-to-phoneme conversion, pp-attachment, etc., is to extract regularities from training data in the form of decision trees, rules, probabilities or other abstractions. These representations of regularities are then used to solve new cases of the task. The individual training examples on which the abstractions were based are discarded (forgotten). While this approach seems to work well for other application areas of Machine Learning, I will show that there is evidence that it is not the best way to learn language processing tasks. I will briefly review empirical work in our groups in Antwerp and Tilburg on lazy language learning. In this approach (also called instance-based, case-based, memory-based, or example-based learning), generalization happens at processing time by means of extrapolation from the most similar items in memory to the new item being processed. Lazy learning with a simple similarity metric based on information entropy (IB1-IG, Daelemans & van den Bosch, 1992, 1997) consistently outperforms abstracting (greedy) learning techniques such as C5.0 or backprop learning on a broad selection of natural language processing tasks ranging from phonology to semantics. Our intuitive explanation for this result is that lazy learning techniques keep all training items, whereas greedy approaches lose useful information by forgetting low-frequency or exceptional instances of the task that are not covered by the extracted rules or models (Daelemans, 1996). Apart from the empirical work in Tilburg and Antwerp, a number of recent studies on statistical natural language processing (e.g., Dagan & Lee, 1997; Collins & Brooks, 1995) also suggest that, contrary to common wisdom, forgetting specific training items, even when they represent extremely low-frequency events, is harmful to generalization accuracy.
After reviewing this empirical work briefly, I will report on new results (work in progress in collaboration with van den Bosch and Zavrel), systematically comparing greedy and lazy learning techniques on a number of benchmark natural language processing tasks: tagging, grapheme-to-phoneme conversion, and pp-attachment. The results show that forgetting individual training items, however 'improbable' they may be, is indeed harmful. Furthermore, they show that combining lazy learning with training set editing techniques (based on typicality and other regularity criteria) also leads to worse generalization results. I will conclude that forgetting, either by abstracting from the training data or by editing exceptional training items in lazy learning, is harmful to generalization accuracy, and will attempt to provide an explanation for these unexpected results.
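
The lazy learning scheme described in the abstract can be illustrated with a minimal sketch. It assumes symbolic features, an overlap distance in which each mismatching feature costs its information gain, and a single nearest neighbour; the abstract itself only says that the metric is "based on information entropy", so these details, along with the function names and the arbitrary tie-breaking, are assumptions made for illustration and not the author's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain_weights(instances, labels):
    """One weight per feature: the class entropy removed by knowing that feature."""
    base = entropy(labels)
    weights = []
    for f in range(len(instances[0])):
        # Group the class labels by the value this feature takes.
        by_value = {}
        for inst, lab in zip(instances, labels):
            by_value.setdefault(inst[f], []).append(lab)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in by_value.values())
        weights.append(base - remainder)
    return weights


def classify(query, instances, labels, weights):
    """Return the label of the stored instance closest to the query.

    All training instances are kept in memory; nothing is abstracted away.
    Distance is the sum of the weights of the mismatching features.
    Ties are broken by storage order (an assumption of this sketch).
    """
    def distance(inst):
        return sum(w for w, a, b in zip(weights, inst, query) if a != b)

    best = min(range(len(instances)), key=lambda i: distance(instances[i]))
    return labels[best]


# Hypothetical toy data: the first feature predicts the class, the second is noise.
train = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
classes = ["P", "P", "Q", "Q"]
ig = information_gain_weights(train, classes)
print(classify(("b", "z"), train, classes, ig))
```

In this toy set the first feature receives all the weight, so a query with an unseen value in the second feature is still matched on the first. Because every stored instance stays available as a potential nearest neighbour, low-frequency or exceptional items can still determine a classification, which is the property the abstract argues greedy abstraction gives up.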