Modeling Information in Textual Data Combining Labeled and Unlabeled Data

Authors: Dunja Mladenic
Venue: Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery
Year: 2002



Abstract

The paper describes two approaches to modeling word normalization (such as replacing "wrote" or "writing" with "write") based on recurring patterns in the word suffix and in the context of the word obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word context. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels. Then we enlarge the training set with the examples with predicted labels for which the model is the most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled rather than unlabeled examples, we give a comparison with the situation in which more labeled data is simply provided.
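
The loop described in the abstract (train on labeled pairs, predict labels for unlabeled words, keep only the most certain predictions, retrain) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the suffix-rule counting model, the confidence measure (relative frequency of a rule for a given suffix), and all names such as `transformation`, `train`, `predict`, and `self_train` are assumptions made purely for illustration, and the context-based feature set is left out.

```python
from collections import Counter, defaultdict

def transformation(word, lemma):
    """Encode a (word, normalized word) pair as (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]

def train(pairs, max_suffix=3):
    """Count which transformations were seen for each word-final suffix (assumed model)."""
    model = defaultdict(Counter)
    for word, lemma in pairs:
        rule = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            model[word[-k:]][rule] += 1
    return model

def predict(model, word, max_suffix=3):
    """Return (normalized word, confidence) using the longest known suffix."""
    for k in range(min(max_suffix, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if not counts:
            continue
        total = sum(counts.values())
        for (strip, add), n in counts.most_common():
            if word.endswith(strip):
                # Confidence is the rule's share among rules seen for this suffix.
                return word[:len(word) - len(strip)] + add, n / total
    return word, 0.0  # unknown suffix: leave the word unchanged, zero confidence

def self_train(labeled, unlabeled, rounds=2, per_round=1):
    """Enlarge the training set with the most confidently predicted examples."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        scored = [(predict(model, w), w) for w in pool]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen = scored[:per_round]
        chosen_words = {w for _, w in chosen}
        labeled += [(w, lemma) for (lemma, _), w in chosen]
        pool = [w for w in pool if w not in chosen_words]
    return train(labeled)

labeled = [("writing", "write"), ("making", "make"), ("walked", "walk")]
model = train(labeled)
print(predict(model, "taking"))
print(predict(model, "jumped"))
```

The two parameters `rounds` and `per_round` (both hypothetical) control how gradually predicted examples are added. A wrong but confident prediction is fed back into training in the same way as a correct one, which is one way such a loop could hurt as well as help.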