Extracting Information from the Web for Concept Learning and Collaborative Filtering

Authors:
William W. Cohen
Venue:
ALT '00 Proceedings of the 11th International Conference on Algorithmic Learning Theory
Year:
2000

Abstract

Previous work on extracting information from the web generally makes few assumptions about how the extracted information will be used. As a consequence, the goal of web-based extraction systems is usually taken to be the creation of high-quality, noise-free data with clear semantics. This is a difficult problem which cannot be completely automated. Here we consider instead the problem of extracting web data for certain machine learning systems: specifically, collaborative filtering (CF) and concept learning (CL) systems. CF and CL systems are highly tolerant of noisy input, and hence much simpler extraction systems can be used in this context. For CL, we will describe a simple method that uses a given set of web pages to construct new features, which reduce the error rate of learned classifiers in a wide variety of situations. For CF, we will describe a simple method that automatically collects useful information from the web without any human intervention. The collected information, represented as "pseudo-users", can be used to "jumpstart" a CF system when the user base is small (or even absent).
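To make the two ideas in the abstract concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that a "pseudo-user" can be approximated by a web page that mentions two or more known items, and that a page-derived feature for CL can be approximated by a boolean "this page mentions the instance" flag. The function names, the substring-matching rule, the overlap-based scoring, and the sample pages are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mentioned_items(page_text, item_names):
    """Return the known item names found in a page (naive case-insensitive substring match)."""
    text = page_text.lower()
    return {name for name in item_names if name.lower() in text}


def build_pseudo_users(pages, item_names):
    """Assumption: a page that mentions at least two items acts as a pseudo-user who 'likes' them."""
    pseudo_users = []
    for page_text in pages:
        liked = mentioned_items(page_text, item_names)
        if len(liked) >= 2:  # pages with fewer mentions carry no co-occurrence signal
            pseudo_users.append(liked)
    return pseudo_users


def page_features(pages, item_names):
    """Assumption: for CL, each page yields one boolean feature per instance (mentioned or not)."""
    return {
        name: [name.lower() in page.lower() for page in pages]
        for name in item_names
    }


def recommend(liked_by_user, profiles, top_n=3):
    """Score unseen items by votes from profiles, weighted by overlap with the user's likes."""
    scores = Counter()
    for profile in profiles:
        overlap = len(liked_by_user & profile)
        if overlap == 0:
            continue  # this profile shares nothing with the user
        for item in profile - liked_by_user:
            scores[item] += overlap
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical data: item names and page texts invented for this sketch.
ITEMS = ["Miles Davis", "John Coltrane", "Bill Evans", "Nirvana", "Pearl Jam"]
PAGES = [
    "My favorite jazz: Miles Davis, John Coltrane and Bill Evans.",
    "Grunge links: Nirvana, Pearl Jam.",
    "Records I own: Miles Davis; Bill Evans; Nirvana.",
    "A page about gardening.",
]

pseudo_users = build_pseudo_users(PAGES, ITEMS)
recommendations = recommend({"Miles Davis"}, pseudo_users)
features = page_features(PAGES, ITEMS)
print(recommendations)
```

In this sketch the recommender has no real users at all: the profiles passed to `recommend` come entirely from pages, which is the "jumpstart" situation the abstract describes. Noisy matches (for example, a page that mentions an item in passing) are tolerated rather than filtered, in keeping with the abstract's point that CF and CL systems are robust to noisy input.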