CRAWLING THE CONSTRUCTION WEB: A MACHINE-LEARNING APPROACH WITHOUT NEGATIVE EXAMPLES

Authors:
Milos Kovacevic; Colin H. Davidson
Affiliations:
University of Belgrade, School of Civil Engineering, Belgrade, Serbia; University of Montreal, School of Architecture, Montreal, Quebec, Canada
Venue:
Applied Artificial Intelligence
Year:
2008

Abstract

Professionals and craftsmen in the construction sector make intensive use of information in their decision-making, yet they use only a small part of the abundant information potentially available to them, particularly on the web. Consequently, designs are impoverished, construction is defective, and innovation is delayed. To facilitate convivial access to focused information, we have developed a question-and-answer (Q-A) system (reported elsewhere). To support this system, we have developed an automated crawler that builds a bank of relevant pages adapted to the needs of this particular industry-user community. The crawler is based on a machine-learning framework in which an intelligent decision unit is trained to distinguish informative pages from nontopic pages. We show that standard approaches, which use both positive and negative classes, are sensitive to noise in the negative class. Because initially only a limited set of positive examples labeled by human experts is available, we propose and evaluate several techniques for learning without negative examples. Our crawler, which uses the positive example based learning (PEBL) framework, collects construction-oriented pages with high precision and a high discovery rate. It can also be used to build domain-specific collections of pages in other scientific or professional contexts.
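
To make the idea of learning without negative examples concrete, the following is a minimal sketch of a decision unit trained on positive pages only. It is not the authors' crawler and not the PEBL algorithm; it swaps in a much simpler technique (a term-frequency centroid with a similarity threshold taken from the positive examples themselves). The class name `PositiveOnlyClassifier`, the `reject_fraction` parameter, and the sample pages are all hypothetical and chosen for illustration.

```python
import math
import re

# Hypothetical positive examples (pages an expert labeled as on-topic).
POSITIVE_PAGES = [
    "reinforced concrete beam design and steel reinforcement detailing",
    "concrete slab formwork and curing on the construction site",
    "steel frame construction and concrete foundation design",
]


def normalize(weights):
    """Scale a term -> weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def tf_vector(text):
    """Return an L2-normalized term-frequency vector for a page."""
    counts = {}
    for term in re.findall(r"[a-z]+", text.lower()):
        counts[term] = counts.get(term, 0) + 1
    return normalize(counts)


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class PositiveOnlyClassifier:
    """Accept a page if it is at least as similar to the centroid of the
    positive examples as the positives themselves are (assumed rule)."""

    def __init__(self, reject_fraction=0.0):
        # Fraction of training positives allowed to fall below the threshold.
        self.reject_fraction = reject_fraction

    def fit(self, positive_pages):
        if not positive_pages:
            raise ValueError("at least one positive example is required")
        vectors = [tf_vector(p) for p in positive_pages]
        centroid = {}
        for vec in vectors:
            for term, weight in vec.items():
                centroid[term] = centroid.get(term, 0.0) + weight
        self.centroid = normalize(centroid)
        # The threshold comes from the positives alone; no negative class is used.
        scores = sorted(dot(vec, self.centroid) for vec in vectors)
        index = min(int(self.reject_fraction * len(scores)), len(scores) - 1)
        self.threshold = scores[index]
        return self

    def score(self, page):
        """Cosine similarity between the page and the positive centroid."""
        return dot(tf_vector(page), self.centroid)

    def is_relevant(self, page):
        return self.score(page) >= self.threshold


classifier = PositiveOnlyClassifier().fit(POSITIVE_PAGES)
for page in ["concrete foundation and steel beam construction",
             "celebrity gossip movie premiere tickets"]:
    print(page, "->", round(classifier.score(page), 3), classifier.is_relevant(page))
```

In a crawler, a score of this kind could be used to decide whether a fetched page is stored and whether its outgoing links are followed. Because the threshold is derived only from expert-labeled positives, the sketch avoids the noisy negative class that the abstract identifies as a weakness of standard two-class approaches, at the cost of a far cruder decision boundary than a trained classifier would give.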