A machine learning approach to building domain-specific search engines

Authors:
Andrew McCallum;Kamal Nigam;Jason Rennie;Kristie Seymore
Affiliations:
Just Research, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Venue:
IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2
Year:
1999

Citing 10
Cited 27

A public library based on full-text retrieval

Communications of the ACM
CiteSeer: an autonomous Web agent for automatic retrieval and identification of interesting publications

AGENTS '98 Proceedings of the second international conference on Autonomous agents
A Web-based information system that reasons with structured collections of text

AGENTS '98 Proceedings of the second international conference on Autonomous agents
Learning to extract symbolic knowledge from the World Wide Web

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Improving Text Classification by Shrinkage in a Hierarchy of Classes

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Using Reinforcement Learning to Spider the Web Efficiently

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Nymble: a high-performance learning name-finder

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Reinforcement learning: a survey

Journal of Artificial Intelligence Research

Focused Crawling Using Context Graphs

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
An Evolutionary Approach to Automatic Web Page Categorization and Updating

WI '01 Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development
A Fast Algorithm for Hierarchical Text Classification

DaWaK 2000 Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery
Metadata Based Web Mining for Topic-Specific Information Gathering

EC-WEB '00 Proceedings of the First International Conference on Electronic Commerce and Web Technologies
Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web

PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Simple Estimators for Relational Bayesian Classifiers

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Leveraging Relational Autocorrelation with Latent Group Models

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Quality and relevance of domain-specific search: A case study in mental health

Information Retrieval
Identifying off-topic student essays without topic-specific training data

Natural Language Engineering
Bi-directional Joint Inference for Entity Resolution and Segmentation Using Imperatively-Defined Factor Graphs

ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II
The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies

Journal of the ACM (JACM)
Semantic tagging of web search queries

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2
Efficient social network approximate analysis on blogosphere based on network structure characteristics

Proceedings of the 3rd Workshop on Social Network Mining and Analysis
SEDE: An ontology for scholarly event description

Journal of Information Science
Collaborative information filtering by using categorized bookmarks on the web

INAP'01 Proceedings of the Applications of prolog 14th international conference on Web knowledge management and decision support
Using complex network features for fast clustering in the web

Proceedings of the 20th international conference companion on World wide web
Statistical approach to estimate the quality of web datasets

CIMMACS'05 Proceedings of the 4th WSEAS international conference on Computational intelligence, man-machine systems and cybernetics
Using content-based and link-based analysis in building vertical search engines

ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization
A new method for focused crawler cross tunnel

RSKT'06 Proceedings of the First international conference on Rough Sets and Knowledge Technology
Querying web images by topic and example specification methods

ADMA'05 Proceedings of the First international conference on Advanced Data Mining and Applications
Common sense reasoning – from Cyc to intelligent assistant

Ambient Intelligence in Everyday Life
Discovery of environmental nodes in the web

IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
Monte Carlo MCMC: efficient inference by approximate sampling

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Improved bibliographic reference parsing based on repeated patterns

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Joint inference of entities, relations, and coreference

Proceedings of the 2013 workshop on Automated knowledge base construction
Text classification using a few labeled examples

Computers in Human Behavior
Mining closed patterns in relational, graph and network data

Annals of Mathematics and Artificial Intelligence

Abstract

Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers, available at www.cora.justresearch.com.