Automating the Construction of Internet Portals with Machine Learning

Authors:
Andrew Kachites McCallum; Kamal Nigam; Jason Rennie; Kristie Seymore
Affiliations:
Just Research and Carnegie Mellon University. mccallum@cs.cmu.edu; Carnegie Mellon University. knigam@cs.cmu.edu; Massachusetts Institute of Technology. jrennie@ai.mit.edu; Carnegie Mellon University. kseymore@ri.cmu.edu
Venue:
Information Retrieval
Year:
2000

Abstract

Domain-specific Internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.