Panorama: extending digital libraries with topical crawlers

Authors:
Gautam Pant;Kostas Tsioutsiouliklis;Judy Johnson;C. Lee Giles
Affiliations:
The University of Iowa, Iowa City, IA;NEC Laboratories America, Inc., Princeton, NJ;NEC Laboratories America, Inc., Princeton, NJ;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Year:
2004

Abstract

A large number of research, technical, and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of the information these documents contain. These libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While tools that index, classify, rank, and retrieve documents from such libraries are important, it would be worthwhile to complement these tools with information available on the Web. We propose one such technique that uses a topical crawler driven by the information extracted from a research document. The goal of the crawler is to harvest a collection of Web pages that are focused on the topical subspaces associated with the given document. The collection created through Web crawling is further processed using lexical and linkage analysis. The entire process is automated and uses machine learning techniques both to guide the crawler and to analyze the collection it fetches. At the end, a report is generated that provides visual cues and information to the researcher.
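
To illustrate the general idea of a topical crawler driven by a document, the following is a minimal sketch and not the authors' implementation. It assumes a plain best-first strategy in which cosine similarity over raw term counts stands in for the machine-learned guidance the abstract mentions. The names `term_vector`, `cosine`, `topical_crawl`, and the `TOY_WEB` in-memory graph are hypothetical, introduced only for this example; a real crawler would fetch pages over HTTP and extract links from HTML.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased word counts; a simple stand-in for real lexical features."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topical_crawl(topic_text, seeds, fetch, max_pages=10):
    """Best-first crawl: always expand the frontier URL with the highest priority.

    `fetch(url)` must return (page_text, outgoing_links), or None if the
    page is unavailable. A link inherits the similarity score of the page
    on which it was found (an assumption made for this sketch).
    """
    topic = term_vector(topic_text)
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited, collection = set(), []
    while frontier and len(collection) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        score = cosine(topic, term_vector(text))
        collection.append((url, score))
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return collection


# Hypothetical in-memory "Web": url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("digital libraries and web crawlers", ["a", "b"]),
    "a": ("topical crawlers for digital libraries", ["c"]),
    "b": ("cooking recipes and garden tips", ["d"]),
    "c": ("focused crawlers harvest topical web pages", []),
    "d": ("more recipes", []),
}

pages = topical_crawl(
    "topical crawlers extending digital libraries",  # text taken from the driving document
    ["seed"],
    TOY_WEB.get,
    max_pages=4,
)
for url, score in pages:
    print(f"{url}\t{score:.2f}")
```

In this toy graph the crawler follows the on-topic branch (`a`, then `c`) before it visits the off-topic page `b`, and the page budget runs out before `d` is fetched. The lexical and linkage analysis of the harvested collection described in the abstract is outside the scope of this sketch.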