Building domain-specific web collections for scientific digital libraries: a meta-search enhanced focused crawling method

Authors:
Jialun Qin;Yilu Zhou;Michael Chau
Affiliations:
The University of Arizona, Tucson, AZ;The University of Arizona, Tucson, AZ;The University of Hong Kong, Pokfulam, Hong Kong
Venue:
Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Year:
2004

Abstract

Collecting domain-specific documents from the Web with focused crawlers is considered one of the most important strategies for building digital libraries that serve the scientific community. However, because most focused crawlers use local search algorithms to traverse the Web, they are easily trapped within a limited sub-graph surrounding the starting URLs, and the resulting domain-specific collections are not comprehensive or diverse enough for scientists and researchers. In this study, we investigated the problems that local search algorithms cause for traditional focused crawlers and proposed a new crawling approach, meta-search enhanced focused crawling, to address them. We conducted two user evaluation experiments to examine the performance of the proposed approach; the results showed that it could build domain-specific collections of higher quality than traditional focused crawling techniques.
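
The trapping problem described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy link graph, the relevance set, the engine names, and the functions `crawl` and `meta_search_urls` are all hypothetical, and a plain breadth-first traversal stands in for whatever local search a real focused crawler would use. It assumes only that meta-search contributes extra starting URLs gathered from several search engines.

```python
from collections import deque

# Hypothetical toy Web: page -> outgoing links. The "a" and "b" pages form
# two on-topic communities with no links between them.
WEB = {
    "a1": ["a2", "a3"],
    "a2": ["a1"],
    "a3": ["a2"],
    "b1": ["b2"],
    "b2": ["b1"],
    "x1": ["a1"],  # off-topic page
}
RELEVANT = {"a1", "a2", "a3", "b1", "b2"}

# Hypothetical stand-in for querying several search engines with domain terms.
META_SEARCH_RESULTS = {
    "engine_one": ["a2", "b1"],
    "engine_two": ["b2", "x1"],
}


def meta_search_urls(results):
    """Merge result lists from all engines, dropping duplicates, keeping order."""
    merged = []
    for urls in results.values():
        for url in urls:
            if url not in merged:
                merged.append(url)
    return merged


def crawl(seeds, extra_urls=()):
    """Follow links only from relevant pages; return relevant pages found."""
    frontier = deque(seeds)
    frontier.extend(extra_urls)  # URLs injected from meta-search, if any
    visited = set()
    collection = []
    while frontier:
        url = frontier.popleft()
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        if url not in RELEVANT:
            continue  # off-topic: do not collect, do not follow its links
        collection.append(url)
        frontier.extend(WEB[url])
    return collection


local_only = crawl(["a1"])
enhanced = crawl(["a1"], meta_search_urls(META_SEARCH_RESULTS))
print(sorted(local_only))
print(sorted(enhanced))
```

In this toy graph the link-only crawl never leaves the community around its seed, while the crawl that also takes search-engine results as starting points reaches the second community. The evaluation reported in the abstract concerns real collections and user judgments, which this sketch does not attempt to reproduce.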