Focused crawling for both topical relevance and quality of medical information

Authors:
Thanh Tin Tang;David Hawking;Nick Craswell;Kathy Griffiths
Affiliations:
ANU, Canberra, Australia;CSIRO ICT Centre, Canberra, Australia;Microsoft Research, Cambridge, UK;ANU, Australia
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005



Abstract

Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful.

To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality.

We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler.

Analysis suggests that quality of content might be improved by post-filtering a very big breadth-first crawl, at the cost of substantially increased network traffic.
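
The prioritisation rule described in the abstract (priority = link relevance × source quality) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The term weights, the relevance vocabulary, the logistic squashing of the quality score, and the names `source_quality`, `link_relevance` and `Frontier` are all assumptions made for illustration; the paper learns its weighted query from data and predicts relevance with its own model over link anchor context.

```python
import heapq
import math
import re

# Hypothetical weights for single words and adjacent word pairs.
# The paper learns such a query; these values are invented for illustration.
QUALITY_WEIGHTS = {
    "evidence": 1.0,
    "clinical": 0.8,
    ("cognitive", "therapy"): 1.5,
    "miracle": -1.2,
}

# Hypothetical topical vocabulary used to score anchor context.
RELEVANCE_TERMS = {"depression", "antidepressant", "mood"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def source_quality(page_text):
    """Score the whole linking page with weighted words and word pairs.

    The raw weighted sum is squashed into (0, 1) with a logistic function
    (an assumption) so that it can be multiplied with a relevance score.
    """
    tokens = tokenize(page_text)
    singles = set(tokens)
    pairs = set(zip(tokens, tokens[1:]))
    raw = sum(
        weight
        for term, weight in QUALITY_WEIGHTS.items()
        if term in singles or term in pairs
    )
    return 1.0 / (1.0 + math.exp(-raw))


def link_relevance(anchor_context):
    """Fraction of the topical vocabulary found in the link's anchor context."""
    tokens = set(tokenize(anchor_context))
    return len(tokens & RELEVANCE_TERMS) / len(RELEVANCE_TERMS)


class Frontier:
    """Crawl frontier ordered by link relevance times source quality."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def add(self, url, anchor_context, linking_page_text):
        """Queue an unfetched URL, scored from its anchor context and source page."""
        if url in self._seen:
            return
        self._seen.add(url)
        priority = link_relevance(anchor_context) * source_quality(linking_page_text)
        # heapq is a min-heap, so the priority is negated to pop the best link first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the (url, priority) pair with the highest priority."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In this sketch a link is fetched early only when its anchor context looks on-topic and the page it was found on scores well for quality. Because the two scores are multiplied, a link with no topical terms in its anchor context gets priority zero regardless of how good the linking page is.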