Marie-4: A High-Recall, Self-Improving Web Crawler That Finds Images Using Captions

Authors: Neil C. Rowe
Venue: IEEE Intelligent Systems
Year: 2002

Abstract

Finding multimedia objects to meet some need is considerably harder on the World Wide Web than finding text, because content-based retrieval of multimedia is much harder than text retrieval and caption text is inconsistently placed. We describe MARIE-4, a Web "crawler" and caption filter we have developed that searches the Web to find text likely to be image captions, together with the associated image objects. Rather than examining a few features as existing systems do, it uses a broad set of criteria, including some novel ones, to yield higher recall than competing systems, which generally focus on high precision. We tested these criteria in careful experiments that extracted 8140 caption candidates for 4585 representative images, and quantified for the first time the relative value of several kinds of clues for captions. The crawler is self-improving in that it gathers further statistics from experience to serve as positive and negative clues. We index the results found by the crawler and provide a user interface. We have built a demonstration implementation of a Web search engine for all 667,573 publicly-accessible U.S. Navy Web images.