Design and implement a web news retrieval system
KES'05 Proceedings of the 9th international conference on Knowledge-Based Intelligent Information and Engineering Systems - Volume Part III
Currently available web news retrieval systems face a number of problems, because web-based news retrieval requires quickly and accurately processing a very large amount of data that is constantly being updated. In this paper, we present the development of an intelligent distributed web news retrieval system whose goal is to accurately retrieve and organize web news information. It includes three components: a novel optimized crawler algorithm whose fetching speed is several times faster than that of a traditional crawler; a keen tag-based extraction algorithm that extracts data-rich content with minimal manual effort and also classifies data as important or not important, so that the crawler can revisit and update important data; and a modified MapReduce, improved by estimating the execution time of each subtask, which is proven to reduce the number of unusual tasks and shorten the overall job execution time.
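The abstract does not say how the execution time of each subtask is estimated, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes each subtask reports its elapsed time and a progress fraction, extrapolates a total running time linearly, and flags subtasks whose estimate is far above the median. The names `TaskProgress`, `estimate_total_time`, `find_slow_tasks`, and the `slack` factor are all hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass
class TaskProgress:
    """Hypothetical progress report for one subtask."""
    task_id: str
    elapsed_s: float      # seconds since the subtask started
    fraction_done: float  # progress in the range 0.0 to 1.0


def estimate_total_time(task: TaskProgress) -> float:
    """Assumed estimator: linear extrapolation, total = elapsed / progress."""
    return task.elapsed_s / task.fraction_done


def find_slow_tasks(tasks: list[TaskProgress], slack: float = 1.5) -> list[str]:
    """Return ids of subtasks whose estimated total time exceeds
    slack * median estimate. Subtasks with no progress yet are skipped,
    since no estimate can be extrapolated for them."""
    estimates = {
        t.task_id: estimate_total_time(t)
        for t in tasks
        if t.fraction_done > 0
    }
    if not estimates:
        return []
    median = statistics.median(estimates.values())
    return [tid for tid, est in estimates.items() if est > slack * median]


if __name__ == "__main__":
    reports = [
        TaskProgress("map-0", elapsed_s=10.0, fraction_done=0.5),
        TaskProgress("map-1", elapsed_s=12.0, fraction_done=0.6),
        TaskProgress("map-2", elapsed_s=30.0, fraction_done=0.3),
    ]
    print(find_slow_tasks(reports))
```

A scheduler could rerun or reassign the flagged subtasks. The median is used here because a few very slow subtasks would distort a mean; that choice, like the threshold, is an assumption of this sketch.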