Development of an intelligent distributed news retrieval system

Authors:
James N. K. Liu;K. C. Choi;J. Y. Chai
Affiliations:
Department of Computing, The Polytechnic University of Hong Kong, Hung Hom, Kowloon, Hong Kong, China;Department of Computing, The Polytechnic University of Hong Kong, Hung Hom, Kowloon, Hong Kong, China;Department of Computing, The Polytechnic University of Hong Kong, Hung Hom, Kowloon, Hong Kong, China
Venue:
International Journal of Knowledge-based and Intelligent Engineering Systems
Year:
2012

Abstract

Web news retrieval requires the ability to quickly and accurately process a very large amount of data that is constantly being updated, and currently available web news retrieval systems face a number of problems in meeting this requirement. In this paper, we present the development of an intelligent distributed web news retrieval system whose goal is to accurately retrieve and organize web news information. It includes: a novel optimized crawler algorithm whose fetching speed is several times faster than that of a traditional crawler; a keen tag-based extraction algorithm that extracts data-rich content with minimal manual effort and classifies data as important or not important, so that the crawler can revisit and update important data; and a modified MapReduce, improved by estimating the execution time of each subtask, which is proven to reduce the number of unusual tasks and shorten the overall job execution time.
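
The abstract does not say how the per-subtask time estimates are used inside the modified MapReduce. As a rough illustration of why such estimates can shorten a job, the sketch below assumes one simple possibility: a greedy longest-first assignment of subtasks to workers. The function name `schedule_by_estimate`, the subtask names, and the estimate values are all hypothetical and do not come from the paper.

```python
import heapq


def schedule_by_estimate(estimates, num_workers):
    """Greedy longest-first assignment of subtasks to workers.

    Illustrative only: this is an assumed policy, not the paper's algorithm.

    estimates: dict mapping a subtask name to its estimated run time.
    Returns (assignment, makespan), where assignment maps a worker index
    to its list of subtask names and makespan is the largest estimated
    total load on any single worker.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    # Min-heap of (estimated load so far, worker index).
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Place the longest subtasks first, each on the least-loaded worker,
    # so that no worker is handed a long subtask late in the job.
    ordered = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))

    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    # Hypothetical estimates, in seconds.
    plan, makespan = schedule_by_estimate(
        {"a": 8, "b": 5, "c": 4, "d": 3, "e": 2}, num_workers=2
    )
    print(plan, makespan)
```

With these made-up estimates the two workers end up with equal estimated loads, which is the effect an estimate-aware scheduler aims for: fewer subtasks that finish far later than the rest.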