Development of a large-scale web crawler and search engine infrastructure

Authors:
Susumu Akamine;Yoshikiyo Kato;Daisuke Kawahara;Keiji Shinzato;Kentaro Inui;Sadao Kurohashi;Yutaka Kidawara
Affiliations:
National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan;National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan;National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan;Kyoto University, Yoshida Honmachi, Kyoto, Japan;National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan and Nara Institute of Science and Technology, Ikoma, Nara, Japan;National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan and Kyoto University, Yoshida Honmachi, Kyoto, Japan;National Institute of Information and Communications Technology, Soraku-gun, Kyoto, Japan
Venue:
Proceedings of the 3rd International Universal Communication Symposium
Year:
2009

Abstract

This paper reports the ongoing development of a large-scale Web crawler and search engine infrastructure at the National Institute of Information and Communications Technology. The infrastructure has the following characteristics: (1) It collects one billion Japanese Web pages and keeps them up to date. (2) It selects 100 million of the collected pages and converts them into a standard data format that stores the results of morphological analysis, dependency parsing, and synonym augmentation. (3) The selected pages are searchable and accessible to users. (4) Scalability is achieved by using a large-scale cluster machine for distributed data processing.