This paper reports the ongoing development of a large-scale Web crawler and search engine infrastructure at the National Institute of Information and Communications Technology. The infrastructure has the following characteristics: (1) It collects one billion Japanese Web pages and keeps them up-to-date. (2) It selects 100 million pages from the collected set and converts them into a standard data format that stores the results of morphological analysis, dependency parsing, and synonym augmentation. (3) The selected pages are searchable and accessible to users. (4) Scalability is achieved by distributing the data processing across a large-scale cluster.
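The abstract does not specify what the standard data format looks like. As a minimal illustration only, the Python sketch below shows one plausible per-page record that bundles the three analysis layers named above (morphological analysis, dependency parsing, synonym augmentation) into a single serializable structure. All class and field names here are hypothetical and do not come from the paper.

# Hypothetical per-page "standard format" record. These field names
# are illustrative only; the paper's actual schema is not given.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Morpheme:
    surface: str                              # token as it appears in the page text
    pos: str                                  # part-of-speech tag from morphological analysis
    synonyms: list = field(default_factory=list)  # terms added by synonym augmentation

@dataclass
class Dependency:
    head: int                                 # index of the governing morpheme
    dependent: int                            # index of the dependent morpheme

@dataclass
class PageRecord:
    url: str
    morphemes: list                           # list[Morpheme]
    dependencies: list                        # list[Dependency], from dependency parsing

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses for serialization.
        return json.dumps(asdict(self), ensure_ascii=False)

# Toy example: a two-morpheme Japanese phrase with one dependency arc.
record = PageRecord(
    url="http://example.jp/page",
    morphemes=[
        Morpheme("検索", "noun", synonyms=["サーチ"]),
        Morpheme("エンジン", "noun"),
    ],
    dependencies=[Dependency(head=1, dependent=0)],
)
print(record.to_json())

A flat, self-contained record like this would be convenient for the distributed processing the abstract mentions, since each page can be analyzed, serialized, and indexed independently on any node of the cluster.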