High performance index build algorithms for intranet search engines

Authors:
Marcus Fontoura;Eugene Shekita;Jason Y. Zien;Sridhar Rajagopalan;Andreas Neumann
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004


Abstract

There has been a substantial amount of research on high-performance algorithms for constructing an inverted text index. However, constructing the inverted index in an intranet search engine is only the final step in a more complicated index build process. Among other things, this process requires an analysis of all the data being indexed to compute measures like PageRank. The time to perform this global analysis step is significant compared to the time to construct the inverted index, yet it has not received much attention in the research literature. In this paper, we describe how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality.
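To make the idea of index construction by radix sorting concrete, the following is a minimal sketch, not the algorithm described in the paper. It assumes postings are (term_id, doc_id) pairs whose ids each fit in 16 bits, packs each pair into one integer key, and sorts the keys with a least-significant-digit radix sort before grouping them into posting lists. The function names `radix_sort_postings` and `build_inverted_index`, the whitespace tokenizer, and the bit widths are all hypothetical choices made for illustration.

```python
def radix_sort_postings(postings, key_bits=16, radix_bits=8):
    """LSD radix sort of (term_id, doc_id) pairs.

    Assumption: both ids are non-negative and smaller than 2**key_bits.
    """
    # Pack each pair so that integer order equals (term_id, doc_id) order.
    keys = [(term_id << key_bits) | doc_id for term_id, doc_id in postings]
    mask = (1 << radix_bits) - 1
    # One stable bucketing pass per digit, least significant digit first.
    for shift in range(0, 2 * key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]
    doc_mask = (1 << key_bits) - 1
    return [(key >> key_bits, key & doc_mask) for key in keys]


def build_inverted_index(docs):
    """Build term -> sorted list of doc ids from a list of document strings."""
    term_ids = {}
    postings = []
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for token in set(text.lower().split()):
            term_id = term_ids.setdefault(token, len(term_ids))
            postings.append((term_id, doc_id))
    id_to_term = {term_id: term for term, term_id in term_ids.items()}
    index = {}
    # After sorting, postings for the same term are adjacent and doc ids ascend.
    for term_id, doc_id in radix_sort_postings(postings):
        index.setdefault(id_to_term[term_id], []).append(doc_id)
    return index


docs = ["intranet search engine", "search index build", "index build engine"]
index = build_inverted_index(docs)
```

Because every pass is a stable bucketing step over fixed-width keys, the sort runs in time linear in the number of postings for a fixed key width, which is the usual reason radix sorting is attractive for this kind of workload.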