Clustering and load balancing optimization for redundant content removal

  • Authors:
  • Shanzhong Zhu;Alexandra Potapova;Maha Alabduljalil;Xin Liu;Tao Yang

  • Affiliations:
  • Ask.com, Oakland, CA, USA;University of California at Santa Barbara, Santa Barbara, CA, USA;University of California at Santa Barbara, Santa Barbara, CA, USA;Amazon, Seattle, WA, USA;University of California at Santa Barbara, Santa Barbara, CA, USA

  • Venue:
  • Proceedings of the 21st international conference companion on World Wide Web
  • Year:
  • 2012

Abstract

Removing redundant content is an important data processing operation in search engines and other web applications. An offline approach can be important for reducing the engine's cost, but it is challenging to scale such an approach to a large data set that is updated continuously. This paper discusses our experience in developing a scalable approach with parallel clustering that detects and removes near duplicates incrementally when processing billions of web pages. It presents a multidimensional mapping to balance the load among multiple machines. It further describes several approximation techniques to efficiently manage distributed duplicate groups with transitive relationships. The experimental results evaluate the efficiency and accuracy of the incremental clustering, assess the effectiveness of the multidimensional mapping, and demonstrate the impact on online cost reduction and search quality.
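
The abstract does not spell out how duplicate groups are maintained, so the following is only an illustrative sketch of what a transitive relationship among near duplicates implies: if page A is a near duplicate of B, and B of C, all three end up in one group. It assumes pairwise near-duplicate decisions are already available and uses a standard single-machine union-find structure, which is a substitute for, not a reproduction of, the paper's distributed approximation techniques. The class name `DuplicateGroups` and the page identifiers are hypothetical.

```python
from collections import defaultdict


class DuplicateGroups:
    """Union-find over page IDs (hypothetical helper, not the paper's method).

    Pairs reported as near duplicates are merged transitively into groups.
    """

    def __init__(self):
        self.parent = {}

    def find(self, page):
        # Register unseen pages as their own group representative.
        self.parent.setdefault(page, page)
        root = page
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited page directly at the root.
        while self.parent[page] != root:
            self.parent[page], page = root, self.parent[page]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller ID as representative so results are deterministic.
            if rb < ra:
                ra, rb = rb, ra
            self.parent[rb] = ra

    def groups(self):
        out = defaultdict(list)
        for page in list(self.parent):
            out[self.find(page)].append(page)
        return {rep: sorted(members) for rep, members in out.items()}


# Assumed input: pairs already judged to be near duplicates.
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
dg = DuplicateGroups()
for x, y in pairs:
    dg.union(x, y)
result = dg.groups()
```

New pairs can be fed to `union` as they arrive, which loosely mirrors the incremental setting the abstract describes; keeping one representative per group is what would allow the redundant members to be dropped.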