This thesis studies distance approximation in two closely related models: the streaming model and the two-party communication model.

In the streaming model, a massive data stream is presented in an arbitrary order to a randomized algorithm that tries to approximate certain statistics of the data with only a few (usually one) passes over the data. For instance, the data may be a flow of packets on the internet or a set of records in a huge database. The size of the data necessitates the use of extremely efficient randomized approximation algorithms. Problems of interest include approximating the number of distinct elements, approximating the surprise index of a stream, or more generally, approximating the norm of a dynamically changing vector in which coordinates are updated multiple times in an arbitrary order.

In the two-party communication model, there are two parties who wish to efficiently compute a relation of their inputs. We consider the problem of approximating L_p distances for any p ≥ 0. It turns out that lower bounds on the communication complexity of these relations yield lower bounds on the memory required of streaming algorithms for the problems listed above. Moreover, upper bounds in the streaming model translate to constant-round protocols in the communication model with communication proportional to the memory required of the streaming algorithm. The communication model also has its own applications, such as secure data mining, where in addition to low communication, the goal is to prevent either party from learning more about the other's input than what follows from the output and his/her own private input.

We develop new algorithms and lower bounds that resolve key open questions in both of these models. The highlights of the results are as follows.
(1) We give an Ω(1/ε^2) lower bound for approximating the number of distinct elements of a data stream in one pass to within a (1 ± ε) factor with constant probability, as well as the p-th frequency moment F_p for any p ≥ 0. This is tight up to very small factors, and greatly improves upon the earlier Ω(1/ε) lower bound for these problems. It also gives the same quadratic improvement for the communication complexity of 1-round protocols for approximating the L_p distance for any p ≥ 0.

(2) We give a 1-pass Õ(m^(1-2/p))-space streaming algorithm for (1 ± ε)-approximating the L_p norm of an m-dimensional vector presented as a data stream for any p ≥ 2. This algorithm improves the previous Õ(m^(1-1/(p-1))) bound, and is optimal up to polylogarithmic factors. As a special case, our algorithm can be used to approximate the frequency moments F_p of a data stream with the same optimal amount of space. This resolves the main open question of the 1996 paper by Alon, Matias, and Szegedy.

(3) In the two-party communication model, we give a protocol for privately approximating the Euclidean distance (L_2) between two m-dimensional vectors, held by different parties, with only polylog m communication and O(1) rounds. This tremendously improves upon the earlier protocol of Feigenbaum, Ishai, Malkin, Nissim, Strauss, and Wright, which achieved O(√m) communication for privately approximating the Hamming distance only.

This thesis also contains several previously unpublished results concerning the first item above, including new lower bounds for the communication complexity of approximating the L_p distances when the vectors are uniformly distributed and the protocol is only correct for most inputs, as well as tight lower bounds for the multiround complexity for a restricted class of protocols that we call linear.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
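The link between the two models, where a small-space streaming sketch becomes a low-communication protocol, can be illustrated with a minimal sketch. The code below is not one of the thesis's algorithms: it uses the classic random-sign (AMS-style) linear sketch to estimate the squared Euclidean distance, it is not private, and it draws fully independent signs from a shared seed as a stand-in for shared randomness. All names (`sign_sketch`, `one_round_protocol`) and the parameters `k` and `seed` are hypothetical choices for illustration.

```python
import random


def sign_sketch(vector, k, seed):
    """Return k counters, each the dot product of `vector` with a random +/-1 vector.

    The same (k, seed, len(vector)) always reproduces the same signs, so two
    parties with a shared seed build compatible sketches.
    """
    rng = random.Random(seed)  # assumption: the shared seed models shared randomness
    sketch = []
    for _ in range(k):
        total = 0
        for value in vector:
            sign = 1 if rng.getrandbits(1) else -1
            total += sign * value
        sketch.append(total)
    return sketch


def estimate_squared_distance(sketch_x, sketch_y):
    """Average the squared counter differences.

    Sketching is linear, so sketch_x - sketch_y is a sketch of x - y, and each
    squared counter is an unbiased estimate of the squared Euclidean norm of x - y.
    """
    diffs = [a - b for a, b in zip(sketch_x, sketch_y)]
    return sum(d * d for d in diffs) / len(diffs)


def one_round_protocol(x, y, k=400, seed=12345):
    """Simulate a one-round protocol: Alice sends k counters, Bob outputs the estimate."""
    message = sign_sketch(x, k, seed)  # Alice's only message: k integers
    local = sign_sketch(y, k, seed)    # Bob sketches his own input with the same signs
    return estimate_squared_distance(message, local)
```

The relative standard deviation of this estimator is at most about sqrt(2/k), so choosing k on the order of 1/ε^2 counters gives a (1 ± ε)-approximation with constant probability, which is the same 1/ε^2 dependence that the lower bound in item (1) shows cannot be avoided in one round.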