Streaming Euclidean k-median and k-means with o(log n) Space


Samson Zhou


2023-12-13 16:00:00 ~ 2023-12-13 17:30:00






We consider the classic Euclidean k-median and k-means objective on insertion-only streams, where the goal is to maintain a (1+epsilon)-approximation to the k-median or k-means, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, sensitivity sampling, and so on. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least Omega(log(n Delta)) words of memory, where n is the size of the input and Delta is the aspect ratio. A natural question is if one can beat this logarithmic dependence on n and Delta. In this paper, we break this barrier by giving the first algorithm that achieves a (1+epsilon)-approximation to the more general (k,z)-clustering problem, using only ~O(dk/varepsilon^2)*(2^{z log z})*min(1/epsilon^z, k)*poly(log log(n Delta)) words of memory.
Joint work with Vincent Cohen-Addad and David P. Woodruff.


Samson Zhou is an assistant professor at Texas A&M University. His current research interests are on the theoretical foundations of data science, including sublinear algorithms, machine learning, security/privacy, and numerical linear algebra. Prior to his current position, he held postdoctoral researcher positions at UC Berkeley and Carnegie Mellon University, among others. He received a PhD in computer science from Purdue University and bachelor degrees in mathematics and computer science from MIT.

© John Hopcroft Center for Computer Science, Shanghai Jiao Tong University

邮箱:jhc@sjtu.edu.cn 电话:021-54740299