Dimension Independent Similarity Computation using MapReduce (DIMSUM)

Reza Zadeh - Stanford University

Dec. 10, 2014, 10:30 a.m. - Dec. 10, 2014, noon

MC 103


We compute the singular values of an m x n sparse matrix A in a distributed setting, without communication dependence on m, which is useful for very large m. In particular, we give a simple nonadaptive sampling scheme where the singular values of A are estimated within relative error with constant probability. Our proven bounds focus on the Spark framework, which has become the de facto tool for handling such large matrices that cannot be stored or even streamed through a single machine. An implementation is available in Spark and Scalding.

Reza Zadeh is a consulting professor at Stanford University, and Technical Advisor to Databricks, conducting research and teaching
courses targeting doctorate students. He focuses on Discrete Applied Mathematics, Machine Learning Theory and Applications, and Large-Scale Distributed Computing. He got his Ph.D. degree in Computational Mathematics from Stanford University, and Masters degree in Computer Science from Carnegie Mellon University. He is a recipient of the Gene Golub Outstanding Thesis Award from Stanford University.