IEEE COMMUNICATIONS LETTERS, VOL. 23, NO. 1, JANUARY 2019

Erasure Coding for Distributed Matrix Multiplication for Matrices With Bounded Entries

Li Tang, Konstantinos Konstantinidis, and Aditya Ramamoorthy

Abstract—Distributed matrix multiplication is widely used in several scientific domains. It is well recognized that computation times on distributed clusters are often dominated by the slowest workers (called stragglers). Recent work has demonstrated that straggler mitigation can be viewed as a problem of designing erasure codes. For matrices A and B, the technique essentially maps the computation of A^T B into the multiplication of smaller (coded) submatrices. The stragglers are treated as erasures in this process. The computation can be completed as long as a certain number of workers (called the recovery threshold) complete their assigned tasks. We present a novel coding strategy for this problem when the absolute values of the matrix entries are sufficiently small. We demonstrate a tradeoff between the assumed absolute value bounds on the matrix entries and the recovery threshold. At one extreme, we are optimal with respect to the recovery threshold, and at the other extreme, we match the threshold of prior work. Experimental results on cloud-based clusters validate the benefits of our method.

Index Terms—Distributed computing, erasure codes, stragglers.

I. INTRODUCTION

The multiplication of large-dimensional matrices is a key problem that is at the heart of several big data computations. For example, high-dimensional deep learning problems often require matrix-vector products at every iteration. In most of these problems the sheer size of the matrices precludes computation on a single machine. Accordingly, the computation is typically performed in a distributed fashion across several computation units (or workers). The overall job execution time in these systems is typically dominated by the slowest worker; this is often referred to as the "straggler problem".

In recent years, techniques from coding theory have been efficiently utilized in mitigating the effect of stragglers. As pointed out in [1] (see [1, Appendix B]), this issue can be viewed as equivalent to coding for fault tolerance over a channel where the stragglers can be viewed as erasures.

More specifically, the work of [1] considers the distributed computation of the product of two large matrices A^T and B. Matrices A and B are first partitioned into p × m and p × n blocks of submatrices of equal size by the master node. Each worker is assumed to have enough memory to store the equivalent of a single submatrix of A and a single submatrix of B. The master node does some basic processing on its end and sends appropriately coded submatrices to each worker. The workers multiply their stored (coded) submatrices and return the result to the master. The key result of [1] shows that the product A^T B can be recovered as long as any τ = pmn + p − 1 workers complete their computation; the value τ is called the recovery threshold of the computation.

Interestingly, similar ideas (relating matrix multiplication to polynomial interpolation) were investigated in a different context by Yagle [2] in the mid-90s. However, the motivation for that work was fast matrix multiplication using pseudo-number-theoretic transforms, rather than fault tolerance. There have been other contributions in this area [3]–[7] as well, some of which predate [1].

Main Contributions: In this work, we demonstrate that as long as the entries in A and B are bounded by sufficiently small numbers, the recovery threshold (τ) can be significantly reduced as compared to the approach of [1]. Specifically, the recovery threshold in our work can be of the form p′mn + p′ − 1, where p′ is a divisor of p. Thus, we can achieve thresholds as low as mn (which is optimal), depending on our assumptions on the matrix entries. We show that the required upper bound on the matrix entries can be traded off with the corresponding threshold in a simple manner. Finally, we present experimental results that demonstrate the superiority of our method via an Amazon Web Services (AWS) implementation.

II. PROBLEM FORMULATION

Let A (of size v × r) and B (of size v × t) be two integer matrices. We are interested in computing C = A^T B in a distributed fashion. Specifically, each worker node can store a 1/mp fraction of matrix A and a 1/np fraction of matrix B. The job given to the worker node is to compute the product of the submatrices assigned to it. The master node waits for a sufficient number of the submatrix products to be communicated to it. It then determines the final result after further processing at its end. More precisely, matrices A and B are first block decomposed as follows:

    A = [A_{ij}], 0 ≤ i ≤ p − 1, 0 ≤ j ≤ m − 1, and
    B = [B_{ij}], 0 ≤ i ≤ p − 1, 0 ≤ j ≤ n − 1,

where each A_{ij} is of size v/p × r/m and each B_{ij} is of size v/p × t/n. The product C = A^T B can then be assembled block-wise as C = [C_{jk}], where C_{jk} = Σ_{i=0}^{p−1} A_{ij}^T B_{ik} for 0 ≤ j ≤ m − 1 and 0 ≤ k ≤ n − 1. Throughout, L denotes an upper bound on the absolute values of the entries of A and B.
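As a concrete illustration of this formulation, the following NumPy snippet (a minimal sketch of ours, not part of the original letter or its code release [8]) block decomposes randomly generated integer matrices and checks the identity C_{jk} = Σ_i A_{ij}^T B_{ik}; the dimensions and the entry range are arbitrary choices.

```python
import numpy as np

# Illustrative sizes (our choice, not from the letter): v, r, t must be
# divisible by p, m, n respectively so that all blocks have equal size.
p, m, n = 4, 2, 2
v, r, t = 8, 6, 6

rng = np.random.default_rng(0)
A = rng.integers(0, 51, size=(v, r))  # integer entries, here in {0, ..., 50}
B = rng.integers(0, 51, size=(v, t))

# Block decomposition: A_{ij} is v/p x r/m, B_{ik} is v/p x t/n.
Ablk = [[A[i * v // p:(i + 1) * v // p, j * r // m:(j + 1) * r // m]
         for j in range(m)] for i in range(p)]
Bblk = [[B[i * v // p:(i + 1) * v // p, k * t // n:(k + 1) * t // n]
         for k in range(n)] for i in range(p)]

# Assemble C = A^T B block-wise: C_{jk} = sum_i A_{ij}^T B_{ik}.
C = np.block([[sum(Ablk[i][j].T @ Bblk[i][k] for i in range(p))
               for k in range(n)] for j in range(m)])

assert np.array_equal(C, A.T @ B)
```

In the distributed setting, each worker holds a coded combination of such submatrices, which matches the 1/mp and 1/np storage fractions above.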
[…]

Example 1: Let m = n = 2, p = 4, and p′ = 2, so that

    A = [ A_{00} A_{01} ]            B = [ B_{00} B_{01} ]
        [ A_{10} A_{11} ]                [ B_{10} B_{11} ]
        [ A_{20} A_{21} ]    and         [ B_{20} B_{21} ] .
        [ A_{30} A_{31} ]                [ B_{30} B_{31} ]

We let

    Ã(s, z) = A_{00} + A_{10} s^{−1} + (A_{20} + A_{30} s^{−1}) z
                  + (A_{01} + A_{11} s^{−1}) z^2 + (A_{21} + A_{31} s^{−1}) z^3, and

    B̃(s, z) = (B_{00} + B_{10} s) z + B_{20} + B_{30} s
                  + (B_{01} + B_{11} s) z^5 + (B_{21} + B_{31} s) z^4.

The product of the above polynomials can be verified to contain the useful terms, i.e., the blocks C_{jk}, in the coefficients of z, z^3, z^5, and z^7; the others are interference terms. For this scheme the corresponding |X_{ij}| (the entries of the coded submatrices sent to the workers) can be at most 2L^2, though the recovery threshold is 9. Applying the method of Section III-B would instead result in the |X_{ij}| values being bounded by 8L^4, with a threshold of 4.
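The claim in Example 1 can be checked mechanically. The sketch below is ours, written under the reading that each useful block C_{jk} appears as the s-free part of the corresponding z-coefficient, with the s and s^{−1} parts being the interference; it expands Ã(s, z)^T B̃(s, z) as a bivariate polynomial with matrix coefficients. The block size and entry range are arbitrary illustrative choices.

```python
import numpy as np

# Small integer blocks standing in for the A_{ij}, B_{ij} of Example 1;
# the block size b and the entry range are arbitrary illustrative choices.
p, m, n, b = 4, 2, 2, 3
rng = np.random.default_rng(1)
Ab = [[rng.integers(-5, 6, (b, b)) for j in range(m)] for i in range(p)]
Bb = [[rng.integers(-5, 6, (b, b)) for k in range(n)] for i in range(p)]
Z = np.zeros((b, b), dtype=np.int64)

# Atilde and Btilde from Example 1, stored as: z-degree -> {s-power: block}.
At = {0: {0: Ab[0][0], -1: Ab[1][0]}, 1: {0: Ab[2][0], -1: Ab[3][0]},
      2: {0: Ab[0][1], -1: Ab[1][1]}, 3: {0: Ab[2][1], -1: Ab[3][1]}}
Bt = {1: {0: Bb[0][0], 1: Bb[1][0]}, 0: {0: Bb[2][0], 1: Bb[3][0]},
      5: {0: Bb[0][1], 1: Bb[1][1]}, 4: {0: Bb[2][1], 1: Bb[3][1]}}

# Expand Atilde^T Btilde: for every monomial z^d s^e, accumulate the sum
# of the corresponding transposed block products.
prod = {}
for da, terms_a in At.items():
    for db, terms_b in Bt.items():
        coeff = prod.setdefault(da + db, {})
        for ea, Ma in terms_a.items():
            for eb, Mb in terms_b.items():
                coeff[ea + eb] = coeff.get(ea + eb, Z) + Ma.T @ Mb

# The s-free (s^0) parts of the z, z^3, z^5, z^7 coefficients are exactly
# the useful blocks C_{jk} = sum_i A_{ij}^T B_{ik}; all else is interference.
useful = {(0, 0): 1, (1, 0): 3, (0, 1): 5, (1, 1): 7}
for (j, k), d in useful.items():
    Cjk = sum(Ab[i][j].T @ Bb[i][k] for i in range(p))
    assert np.array_equal(prod[d][0], Cjk)
```

In the actual scheme, worker ℓ would receive the numeric evaluations Ã(s, z_ℓ) and B̃(s, z_ℓ) for a suitably chosen s; since the product has degree 8 in z, the master can interpolate all of its z-coefficients from any 9 completed workers, which matches the stated recovery threshold of 9.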
V. EXPERIMENTAL RESULTS AND DISCUSSION

We ran our experiments on AWS EC2 r3.large instances. Our code is available online [8]. The input matrices A and B were randomly generated integer matrices of size 8000 × 8000 with elements in the set {0, 1, ..., 50}. These matrices were pre-generated (for the different straggler counts) and remained the same for all experiments. The master node was responsible for the 2 × 2 block decomposition of A and B, computing Ã(s, z_i) and B̃(s, z_i) for i = 1, ..., 10, and sending them to the worker nodes. The evaluation points (the z_i's) were chosen as 10 equally spaced reals within the interval [−1, 1]. The stragglers were simulated by having S randomly chosen machines perform their local computation twice.

We compared the performance of our method (cf. Section III) with [1]. For fairness, we chose the same evaluation points in both methods. In fact, the choice of points in their code available online [9] (which we adapted for the case when p > 1) provides worse results than those reported here.

Computation latency refers to the elapsed time from the point when all workers have received their inputs until enough of them have returned their results for the master to complete the decoding. Fig. 1 compares the computation latencies of the two methods for different numbers of stragglers S; our scheme is superior in the presence of two or more stragglers (average latency ≥ 15.65 seconds).

Real Vandermonde matrices are well known to have bad condition numbers. The condition number is better when we consider complex Vandermonde matrices with entries from the unit circle [10]. In our method, the |X_{ij}| and |Y_{ij}| values can be quite large. This introduces small errors in the decoding process. Let Ĉ be the decoded matrix and C = A^T B be the actual product. Our error metric is e = ||Ĉ − C||_F / ||C||_F (the subscript F refers to the Frobenius norm). The results in Fig. 1 had an error e of at most 10^{−7}. We studied the effect of increasing the average value of the entries in A and B in Table I. The error is consistently low up to a bound of L = 1000, following which the calculation is useless owing to numerical overflow issues. We point out that in our experiments the error e was identically zero if the z_i's were chosen from the unit circle. However, this requires complex multiplication, which increases the computation time.

REFERENCES

[1] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr. (2018). "Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding." [Online]. Available: https://arxiv.org/abs/1801.07487
[2] A. E. Yagle, "Fast algorithms for matrix multiplication using pseudo-number-theoretic transforms," IEEE Trans. Signal Process., vol. 43, no. 1, pp. 71–76, Jan. 1995.
[3] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 2100–2108.
[4] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
[5] K. Lee, C. Suh, and K. Ramchandran, "High-dimensional coded matrix multiplication," in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2017, pp. 2418–2422.
[6] A. Mallick, M. Chaudhari, and G. Joshi. (2018). "Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication." [Online]. Available: https://arxiv.org/abs/1804.10331
[7] S. Wang, J. Liu, and N. Shroff, "Coded sparse matrix multiplication," in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2018, pp. 5152–5160.
[8] Repository of Erasure Coding for Distributed Matrix Multiplication for Matrices With Bounded Entries. Accessed: 2018. [Online]. Available: https://bitbucket.org/kkonstantinidis/stragglermitmm/src/master
[9] Repository of Polynomial Code for Prior Implementation. Accessed: 2018. [Online]. Available: https://github.com/AvestimehrResearchGroup/Polynomial-Code
[10] W. Gautschi, "How (un)stable are Vandermonde systems?" in Asymptotic and Computational Analysis (Lecture Notes in Pure and Applied Mathematics), vol. 124. New York, NY, USA: Dekker, 1990, pp. 193–210.