
Distributed ML



These notes are from week 2 ("Distributed ML") of Berkeley's CS120x course.

Computation and Storage

Least Squares Regression: learn a mapping $W$ from features to labels that minimizes the residual sum of squares:

\[\min_W \Vert XW-y\Vert_2^2\]

Closed-form solution (if the inverse exists):

\[W=(X^TX)^{-1}X^Ty\]

Consider the number of arithmetic operations (+, −, ×, /) to find the computational bottlenecks, and the number of values stored as 8-byte floats to find the storage bottlenecks:

Computation: $O(nd^2 + d^3)$ operations ($O(nd^2)$ to compute $X^TX$, $O(d^3)$ to invert it)

Storage: $O(nd + d^2)$ floats ($O(nd)$ for $X$, $O(d^2)$ for $X^TX$ and its inverse)
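
As a single-machine reference point, here is a minimal NumPy sketch of the closed-form solution; the data shapes and variable names are illustrative, not from the course lab.

import numpy as np

n, d = 10000, 10                       # many observations, few features
X = np.random.randn(n, d)              # feature matrix
y = np.random.randn(n)                 # labels

XtX = X.T.dot(X)                       # O(n d^2) operations, O(d^2) storage
w = np.linalg.solve(XtX, X.T.dot(y))   # O(d^3) operations for the solve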

Big n and Small d

Assuming $O(d^3)$ computation and $O(d^2)$ storage are feasible on a single machine, storing $X$ and computing $X^TX$ are the bottlenecks.

Can distribute storage and computation!

(Figure: distributed-ml1.png)

trainData.map(computeOuterProduct).reduce(sumAndInvert)
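
A hedged PySpark-style sketch of this pattern, assuming trainData is an RDD of points with .features and .label fields; the helper implementations below are my guesses at what computeOuterProduct and the reduce step would look like (here the reduce only sums, and the small $d \times d$ solve happens on the driver):

import numpy as np

def computeOuterProduct(lp):
    # Per-example contribution to X^T X (d x d) and X^T y (length d).
    x = np.asarray(lp.features)
    return np.outer(x, x), x * lp.label

def sumContributions(a, b):
    # Pairwise sum of the (X^T X, X^T y) contributions.
    return a[0] + b[0], a[1] + b[1]

XtX, Xty = trainData.map(computeOuterProduct).reduce(sumContributions)
w = np.linalg.solve(XtX, Xty)   # small d x d solve on the driver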

Big n and Big d

As before, storing $X$ and computing $X^TX$ are bottlenecks. Now, storing and operating on the $d \times d$ matrix $X^TX$ is also a bottleneck.

Can’t easily distribute!

1st Rule of thumb

Computation and storage should be linear (in n, d)

We need methods that are linear in time and space.

One idea: Exploit sparsity

Another idea: Use different algorithms

Gradient descent is an iterative algorithm that requires $O(nd)$ computation and $O(d)$ local storage per iteration.
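
For least squares this is the familiar update (standard form; the constant factor 2 from differentiating the squared loss is absorbed into the step size $\alpha_i$):

\[w_{i+1} = w_i - \alpha_i \sum_{j=1}^{n} \left(w_i^T x_j - y_j\right) x_j\]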

(Figure: distributed-ml4.png)

(Figure: distributed-ml5.png)

# Each iteration performs one distributed pass over the data: map computes the
# per-example gradient summands and sum() aggregates them on the driver.
for i in range(numIters):
    alpha_i = alpha / (n * np.sqrt(i + 1))      # decreasing step size
    gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
    w -= alpha_i * gradient
return w
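
gradientSummand is not defined in these notes; for least squares it would compute the per-example term $(w^T x - y)\,x$ from the update above. A minimal sketch under that assumption:

import numpy as np

def gradientSummand(w, lp):
    # Per-example gradient of the squared loss: (w . x - y) * x.
    x = np.asarray(lp.features)
    return (np.dot(w, x) - lp.label) * x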

Gradient Descent Summary:

Communication Principles

(Figure: distributed-ml6.png)

Access rates fall sharply with distance: reading from local memory is orders of magnitude faster than reading from local disk, which in turn is much faster than reading from another machine over the network.

2nd Rule of thumb

Perform parallel and in-memory computation

Persisting in memory reduces communication

Scale-up (powerful multicore machine)

Scale-out (distributed, e.g., cloud-based)
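
In Spark, keeping the training data in memory across iterations is a one-line change. A minimal sketch, assuming the training set is an existing RDD named train that an iterative algorithm (such as the gradient descent loop above) reuses:

# Cache the training RDD so each iteration reads it from memory instead of
# re-reading it from disk (or recomputing it) on every pass.
train.cache()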

3rd Rule of thumb

Minimize Network Communication

Minimize Network Communication - Stay Local

Example: Linear regression, big n and small d

(Figure: distributed-ml9.png)

Example: Linear regression, big n and big d

(Figure: distributed-ml10.png)

Example: Hyperparameter tuning for ridge regression with small n and small d
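
With small $n$ and small $d$ the full dataset fits on every worker, so communication can stay local: broadcast the data once and parallelize over the regularization grid instead of over the data. A sketch under those assumptions (trainRidge, sc, X, y, and the grid values are illustrative, not from the course lab):

import numpy as np

def trainRidge(X, y, lam):
    # Closed-form ridge regression solution for one regularization value.
    d = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + lam * np.eye(d), X.T.dot(y))

dataBC = sc.broadcast((X, y))           # ship the small dataset to workers once
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
models = (sc.parallelize(lambdas)
            .map(lambda lam: (lam, trainRidge(dataBC.value[0], dataBC.value[1], lam)))
            .collect())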

Example: Linear regression, big n and huge d

Minimize Network Communication - Reduce Iterations

Distributed iterative algorithms must both compute and communicate in each iteration.

Distributed Computing Properties

Idea: Design algorithms that compute more, communicate less

Extreme: Divide-and-conquer

w = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
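
localLinearRegression and combineLocalRegressionResults are not defined in the notes; one common divide-and-conquer instantiation (an assumption, not necessarily the course's) solves least squares on each partition and averages the local models, so only a single communication round is needed:

import numpy as np

def localLinearRegression(iterator):
    # Fit least squares using only this partition's examples.
    points = list(iterator)
    X = np.array([np.asarray(p.features) for p in points])
    y = np.array([p.label for p in points])
    w_local = np.linalg.lstsq(X, y, rcond=None)[0]
    yield (w_local, 1)

def combineLocalRegressionResults(a, b):
    # Keep a running (sum of local models, count) so we can average at the end.
    return (a[0] + b[0], a[1] + b[1])

w_sum, count = train.mapPartitions(localLinearRegression) \
                    .reduce(combineLocalRegressionResults)
w = w_sum / count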

# In contrast, full gradient descent performs one distributed map + sum
# (one round of communication) per iteration:
for i in range(numIters):
    alpha_i = alpha / (n * np.sqrt(i + 1))
    gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
    w -= alpha_i * gradient

Less extreme: Mini-batch

(Figure: distributed-ml11.png)

We can amortize latency!
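
One way to realize this (a sketch of the idea, not the course's exact algorithm; localSteps and numLocalSteps are hypothetical names): in each outer iteration every partition takes several local gradient steps on its own data, and only the resulting models cross the network, so the per-message latency is amortized over more computation.

import numpy as np

def localSteps(iterator, w0, alpha, numLocalSteps):
    # Take several gradient steps on this partition's data before communicating.
    points = list(iterator)
    w = w0.copy()
    for _ in range(numLocalSteps):
        grad = sum((np.dot(w, np.asarray(p.features)) - p.label) * np.asarray(p.features)
                   for p in points)
        w -= alpha * grad
    yield (w, 1)

for it in range(numIters):
    w_sum, count = (train
                    .mapPartitions(lambda part: localSteps(part, w, alpha, numLocalSteps))
                    .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    w = w_sum / count   # one communication round per outer iteration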


1st Rule of thumb

Computation and storage should be linear (in n, d)

2nd Rule of thumb

Perform parallel and in-memory computation

3rd Rule of thumb

Minimize Network Communication