Distributed ML
Contact me
- Blog -> https://cugtyt.github.io/blog/index
- Email -> cugtyt@qq.com
- GitHub -> Cugtyt@GitHub
Notes from Week 2 (Distributed ML) of Berkeley's CS120x course.
Computation and Storage
Least Squares Regression: learn a mapping $W$ from features to labels that minimizes the residual sum of squares:
\[\min_W \Vert XW-y\Vert_2^2\]
Closed-form solution (if the inverse exists):
\[W=(X^TX)^{-1}X^Ty\]
Consider the number of arithmetic operations (+, −, ×, /); the computational bottlenecks are:
- Matrix multiply of $X^TX$: $O(nd^2)$ operations
- Matrix inverse: $O(d^3)$ operations
Consider storing values as floats (8 bytes each); the storage bottlenecks are:
- $X^TX$ and its inverse: $O(d^2)$ floats
- $X$: $O(nd)$ floats
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
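For concreteness, a minimal single-machine NumPy sketch of the closed-form solution (toy sizes; assumes $X^TX$ is invertible):

import numpy as np

n, d = 10000, 10                              # many points, few features
X = np.random.randn(n, d)                     # feature matrix
y = np.random.randn(n)                        # labels

# Forming X^T X costs O(n d^2); solving the d x d system costs O(d^3)
w = np.linalg.solve(X.T.dot(X), X.T.dot(y))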
Big n and Small d
Assume $O(d^3)$ computation and $O(d^2)$ storage are feasible on a single machine; then storing $X$ and computing $X^TX$ are the bottlenecks.
Can distribute storage and computation!
- Store data points (rows of $X$) across machines
- Compute $X^TX$ as a sum of outer products of the rows of $X$
trainData.map(computeOuterProduct).reduce(sumAndInvert)
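A minimal PySpark sketch of this pattern, assuming a SparkContext `sc` is available (as in the course notebooks) and that `trainData` holds (features, label) pairs; the helper implementations below are illustrative, and the final solve plays the role of the "invert" step:

import numpy as np

def computeOuterProduct(point):
    x, y = point                               # x: length-d feature vector, y: label
    return (np.outer(x, x), x * y)             # local pieces of X^T X and X^T y

def sumPairs(a, b):
    return (a[0] + b[0], a[1] + b[1])          # elementwise sums across points/partitions

trainData = sc.parallelize([(np.random.randn(5), float(i % 2)) for i in range(100)])
XtX, Xty = trainData.map(computeOuterProduct).reduce(sumPairs)
w = np.linalg.solve(XtX, Xty)                  # final d x d solve on the driver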
Big n and Big d
As before, storing $X$ and computing $X^TX$ are bottlenecks. Now, storing and operating on $X^TX$ is also a bottleneck.
Can’t easily distribute!
1st Rule of thumb
Computation and storage should be linear (in n, d)
We need methods that are linear in time and space.
One idea: Exploit sparsity
- Explicit sparsity can provide orders-of-magnitude storage and computational gains (see the sketch after this list)
Sparse data is prevalent:
- Text processing: bag-of-words, n-grams
- Collaborative filtering: ratings matrix
- Graphs: adjacency matrix
- Categorical features: one-hot-encoding
- Genomics: SNPs, variant calling
- Latent sparsity assumptions can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning).
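A small SciPy sketch of the storage gain from explicit sparsity, using one-hot-encoded categorical features as an example (sizes are illustrative):

import numpy as np
from scipy.sparse import csr_matrix

n, k = 100000, 1000                            # n points, k categories
categories = np.random.randint(0, k, size=n)   # one category id per point

# Dense one-hot matrix: n * k floats (~0.8 GB at 8 bytes each)
# Sparse CSR matrix: one nonzero per row, so storage is O(n) instead of O(n k)
X = csr_matrix((np.ones(n), (np.arange(n), categories)), shape=(n, k))
print(X.nnz)                                   # 100000 stored values vs. 10^8 dense entries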
Another idea: Use different algorithms
Gradient descent is an iterative algorithm that requires $O(nd)$ computation and $O(d)$ local storage per iteration.
for i in range(numIters):
    alpha_i = alpha / (n * np.sqrt(i + 1))                          # decreasing step size
    gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()   # sum per-point contributions across the RDD
    w -= alpha_i * gradient
return w
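The summand used above is the per-point least-squares gradient contribution $(w^Tx_i - y_i)x_i$; a minimal sketch of such a helper, assuming each data point `lp` is a (features, label) pair (the name `gradientSummand` mirrors the snippet, not a library function):

import numpy as np

def gradientSummand(w, lp):
    x, y = lp                       # lp: (numpy feature vector, label)
    return (w.dot(x) - y) * x       # one point's contribution to the full gradient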
Gradient Descent Summary:
- Pros:
- Easily parallelized
- Cheap at each iteration
- Stochastic variants can make things even cheaper (see the sketch after this list)
- Cons:
- Slow convergence (especially compared with closed-form)
- Requires communication across nodes!
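For the stochastic-variants point, a hedged sketch reusing the names from the gradient-descent snippet above: sample a fraction of the data each iteration instead of touching all $n$ points (the fraction is illustrative):

import numpy as np

for i in range(numIters):
    alpha_i = alpha / (n * np.sqrt(i + 1))
    # use only ~10% of the points this iteration: cheaper per iteration, noisier gradient
    batch = train.sample(withReplacement=False, fraction=0.1, seed=i)
    gradient = batch.map(lambda lp: gradientSummand(w, lp)).sum()
    w -= alpha_i * gradient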
Communication Principles
Access rates fall sharply with distance:
- Parallelism makes computation fast
- Network makes communication slow
2nd Rule of thumb
Perform parallel and in-memory computation
Persisting in memory reduces communication
- Especially for iterative computation (gradient descent)
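In Spark this is the motivation for `cache()`; a minimal sketch, assuming a SparkContext `sc`, that keeps the training RDD in memory across gradient-descent iterations (toy data and step sizes):

import numpy as np

train = sc.parallelize([(np.random.randn(5), 1.0) for _ in range(1000)]).cache()
train.count()                       # materialize once; later passes read from memory

w = np.zeros(5)
for i in range(5):
    # each iteration reuses the cached partitions instead of recomputing/reloading them
    gradient = train.map(lambda lp: (w.dot(lp[0]) - lp[1]) * lp[0]).sum()
    w -= 0.1 / np.sqrt(i + 1) * gradient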
Scale-up (powerful multicore machine)
- No network communication
- Expensive hardware; eventually hits a wall
Scale-out (distributed, e.g., cloud-based)
- Need to deal with network communication
- Commodity hardware; scales to massive problems
3rd Rule of thumb
Minimize Network Communication
Minimize Network Communication - Stay Local
Example: Linear regression, big n and small d
- Solve via closed form (not iterative!)
- Communicate $O(d^2)$ intermediate data
Example: Linear regression, big n and big d
- Gradient descent: communicate the $d$-dimensional parameter vector each iteration
- $O(d)$ communication is OK for fairly large $d$
- Compute locally on data (Data Parallel)
Example: Hyperparameter tuning for ridge regression with small n and small d
- Data is small, so can communicate it
- The ‘model’ is a collection of regression models corresponding to different hyperparameters
- Train each model locally (Model Parallel)
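A minimal sketch of this model-parallel pattern, assuming a SparkContext `sc`, broadcasting the small dataset and training one scikit-learn Ridge model per hyperparameter value (the grid is illustrative):

import numpy as np
from sklearn.linear_model import Ridge

X = np.random.randn(200, 5)                     # small n, small d: fits on every worker
y = np.random.randn(200)
data_bc = sc.broadcast((X, y))                  # ship the whole dataset once

def trainOne(reg):
    Xb, yb = data_bc.value
    return (reg, Ridge(alpha=reg).fit(Xb, yb).coef_)

# one task per hyperparameter value; each model is trained locally, no per-iteration traffic
results = sc.parallelize([0.01, 0.1, 1.0, 10.0]).map(trainOne).collect()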
Example: Linear regression, big n and huge d
- Gradient descent
- $O(d)$ communication is slow with hundreds of millions of parameters
- Distribute data and model (Data and Model Parallel)
- Often rely on sparsity to reduce communication
Minimize Network Communication - Reduce Iterations
Distributed iterative algorithms must compute and communicate
- In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark, we strictly alternate between the two
Distributed Computing Properties
- Parallelism makes computation fast
- Network makes communication slow
Idea: Design algorithms that compute more, communicate less
- Do more computation at each iteration
- Reduce total number of iterations
Extreme: Divide-and-conquer
- Fully process each partition locally, communicate final result
- Single iteration; minimal communication
- Approximate results
w = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
Compare with gradient descent, which communicates a gradient on every iteration:
for i in range(numIters):
    alpha_i = alpha / (n * np.sqrt(i + 1))
    gradient = train.map(lambda lp: gradientSummand(w, lp)).sum()
    w -= alpha_i * gradient
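One plausible way to fill in the divide-and-conquer one-liner, assuming `train` is an RDD of (features, label) pairs and each partition holds enough points to solve its own least-squares problem, with the local solutions simply averaged (hence the approximation; the helper names mirror the one-liner above):

import numpy as np

def localLinearRegression(points):
    pts = list(points)
    X = np.array([x for x, _ in pts])
    y = np.array([lbl for _, lbl in pts])
    # exact closed-form solve, but only on this partition's data
    yield (np.linalg.solve(X.T.dot(X), X.T.dot(y)), 1)

def combineLocalRegressionResults(a, b):
    # accumulate (sum of local solutions, partition count) so we can average at the end
    return (a[0] + b[0], a[1] + b[1])

w_sum, numParts = train.mapPartitions(localLinearRegression).reduce(combineLocalRegressionResults)
w = w_sum / numParts                           # averaged local models: one round of communication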
Less extreme: Mini-batch
- Do more work locally than gradient descent before communicating
- Exact solution, but diminishing returns with larger batch sizes
We can amortize latency!
- Send larger messages
- Batch their communication
- E.g., Train multiple models together
1st Rule of thumb
Computation and storage should be linear (in n, d)
2nd Rule of thumb
Perform parallel and in-memory computation
3rd Rule of thumb
Minimize Network Communication