Repeatable Machine Learning by Design With SambaNova Systems DataScale
If you run the same machine learning application twice with the same inputs, initializations, and random seeds, you often will not get the same result. While randomness or stochasticity in machine learning applications is a desired quality, this non-repeatability is not. For example, in mission-critical applications such as autonomous driving, non-repeatable behavior can have disastrous implications for model explainability, especially if an audit is required to analyze certain important decisions post hoc. As a result, there is, unsurprisingly, a repeatability crisis occurring in many machine learning domains. At SambaNova Systems, we believe that your architecture should not be subtly adding to this problem in the pursuit of peak performance; hardware architectures should never trade off repeatable computation for performance.
In the following sections, we define repeatability and explain why the problem arises. After that, we dive into single- and multi-socket cases that highlight the repeatability problem. In each case we show how the SambaNova Systems DataScale SN10-8R with the world’s first Reconfigurable Dataflow Unit (RDU) provides repeatability as a consequence of its design.
Repeatability: A repeatable machine learning program is one where the exact same behavior is observed when user-controlled variables are fixed (same random seeds, same initializations, same inputs, same machine). Think of this as test-retest reliability. This is different from stochasticity or randomness—which are important features that an RDU also enables in a repeatable manner for machine learning applications.
The Problem: Floating-point arithmetic is not associative; that is, the order in which you add floating-point numbers can change the final output.
(A + B) + C ≠ A + (B + C)
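The inequality above can be checked directly. The following is a minimal Python sketch using ordinary double-precision floats; the specific values are illustrative choices, not taken from any SambaNova workload.

```python
# Floating-point addition is not associative: regrouping changes where
# rounding happens, so the two groupings can produce different results.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # a + b is rounded first, then c is added
right = a + (b + c)   # b + c is rounded first, then a is added

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# The effect is larger when magnitudes differ widely: the small term is
# absorbed entirely when it is added to the large one first.
big_left = (1e17 + -1e17) + 1.0   # cancellation first, then + 1.0
big_right = 1e17 + (-1e17 + 1.0)  # 1.0 is lost when added to -1e17

print(big_left, big_right)        # 1.0 0.0
```

Any parallel reduction that does not fix the order of its additions is exposed to exactly this kind of difference.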
This fundamental property of floating-point arithmetic often leads to non-repeatability, especially when complex parallelization primitives or out-of-order execution are inherent in a hardware architecture. Although there are software solutions to ensure repeatability on traditional hardware architectures, users must comb through large amounts of documentation to figure out which operations might cause the problem, and they may incur a slight performance degradation to ensure repeatability. Even worse, the problem is exposed in different ways depending on whether one is running single- or multi-socket machine learning applications. In the next sections, we highlight the problem and the RDU’s solution in each case.
Dataflow is the Key to Single-Socket Repeatability on an RDU
Due to complex parallelization primitives and out-of-order execution, high-performance kernels are often non-repeatable on GPU-based architectures. On an RDU, computation is pure dataflow and, as such, our kernels are always repeatable. To highlight the repeatability of an RDU-based architecture, we compare the behavior of an RDU and a GPU on a popular machine learning kernel seen in many customer applications. At the core of this kernel is the index add operator, a fundamental indexing operation used in many popular models. In Fig. 1, we show the repeatability measured when running this tensor operation five times with the same input on both a GPU and an RDU. To measure repeatability, we use the Frobenius norm (a popular tensor metric) of the input gradients as a proxy and compare this norm across runs. As Fig. 1 shows, on the RDU’s dataflow architecture, computational parallelism is always handled in a repeatable way: there is zero difference among the runs, and repeatable results are achieved on every iteration. This is one of many examples highlighting that the dataflow architecture employed by an RDU is repeatable, while traditional hardware architectures are not necessarily so.
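The measurement idea can be sketched in plain Python. This is a simulation only, not GPU or RDU code: the helper names (`index_add`, `frobenius_norm`, `run`) and the input sizes are hypothetical, and shuffling the update order is an assumed stand-in for a scheduler that applies colliding accumulations in an unspecified order.

```python
import math
import random

def index_add(size, index, source, order):
    """Accumulate source[i] into out[index[i]], visiting updates in `order`."""
    out = [0.0] * size
    for i in order:
        out[index[i]] += source[i]
    return out

def frobenius_norm(values):
    """Square root of the sum of squares of all entries."""
    return math.sqrt(math.fsum(v * v for v in values))

# Fixed inputs (same seed on every run): many updates collide on a few slots,
# and the values span a wide range of magnitudes.
data_rng = random.Random(0)
n_updates, n_slots = 10_000, 8
index = [data_rng.randrange(n_slots) for _ in range(n_updates)]
source = [data_rng.uniform(-1.0, 1.0) * 10.0 ** data_rng.randint(-8, 8)
          for _ in range(n_updates)]

def run(schedule_seed=None):
    """One run; a schedule seed shuffles the order in which updates land."""
    order = list(range(n_updates))
    if schedule_seed is not None:
        random.Random(schedule_seed).shuffle(order)
    return frobenius_norm(index_add(n_slots, index, source, order))

fixed_order_norms = [run() for _ in range(5)]
shuffled_norms = [run(schedule_seed=s) for s in range(5)]

# A fixed accumulation order gives a spread of exactly zero across runs.
print("fixed order spread:   ", max(fixed_order_norms) - min(fixed_order_norms))
# A varying order typically gives a small non-zero spread.
print("shuffled order spread:", max(shuffled_norms) - min(shuffled_norms))
```

The inputs are identical in every run; only the accumulation order changes, which is enough to move the norm in its low-order digits.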
Fig. 1. Repeatability in single-socket computation between a V100 GPU running with CUDA 10.1 and PyTorch 1.6.0 and an RDU with the latest release of SambaFlow.
RDU Model Parallelism for Repeatable Multi-Socket Training
While data parallelism remains the most popular form of multi-socket parallelism for machine learning training, model parallelism is emerging as a popular alternative. Although model parallelism has many inherent benefits, one that is rarely discussed is that it is naturally more repeatable than data parallelism. This is because data parallelism must synchronize and add gradients across multiple sockets, and the order of those additions can vary from run to run, which can create non-repeatable results (true in popular frameworks such as Horovod). Like traditional architectures, RDUs seamlessly support both data and model parallelism. Unlike traditional architectures, model parallelism was always a core tenet of the RDU’s design.
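The gradient-synchronization issue can be illustrated with a small simulation. This is a sketch under stated assumptions, not the behavior of Horovod or any specific collective library: `all_reduce_sum` is a hypothetical helper, the eight "workers" are plain Python lists, and the reversed arrival order is an assumed example of a run in which contributions are combined differently.

```python
import random

def all_reduce_sum(worker_grads, arrival_order):
    """Sum per-worker gradient vectors element-wise, in the given order."""
    total = [0.0] * len(worker_grads[0])
    for w in arrival_order:
        total = [t + g for t, g in zip(total, worker_grads[w])]
    return total

# Eight simulated sockets, each holding a local gradient for 1,000 parameters.
rng = random.Random(0)
n_workers, n_params = 8, 1000
worker_grads = [[rng.gauss(0.0, 1.0) for _ in range(n_params)]
                for _ in range(n_workers)]

canonical = list(range(n_workers))
reference = all_reduce_sum(worker_grads, canonical)

# Same order on a second run: the result is bitwise identical.
repeat = all_reduce_sum(worker_grads, canonical)
print("same order identical:", repeat == reference)

# Different order: mathematically the same sum, but individual entries
# may round differently.
reordered = all_reduce_sum(worker_grads, list(reversed(canonical)))
mismatches = sum(1 for x, y in zip(reference, reordered) if x != y)
print("entries that differ after reordering:", mismatches, "of", n_params)
```

Each mismatch is tiny, but the updated weights feed the next training step, so differences can compound over many iterations.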
Fig. 2 shows what can happen due to this non-repeatability of data parallel training. We plot the meaning of the top 200 words in a BERT language model after fine-tuning the model using 8-socket training. Under the exact same conditions (same data, same machine, etc.), the meaning of these words noticeably changes (or drifts) between runs in the data parallel case, a very undesirable behavior for a production model. In contrast, RDU model parallel training retains fidelity between each word’s meaning across runs, which is what one would expect and desire. This is not only a desirable behavior for our customers, but a necessary one for those running mission-critical workloads.
Data Parallel vs. Model Parallel Repeatability
Fig. 2. Comparison of word embedding training repeatability (semantic drift) for (a) 8-socket GPU and (b) 8-socket RDU
Based on our own first-hand experience working in collaboration with partners in our research laboratory, we made repeatability a core feature when designing the SambaNova Systems Reconfigurable Dataflow Unit (RDU) architecture. We believe that repeatability and performant hardware are both paramount for organizations focused on fast-paced innovation. When using RDUs, users therefore do not need to worry about repeatability issues: not only are RDUs more performant than many traditional architectures, but repeatability is a core tenet of our design that can be achieved without compromise.