Surpassing State-of-the-Art Accuracy in Recommendation Models
Recommender systems are a ubiquitous part of many common and broadly used internet services. They are utilized in retail and e-commerce applications to cross-sell and up-sell products and services. Online consumer services for ridesharing, peer reviews, and banking rely heavily on recommendation models to deliver fast and efficient customer experiences. Everyday examples of recommender systems offering users hit-or-miss advice on social media, news sites, and elsewhere are abundant. That is because richer, more meaningful recommendations require a recommendation system to incorporate many more attributes than just a user’s browsing or purchase history. This seems simple and intuitive enough. However, real-world implementations with legacy technology components can diminish efforts to achieve state-of-the-art accuracy.
Recommendation Tasks Place Huge Demands on Both Memory and Computation
The backbone that enables recommendation models to encode such massive volumes of data is the embedding. Embedding tables are large numerical tables that contain encodings of every feature in the data: every user, product, region, etc. It’s well known that larger embedding tables improve model quality by making the model more expressive and accurate. In order to fully capture all of the information in their data, SambaNova’s industry partners routinely use embeddings that are hundreds of gigabytes in size, and often terabytes.
These embeddings are attached to deep neural networks which perform a large number of calculations in order to generate the final recommendation result.
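To make the relationship between embedding tables and the network on top of them concrete, here is a minimal sketch in plain Python. It is not DLRM and not SambaNova code: the feature names, table sizes, `EMBEDDING_DIM`, and the single dense layer are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

random.seed(0)

EMBEDDING_DIM = 4  # hypothetical and tiny, for illustration only


def make_table(num_rows, dim):
    """One embedding table: a vector of learned values per category value."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


# Hypothetical categorical features and their cardinalities.
FEATURES = ("user", "product", "region")
tables = {
    "user": make_table(1000, EMBEDDING_DIM),
    "product": make_table(500, EMBEDDING_DIM),
    "region": make_table(20, EMBEDDING_DIM),
}


def embed(sample):
    """Look up one row per categorical feature and concatenate the vectors."""
    vector = []
    for name in FEATURES:
        vector.extend(tables[name][sample[name]])
    return vector


# A single dense layer standing in for the deep network attached to the embeddings.
weights = [random.uniform(-0.5, 0.5) for _ in range(len(FEATURES) * EMBEDDING_DIM)]
bias = 0.0


def predict_click_probability(sample):
    """Embed the sample, apply the dense layer, and squash to a probability."""
    features = embed(sample)
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


probability = predict_click_probability({"user": 42, "product": 7, "region": 3})
print(f"predicted click probability: {probability:.4f}")
```

Each lookup touches only one row per table, so memory capacity rather than arithmetic tends to be the limiting factor for the tables, while the dense layers account for most of the computation.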
As a demonstration, we used the SambaNova DataScale system, a complete integrated software and hardware system, to train the Deep Learning Recommendation Model (DLRM) on the Criteo Terabyte Clicklogs dataset. This is the MLPerf standard benchmark for recommendation, where the performance metric is AUC (area under the ROC curve) on a test set.
NOTE: Despite containing ~1TB of data and ~100GB of embedding features, this dataset still does not represent a real large-scale production workload. Deployed systems are at least 5x more demanding in terms of both data and embedding sizes. But rest assured: the SambaNova Systems Reconfigurable Dataflow Unit (RDU) and the SambaNova DataScale system are built to scale and are well equipped to tackle those gigantic use cases too.
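For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen clicked example is scored higher than a randomly chosen non-clicked one. The sketch below computes it directly from that definition in plain Python. The function name and the toy labels and scores are hypothetical, and the pairwise loop is only suitable for small illustrations, not for a dataset of this size.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = click, 0 = no click.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```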
Unleashing the Power of Embeddings
It’s known that increasing embedding dimensions improves recommender model accuracy at the cost of model size. Many recent studies have been devoted to sharding the model or reducing the embedding dimensions to fit in GPU memory. SambaNova Systems researchers have pioneered superior methods for solving this problem via vertical engineering through our integrated software and hardware stack. We demonstrate this by exceeding state-of-the-art accuracy on the DLRM model by significantly increasing the embedding dimensionality. In an ablation study where everything else is held constant, we find that the model’s accuracy strictly increases with embedding dimensions when trained on a single SambaNova Systems RDU. Meanwhile, on a single GPU, model execution attempts result in catastrophic failure.
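The trade-off between embedding dimension and model size is simple arithmetic: the footprint of the tables is rows times dimension times bytes per value. The sketch below works this through. The row count is a hypothetical figure picked for illustration, not the actual cardinality of the Criteo dataset, and 4 bytes per value assumes 32-bit floats.

```python
def embedding_bytes(total_rows, dim, bytes_per_value=4):
    """Memory needed to hold embedding tables with total_rows rows of width dim."""
    return total_rows * dim * bytes_per_value


TOTAL_ROWS = 200_000_000  # hypothetical total across all tables

for dim in (16, 32, 64, 128, 256):
    gigabytes = embedding_bytes(TOTAL_ROWS, dim) / 1e9
    print(f"dim={dim:>3}: {gigabytes:8.1f} GB")
```

Because the footprint grows linearly with the dimension, every doubling of the embedding dimension doubles the memory the tables require, which is why a fixed device memory budget caps the dimensions that can be explored.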
Fig 1: Effects of embedding dimensions on a single RDU and a single GPU
Exploring New Batch Sizes and Breaking the GPU Mold
Popular training techniques place a large focus on increasing mini-batch size to saturate GPU computation. For example, NVIDIA’s demo implementation of DLRM uses batch sizes of 32768 and higher.
From a statistical standpoint, this isn’t always the preferred decision. As prior studies have shown, decreasing the batch size can actually have strong benefits, helping a model avoid sharp minima so it can generalize more effectively. When training DLRM on the SambaNova Systems RDU, we observed noticeable improvements in validation performance when decreasing the batch size.
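One way to see what changes with batch size: for a fixed amount of data, a smaller batch means more parameter updates per pass, each computed from a noisier gradient estimate. The sketch below only counts the updates. The sample count and the smaller batch size are hypothetical, and 32768 is the batch size mentioned above.

```python
def updates_per_epoch(num_samples, batch_size):
    """Number of optimizer steps in one pass over the data (last batch may be partial)."""
    return -(-num_samples // batch_size)  # integer ceiling division


NUM_SAMPLES = 1_000_000  # hypothetical dataset size

for batch_size in (32768, 2048):
    steps = updates_per_epoch(NUM_SAMPLES, batch_size)
    print(f"batch size {batch_size:>5}: {steps} updates per epoch")
```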
Fig 2: Enhanced RDU performance with batch size reduction
In reality, machine learning researchers and engineers choose these giant, suboptimal batch sizes because their current infrastructure leaves them no alternative. The GPU’s kernel-oriented execution suffers significantly when batch size decreases. On the other hand, with the SambaNova Systems RDU’s Dataflow architecture and intelligent software stack, system resources can still be fully utilized and achieve strong throughput regardless of batch size.
Fig 3: Negligible throughput degradation on RDU compared to GPU with smaller batch size
A New State of the Art
By combining our findings from above, we can use the SambaNova Systems RDU to train a new variant of DLRM that achieves a validation AUC of 0.8046 on the Criteo Terabyte dataset. In comparison, the best AUC reported by NVIDIA in their MLPerf submission is 0.8027. This unique large-embedding, small-batch model would be impossible to run on a GPU, and impractical to run on a CPU.
Fig 4: RDU exceeds MLPerf and GPU thresholds when training a new DLRM variant
In addition to having a noticeably higher peak AUC, the new and improved DLRM also converges much faster.
Powering Next Generation of Recommender Models
SambaNova Systems’ robust yet performant RDU technology enables machine learning engineers to explore an entirely new world of models, unlocking results that surpass the current state of the art. When applied to business-critical recommender models, this leads to significant enhancements in business outcomes and huge boosts in revenue. In Tencent’s words, “The reason we care about small amount AUC increase is that in several real-world applications we run internally, even 0.1% increase in AUC will have a 5x amplification (0.5% increase) when transferred to final CTR”.
Breakthrough Efficiency in NLP Model Deployment
Throughout their lifecycles, modern industrial NLP models follow a cadence. They start from one-time task-agnostic pre-training and then go through task-specific training on quickly changing user data. These periodically updated models are eventually deployed to serve massive online inference requests from applications.
A current active research trend is deploying state-of-the-art NLP models, like BERT, for online inference. As models grow larger each year, there is growing debate on how to deploy these models in real-time pipelines. To enable practical deployment, various techniques have been developed to distill large models down to compact variants. In applications such as digital assistants and search engines, these compact models are the key to attaining low-latency, high-accuracy models that satisfy service level requirements.
SambaNova Systems provides a solution for exploring and deploying these compact models—from a single SambaNova Systems Reconfigurable Dataflow Unit (RDU) scale to multiple SambaNova DataScale systems scale—delivering unprecedented advantages over conventional accelerators for low-latency, high-accuracy online inference.
The Proven Power of Dataflow Execution on RDU
The latency of compact models on a GPU is fundamentally limited by the GPU’s kernel-based execution model. For online inference with batch size 1, the overhead of context switching and off-chip weight memory access for operation kernels can dominate latency on traditional architectures. The SambaNova RDU is built on the SambaNova Systems Reconfigurable Dataflow Architecture (RDA) to remove this barrier. Specifically, on a recently proposed compact BERT model, TinyBERT, the RDU can attain a 5.8X latency speedup over a V100 GPU for MNLI, a popular text classification task.
Fig 1: Latency comparison for online inference
In applications such as a digital assistant or a search engine, the input data are natural language tokens with short sequence length, e.g., smartphone assistant queries such as “What is the weather in San Francisco?”. For these types of scenarios, reduced sequence length typically has a negligible impact on the accuracy attained by compact models. This is another characteristic that is deeply coupled with the latency advantage of RDUs. While a GPU’s latency saturates with reduced sequence length for compact models, the RDU’s latency improves with reduced sequence length.
As shown in Figure 2, the TinyBERT model can match state-of-the-art model accuracies across sequence lengths from 64 to 256 on the MNLI benchmark task that we use as a proxy. In Figure 3, we can see that the GPU demonstrates the same latency across sequence lengths. However, the speedup of RDU over GPU is boosted to 8.7X at reduced sequence length of 64.
Fig 2: RDU and GPU model accuracy for different sequence lengths
Fig 3: Bar chart for RDU and GPU latency for different sequence lengths
Amplifying Accuracy With SambaNova Systems DataScale
Our dataflow-optimized chip demonstrates unprecedented capability for low-latency online inference on compact models. Building on these capabilities, our research labs have also shown that a full SambaNova DataScale system (8 sockets) can attain bleeding-edge accuracy while performing low-latency inference on compact NLP models.
A study from the SambaNova Systems research lab shows that majority voting across multiple model instances can significantly boost the accuracy attained by TinyBERT (Fig. 4). The SambaNova DataScale system is designed to exploit these accuracy gains efficiently. We show that we can deploy multiple TinyBERT models onto all eight sockets of the SambaNova DataScale system. As shown in Fig 5, when ensembling TinyBERT models, the classification accuracy is boosted by 0.4% at negligible cost in latency compared to a single TinyBERT model on an RDU.
Fig 4. Model accuracy with different numbers of experts for ensemble
Fig 5. Comparison of latency for single TinyBERT on one RDU and 8 experts on 8-socket systems
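Majority voting itself is straightforward. The sketch below shows one plausible way to combine the class predictions of several model instances in plain Python. The function names and the toy predictions are hypothetical, and the tie-breaking rule is an assumption, since the study's exact procedure is not described here. The three labels are the standard MNLI classes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most experts.

    Ties go to the label that appears first in the input (an assumed rule).
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(per_expert_predictions):
    """Combine predictions shaped [expert][example] into one label per example."""
    return [majority_vote(votes) for votes in zip(*per_expert_predictions)]


# Hypothetical outputs of 8 experts (one per socket) on 2 input examples.
expert_outputs = [
    ["entailment", "neutral"],
    ["entailment", "contradiction"],
    ["neutral", "contradiction"],
    ["entailment", "contradiction"],
    ["entailment", "neutral"],
    ["contradiction", "contradiction"],
    ["entailment", "contradiction"],
    ["neutral", "neutral"],
]

print(ensemble_predict(expert_outputs))  # ['entailment', 'contradiction']
```

Because each expert runs independently and only the final labels are combined, the voting step adds very little work on top of a single model's inference.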
The compact BERT model is just one important case where our SambaNova Systems DataScale provides a tailored solution for low-latency, high-accuracy online inference.
Pushing Computer Vision Boundaries Beyond 4K
In machine learning image processing and analysis, resolution is everything. Higher resolution enables a more detailed, meaningful analysis that results in greater understanding: a high-resolution image contains more information and detail than a low-resolution image of the very same subject.
Today, higher-resolution image processing requires significant computational capabilities. So much so, in fact, that training models to use these high-resolution images has rendered current state-of-the-art technologies unusable.
When it comes to high-resolution processing, legacy architecture gridlock is holding back research and technology advances across numerous use cases, including in areas such as autonomous driving, oil and gas exploration, medical imaging, anti-viral research, astronomy, and more.
Surpass the Limits of the GPU
SambaNova Systems has been working with industry partners to develop an optimized solution for training computer vision models at ever higher resolutions without compromising accuracy. We take a “clean sheet” complete systems approach to enable native support for high-resolution images. Co-designing across our complete stack of software and hardware provides freedom from legacy GPU architecture constraints and legacy spatial partitioning methods.
Adding More GPUs Isn’t the Answer
If you consider images in the context of AI/ML training data, the richer and more expansive your training information (i.e., images), the more accurate your results can be.
Using a single GPU to train high-resolution computer vision models predictably results in “Out of Memory” errors. On the other hand, clustering multiple GPUs to aggregate their memory brings all the challenges of splitting the computational workflow across each individual GPU in the cluster.
And this means not merely clustering a few GPUs in a single system, but aggregating hundreds, if not thousands, of GPU devices. In addition, conventional data parallel techniques that slice the input image into independent tiles deliver less accurate results than training on the original image.
Train Large Computer Vision Models with High-Resolution Images
Massive Data: A single SambaNova DataScale™ system, with petaflops of performance and terabytes of memory, is built on a Dataflow architecture. This co-design of software and hardware enables high-performance processing of a range of complex structures such as high-resolution images, pushing computer vision boundaries far beyond 4K.
SambaFlow™ Software: The SambaFlow software stack transforms deep learning operations to work seamlessly on tiled data. Tiling of input images and intermediate tensors, along with convolution overlap handling, is fully automated. The results are equivalent to the non-tiled version and require no changes to the application or programming model.
Robust Architecture: SambaNova’s Reconfigurable Dataflow Architecture is critical for efficient processing of input image tiles and is fully materialized in device memory, unlike with non-Dataflow devices, such as GPUs.
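To illustrate what convolution overlap handling means, the sketch below splits an image into horizontal strips, gives each strip the extra rows the kernel needs from its neighbor, and checks that the stitched result equals the non-tiled convolution. This is a generic plain-Python illustration of the idea, not SambaFlow's implementation. The function names, the toy image, and the kernel are all hypothetical.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on lists of lists (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv2d_tiled(image, kernel, tile_rows):
    """Compute the same result strip by strip, tile_rows output rows at a time.

    Each strip reads kh - 1 extra input rows past its own output rows: the overlap.
    """
    kh = len(kernel)
    out_h = len(image) - kh + 1
    output = []
    for start in range(0, out_h, tile_rows):
        stop = min(start + tile_rows, out_h)
        strip = image[start : stop + kh - 1]  # includes the overlap rows
        output.extend(conv2d_valid(strip, kernel))
    return output


# Hypothetical 8x6 integer image and a 3x3 edge-detection kernel.
image = [[(r * 7 + c * 3) % 11 for c in range(6)] for r in range(8)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

assert conv2d_tiled(image, kernel, tile_rows=2) == conv2d_valid(image, kernel)
print("tiled result matches the non-tiled result")
```

Without the overlap rows, each strip would lose output rows at its boundary, which is one reason independent tiles do not reproduce the result of processing the original image.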
Re-Think What’s Possible
The resulting solution is unlimited by capacity and is capable of processing images of any size on a single DataScale system. End users then have the option of scaling up additional DataScale compute resources to further reduce training times while maintaining high levels of utilization and accuracy.
The high-resolution image processing breakthroughs achieved on SambaNova DataScale allow organizations to cut years of development time, significantly simplify architecture, and ease programmability. All this while yielding state-of-the-art results and capabilities.
A New State of the Art in NLP: Beyond GPUs
As Natural Language Processing (NLP) models grow ever larger, GPU performance and capability degrade at an exponential rate. We’ve been talking to a number of organizations in a range of industries that need higher quality language processing but are constrained by today’s solutions.
Groundbreaking Results, Validated in Our Research Labs
SambaNova has been working closely with many organizations over the past few months and has established a new state of the art in NLP. This advancement in NLP deep learning is illustrated by a GPU-crushing, world record performance result achieved on SambaNova Systems’ Dataflow-optimized system. We used a new method to train multi-billion parameter models that we call ONE (Optimized Neural network Execution). This result highlights orders-of-magnitude performance and efficiency improvements, achieved by using significantly fewer, more powerful systems compared to existing solutions.
Break Free of GPU Handcuffs
SambaNova Systems’ Reconfigurable Dataflow Architecture™ (RDA) enables massive models that previously required 1,000+ GPUs to run on a single system, while utilizing the same programming model as on a single SambaNova Systems Reconfigurable Dataflow Unit™ (RDU).
SambaNova RDA is designed to efficiently execute a broad range of applications. RDA eliminates the deficiencies caused by the instruction sets that bottleneck conventional hardware today.
Run Large Model Architectures with a Single SambaNova Systems DataScale™ System
With GPU-based systems, developers have been forced to do complicated cluster programming for multiple racks of systems and to manually program data parallelism and workload orchestration.
A single SambaNova DataScale System with petaflops of performance and terabytes of memory ran the 100-billion parameter ONE model with ease and efficiency, and with plenty of usable headroom. Based on our preliminary work and the results we achieved, we believe running a trillion-parameter model is quite conceivable.
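A quick back-of-the-envelope calculation shows why terabytes of memory matter at this scale. The sketch below estimates the memory needed just to hold 100 billion parameters. The numeric precisions and the Adam-style optimizer layout are assumptions made for illustration; the precision and optimizer actually used for the ONE model are not stated here.

```python
def memory_gb(num_parameters, bytes_per_parameter):
    """Memory in gigabytes for num_parameters values of the given width."""
    return num_parameters * bytes_per_parameter / 1e9


NUM_PARAMETERS = 100_000_000_000  # 100 billion, as in the text

# Weights only, under two assumed precisions.
print(f"16-bit weights: {memory_gb(NUM_PARAMETERS, 2):,.0f} GB")
print(f"32-bit weights: {memory_gb(NUM_PARAMETERS, 4):,.0f} GB")

# Assumed training state: 32-bit weights, gradients, and two Adam moment
# buffers, i.e. four 4-byte values per parameter. Activations are excluded.
print(f"32-bit training state: {memory_gb(NUM_PARAMETERS, 16):,.0f} GB")
```

Under these assumptions the training state alone exceeds a terabyte, far more than the memory of a single GPU, which is consistent with the cluster-scale deployments described above.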
The proliferation of Transformer-based NLP models continues to stress the boundaries of GPU utility. Researchers are continuing to develop bigger models, and as a result the stress fractures on GPU-based deployments are also getting bigger. By maintaining the same simple programming model from one to many RDUs, organizations of all sizes can now run big models with ease and simplicity.
The sophistication of SambaNova Systems’ SambaFlow™ software stack paired with our Dataflow-optimized hardware eliminates overhead and maximizes performance to yield unprecedented results and new capabilities.
No Boundaries, Only New Possibilities for NLP
Three trends have emerged in NLP that are pushing infrastructure requirements far beyond the capabilities of current GPU architecture. These trends, below, highlight attributes that enhance SambaNova Systems DataScale’s ability to deliver world record throughput performance and unlock capabilities that were previously unattainable.
Kunle Olukotun, one of SambaNova Systems’ esteemed co-founders and the company’s chief technologist, describes our systems best: “SambaNova engineered a purpose-built Reconfigurable Dataflow Architecture that expands the horizons of capability for the future of machine learning. Users, developers, and applications are now liberated from the constraints of legacy architectures.”