Introducing SambaNova Systems DataScale: A New Era of Computing

With each generation, we have pushed the limits of innovation and discovery to make the incredible happen with technology. Some technology is truly revolutionary and paves the way for transformations we have yet to imagine.

On a global scale, the innovations materializing today from machine learning and AI will change the way we work and live forever. And within the tech sector, the need to pioneer these innovations has forced the entire industry to re-examine how we design next-generation infrastructure for complex machine learning workloads that require more than just transactional processing and raw performance.

When SambaNova Systems first set out, we had one goal: to make AI accessible to organizations of all sizes across all industries. And we are delivering on that promise. Today, with our technology, we are on the brink of one of the biggest transformations in computing since the advent of the Internet.

We are proud and excited to introduce the world’s next-generation computing infrastructure—SambaNova Systems DataScale™. DataScale is ushering in a new era of computing and is giving us a clearer view of what the future of computing will look like.

What makes DataScale so special?

Unlike conventional hardware architectures, which present a fixed set of instructions for developers to piece together, software-defined hardware enables developers to think from a software-first perspective. This empowerment results in orders-of-magnitude improvements in efficiency and unlocks greater compute power to meet the rigorous demands of AI application development.

The SambaNova Systems Reconfigurable Dataflow Architecture™ (RDA) is the answer to the industry’s needs for a software-first approach and is the blueprint for DataScale. RDA is a spatially reconfigurable architecture designed to efficiently execute a broad range of AI applications and models of all sizes and forms.

While other AI infrastructure companies focus on just one technology component, the chip, DataScale is a complete, integrated software and hardware systems platform optimized for dataflow from algorithms to silicon.

With DataScale, rather than being constrained by the limitations of traditional hardware infrastructure, developers can focus on discovering new opportunities to innovate and accomplish what they once thought impossible.

Powered by SambaNova’s Reconfigurable Dataflow Unit™ (RDU), a next-generation processor built from the ground up to offer native dataflow processing, DataScale helps to future-proof your data center.

The reconfigurable, flexible characteristics of RDUs, together with the high-speed fabric that connects them, mean maximum system throughput and performance no matter what is thrown at them. Most importantly, they mean a stack that can be optimized to meet the changing AI demands of the near future.

Our incredible customers are already doing great work with DataScale. Lawrence Livermore National Laboratory, for example, is coupling DataScale into its Corona supercomputing system, which is being used for COVID-19 drug discovery. And Los Alamos National Laboratory uses DataScale in modeling extremely complex quantum chemistry.

Achieving world record-breaking performance metrics

DataScale achieves record-breaking performance metrics, from the system level to multi-rack scale, when compared to the latest and most advanced platforms across four key areas: performance, accuracy, scale, and ease of use. I invite you to read the press release for details.

With our launch, we are also introducing an industry-first—a subscription-based offering called Dataflow-as-a-Service (DaaS). DaaS is available in three monthly subscription types customized for natural language processing, high-res computer vision, or recommender systems. They are accessible in both cost and configuration, and deliver on SambaNova’s promise to make AI more accessible to organizations of all sizes across all industries.

In addition, we are also granting easy, powerful cloud access to both academics and researchers. SambaNova AI Cloud Platform for universities and research laboratories gives users access to all the power of DataScale without the physical hardware. We’re accepting research proposals now.

At SambaNova, we are proud of what our customers are accomplishing with DataScale. And we look forward to what we will continue to accomplish together as organizations all over the world are empowered by a new, better way of computing.

Accelerating the Modern Machine Learning Workhorse: Recommendation Inference

Updated January 31, 2021

Inference for recommender systems is perhaps the single most widespread machine learning workload in the world. Here, we demonstrate that using the SambaNova DataScale system, we can perform recommendation inference over 20x faster than the leading GPU on an industry-standard benchmark model. Our software is rapidly evolving to deliver continuous improvements, so watch this space; we will keep you updated.

The impact of this is massive from both technology and business standpoints. According to Facebook, 79% of AI inference cycles in their production data centers are devoted to recommendation (source). These engines serve as the primary drivers for user engagement and profit across numerous other Fortune 100 companies, with 35% of Amazon purchases and 75% of watched Netflix shows coming from recommendations (source).

Record-breaking Recommendation Speed

To measure the performance of the SambaNova DataScale system, we use the recommendation model from the MLPerf benchmark, the authoritative benchmark for machine learning researchers and practitioners. Its task for measuring recommendation performance uses the DLRM model on the Terabyte Clickthrough dataset. Since Nvidia has not reported A100 numbers, we measure an Nvidia-optimized version of this model (source) running on a single A100, deployed using a Triton Server (version 20.06) with FP16 precision. We run this at a variety of batch sizes because this simulates a realistic deployed inference scenario. For V100 numbers, we use the FP16 performance results reported by Nvidia (source).

Low batch sizes are often needed in deployment scenarios, as queries are streamed in real time and latency is critical. At these low batch sizes, the benefit of the dataflow architecture is clear, and the SambaNova DataScale system commands 20x faster performance than a single A100 at batch size 1.

While online inference at batch size 1 is a common use case in deployed systems, customers also often want to batch some of their data to improve the overall throughput of the system. To demonstrate the benefits of the SambaNova DataScale system, we also show the same DLRM benchmark at a batch size of 4k. At this higher batch size, DataScale achieves over 2x faster performance than an A100 for both throughput and latency.

The Combined Solution: Training and Inference Together
While many of these measurements are geared towards MLPerf’s inference task, the DataScale system excels at both inference and training. By retraining the same DLRM model from scratch, and exploring variations that aren’t possible at all on GPU hardware, the RDU handily exceeds the state of the art. Check out this article to find out more.

Beyond the Benchmark: Recommendation Models in Production
The MLPerf DLRM benchmark simulates a realistic recommendation task, but it cannot capture the scale of a real deployed workload. In an analysis of these recommendation systems, Facebook writes that “production-scale recommendation models have orders of magnitude more embeddings” compared to benchmarks (source). As these models grow, CPUs and GPUs start to falter. Yet the DataScale system has no problem handling these larger compute and memory requirements, and continues to be a long-term solution that’s built to scale.

Premier Research Labs Push AI to Fight Disease—and Improve Lives

U.S. Department of Energy Accelerates AI With SambaNova Systems

Scientific researchers are exploring ways to combine artificial intelligence (AI) and machine learning (ML) for running complex scientific workloads to gain better performance and efficiency. To advance this work, the United States Department of Energy’s National Nuclear Security Administration (DOE/NNSA), Lawrence Livermore National Laboratory (LLNL), and Los Alamos National Laboratory (LANL) announced a strategic partnership. The cornerstone of this partnership agreement is multiple installations of SambaNova Systems DataScale™.

SambaNova DataScale is a complete, integrated software and hardware systems platform optimized for dataflow from algorithms to silicon. LLNL is coupling DataScale into its Corona supercomputing system. Initial focus has been on using DataScale for National Ignition Facility applications. Corona is primarily being used for COVID-19 drug discovery and LLNL plans to apply DataScale to this workload.

Improved Performance, Accuracy, and Productivity With SambaNova DataScale

SambaNova DataScale is improving overall performance, accuracy, and productivity for these demanding research institutions.

It’s no surprise, as SambaNova DataScale is designed for both efficient deep-learning inference and training calculations. It features the SambaFlow™ software stack and the world’s first Reconfigurable Dataflow Unit, the Cardinal SN10™ RDU. The system contains eight RDUs—each one capable of supporting multiple simultaneous jobs or working seamlessly together to execute large-scale models.


Image of SambaNova DataScale


Ian Karlin is the principal HPC strategist at LLNL. After bringing SambaNova DataScale on-site in September, he reports that early tests show DataScale performing 5X better or more when normalized against GPUs.

Karlin says DataScale was the right choice for LLNL for several reasons; chief among them was the integrated software and hardware systems and the ability to do both training and inference on one platform.

Computer scientist and LLNL Informatics Group Leader Brian Van Essen explains, “We selected SambaNova for this procurement because one of the key features they have is the ability to do training and inference on small batch sizes. Inference at small scales is key; training on small batches is important for retraining and fine-tuning models. That’s something we’ll be doing.” He also cites “maturity of the programming model and the team’s expertise with the software stack” as a crucial aspect of LLNL’s two-year engagement with SambaNova.

Over at LANL, the first application targeted for acceleration with DataScale is modeling quantum chemistry with density-functional theory (DFT)-level accuracy. LANL has developed a workflow for building machine learning models of interatomic energies and forces to enable molecular dynamics (MD) simulations with high accuracy in a computationally efficient manner. These ML models are very faithful to DFT reference calculations and enable reactive chemistry from first principles in support of materials science, chemistry, molecular biology, and drug design.

As reported, these calculations currently run on GPU hardware and are showing further promise of acceleration with the SambaNova DataScale system. An ongoing collaboration between SambaNova Systems and LANL scientists suggests the possibility of up to 5X speedup compared to the existing GPU implementation.

Exploring Breakthrough Advances

LLNL researchers are using SambaNova DataScale to continue exploring the combination of high-performance computing (HPC) and AI, an innovative effort LLNL calls “cognitive simulation” (CogSim). Researchers said the two systems working in tandem will enable more streamlined computation and allow them to move applications into this new computing model.

SambaNova DataScale’s ability to run dozens of inference models at once while performing scientific calculations on the Corona system will aid in their quest to use machine learning to accelerate key applications.

According to LLNL researchers, SambaNova DataScale will be used in the small molecule drug design work being applied to COVID-19 at LLNL, as well as to cancer through the ATOM (Accelerating Therapeutics for Opportunities in Medicine) project. Recent work has produced a machine learning model to improve COVID-19 drug design that uses small batch training, which is important because this type of model converges best at small batches. SambaNova DataScale has the capability for efficient small batch training—a key differentiating feature that sets it apart from GPUs. This work will be integrated into drug design loops that generate new potential compounds, which are then evaluated for safety and efficacy using HPC simulations on the Corona system.

LLNL’s COVID-19 machine learning model is a finalist for the Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research, which will be announced on Nov. 19.

The AI research taking place at labs such as LLNL and LANL is not unique to the public sector. Using similar techniques, forward-thinking enterprises are advancing their own AI initiatives and making significant progress.

Here at SambaNova Systems, we’re excited about the collaboration with DOE/NNSA, LLNL, and LANL. “SambaNova Systems is providing the platform for innovation to enable visionaries to achieve breakthrough advancements in their domains,” says Rodrigo Liang, our co-founder and CEO.

Our partnership with the U.S. Department of Energy is just one example of how we are enabling this.

Repeatable Machine Learning by Design With SambaNova Systems DataScale

If you run the same machine learning application twice with the same inputs, initializations, and random seeds, you often will not get the same result. While randomness or stochasticity in machine learning applications is a desired quality, this non-repeatability is not. For example, in mission-critical applications such as autonomous driving, this non-repeatable behavior can have disastrous implications for model explainability, especially if an audit is required to analyze certain important decisions post hoc. As a result, there is, unsurprisingly, a repeatability crisis occurring in many machine learning domains. At SambaNova Systems, we believe that your architecture should not be subtly adding to this problem in the pursuit of peak performance; hardware architectures should never trade off repeatable computation for performance.

In the following sections, we define repeatability and explain why the problem arises. After that, we dive into single- and multi-socket cases that highlight the repeatability problem. In each case, we show how the SambaNova Systems DataScale SN10-8R with the world’s first Reconfigurable DataFlow Unit (RDU) provides repeatability as an artifact of its design.

Repeatability: A repeatable machine learning program is one where the exact same behavior is observed when user-controlled variables are fixed (same random seeds, same initializations, same inputs, same machine). Think of this as test-retest reliability. This is different from stochasticity or randomness—which are important features that an RDU also enables in a repeatable manner for machine learning applications.

The Problem: Floating-point arithmetic is not associative; that is, the order in which you add floating-point numbers can change the final output.

(A + B) + C ≠ A + (B + C)

This fundamental property of floating-point arithmetic often leads to non-repeatability, especially when complex parallelization primitives or out-of-order execution are inherent in a hardware architecture. Although there are software solutions to ensure repeatability on traditional hardware architectures, users must comb through large amounts of documentation to figure out which operations might cause the problem and/or incur a slight performance degradation to ensure repeatability. Even worse, this problem is exposed in different ways depending on whether one is running single- or multi-socket machine learning applications. In the next sections, we highlight the problem and the RDU’s solution in each case.
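A two-line example makes the non-associativity concrete; the values below are chosen only to make the rounding visible at double precision, and any similar mix of magnitudes behaves the same way:

```python
# Floating-point addition is not associative: the order of operations
# changes which low-order bits are rounded away.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the large terms cancel first, so the 1.0 survives
right = a + (b + c)  # the 1.0 is absorbed into -1e16 before cancellation

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False
```

The same effect, multiplied across millions of summands in a parallel reduction whose order varies from run to run, is what makes results diverge.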

Dataflow is the Key to Single Socket Repeatability on an RDU

Due to complex parallelization primitives and out-of-order execution, high-performance kernels are often non-repeatable on GPU-based architectures. On an RDU, our computation is pure dataflow and, as such, our kernels are always repeatable. To highlight the repeatability of an RDU-based architecture, we compare the behavior of both an RDU and a GPU on a popular machine learning kernel seen in many customer applications. At the core of this kernel is the index add operator, an essential, fundamental indexing operation used in many popular models. In Fig. 1, we show the repeatability measured when running this kernel five times with the same input on both a GPU and an RDU. To measure repeatability, we use the Frobenius norm (a popular tensor metric) of the input gradients as a proxy and compare this norm across multiple runs. As Fig. 1 shows, on the RDU’s dataflow architecture, computational parallelism is always handled in a repeatable way: there is zero difference among the runs, and repeatable results are achieved during every iteration. This is one of many examples highlighting that the dataflow architecture employed by an RDU is repeatable while traditional hardware architectures are not necessarily so.

Fig. 1. Repeatability in single-socket computation between a V100 GPU running with CUDA 10.1 and PyTorch 1.6.0 and an RDU with the latest release of SambaFlow.
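The measurement proxy can be sketched in a few lines; the kernel, shapes, and seed below are stand-ins for illustration, not SambaFlow or CUDA APIs:

```python
import numpy as np

def frobenius_norm(t):
    # Frobenius norm: square root of the sum of squared entries.
    return float(np.sqrt(np.sum(t * t)))

rng = np.random.default_rng(seed=0)
x = rng.standard_normal((64, 64))

norms = []
for _ in range(5):
    grad = 2.0 * x               # stand-in for the input gradient of a kernel
    norms.append(frobenius_norm(grad))

# A repeatable system shows zero spread across identical runs.
print(max(norms) - min(norms))   # 0.0
```

On hardware whose internal reduction order varies between runs, the same loop would show a small but nonzero spread.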


RDU Model Parallelism for Repeatable Multiple Socket Training

While data parallelism remains the most popular form of multi-socket parallelism for machine learning training, model parallelism is emerging as a popular new form of parallelism for multi-socket training. Although there are many inherent benefits of model parallelism, one rarely discussed benefit of model parallelism is that it is naturally more repeatable than data parallelism. This is because data parallelism must synchronize and add gradients across multiple sockets, which can create non-repeatable results (true in popular frameworks such as Horovod). Like traditional architectures, RDUs seamlessly support both data and model parallelism. Unlike traditional architectures, model parallelism was always a core tenet of the RDU’s design.
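A toy float32 reduction shows why the summation order inside an all-reduce matters; the per-socket gradient values below are fabricated to make the rounding visible:

```python
import numpy as np

# One "gradient" per socket, in float32 as in mixed-precision training.
grads = np.array([1e8, 1.0, -1e8], dtype=np.float32)

order_a = np.float32(0.0)
for g in grads:              # (1e8 + 1.0) + (-1e8): the 1.0 is rounded away
    order_a = order_a + g

order_b = np.float32(0.0)
for g in grads[[0, 2, 1]]:   # (1e8 + -1e8) + 1.0: the 1.0 survives
    order_b = order_b + g

print(order_a)  # 0.0
print(order_b)  # 1.0
```

If the all-reduce schedule differs between runs, the aggregated gradient differs, and otherwise-identical training runs slowly drift apart.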

Fig. 2 shows what can happen due to this non-repeatability in data parallel training: we plot the embeddings ("meanings") of the top 200 words in a BERT language model after fine-tuning the model using 8-socket training. Under the exact same conditions (same data, same machine, etc.), the meaning of these words noticeably changes (or drifts) across runs in the data parallel case, a very undesirable behavior for a production model. In contrast, RDU model parallel training retains fidelity of each word’s meaning across runs, which is what one would expect and desire. This is not only desirable behavior for our customers, but necessary for those running mission-critical workloads.

Data Parallel vs. Model Parallel Repeatability

Fig. 2. Comparison of word embedding training repeatability (semantic drift) for (a) 8-socket GPU and (b) 8-socket RDU


The Solution

Based on our own first-hand experiences working in collaboration with partners in our research laboratory, we made repeatability a core feature when designing the SambaNova Systems Reconfigurable Dataflow Unit (RDU) architecture. We believe repeatable, performant hardware is paramount for organizations focused on fast-paced innovation. Therefore, when using RDUs, users do not need to worry about repeatability issues: not only are RDUs more performant than many traditional architectures, but repeatability is a core tenet of our design, achieved without compromise.

Surpassing State-of-the-Art Accuracy in Recommendation Models

Recommender systems are a ubiquitous part of many common and broadly used internet services. They are utilized in retail and e-commerce applications to cross-sell and up-sell products and services. Online consumer services for ridesharing, peer reviews, and banking rely heavily on recommendation models to deliver fast and efficient customer experiences. Everyday examples of recommender systems offering users hit-or-miss advice on social media, news sites, and elsewhere are abundant. That is because providing richer, more meaningful recommendations requires incorporating many more attributes into a recommendation system beyond just a user’s browsing or purchase history. This seems simple and intuitive enough. However, real-world implementations with legacy technology components can diminish efforts to achieve state-of-the-art accuracy.

Recommendation Tasks Place Huge Demands on Both Memory and Computation
The backbone that enables recommendation models to encode such massive volumes of data is the embedding. Embedding tables are large numerical tables that contain encodings of every feature in the data – every user, product, region, etc. It’s well known that larger embedding tables lead to better model quality by making models more expressive and accurate. In order to fully capture all of the information in their data, SambaNova’s industry partners easily utilize embeddings that are hundreds of gigabytes in size—often terabytes!

These embeddings are attached to deep neural networks which perform a large number of calculations in order to generate the final recommendation result.
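As a rough sketch of what an embedding table is, consider a toy lookup in plain numpy; the row count and dimension here are illustrative only:

```python
import numpy as np

# One learned vector per categorical value (user, product, region, ...).
num_users, dim = 1_000_000, 64
table = np.zeros((num_users, dim), dtype=np.float32)

# A lookup is just a row gather by feature id.
user_ids = np.array([3, 17, 42])
vectors = table[user_ids]            # shape (3, 64), fed to the network

# Memory grows linearly with both rows and dimension.
bytes_needed = num_users * dim * 4   # float32 = 4 bytes per entry
print(vectors.shape, bytes_needed)   # (3, 64) 256000000
```

Even this toy table is 256 MB; production tables with billions of rows and larger dimensions reach the hundreds-of-gigabytes scale described above.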

The Benchmark
As a demonstration, we used the SambaNova DataScale system, which is a complete integrated software and hardware system, to train the Deep Learning Recommendation Model (DLRM) on the Criteo Terabyte Clicklogs dataset. This is the MLPerf standard benchmark for recommendation, where the performance metric is AUC on a test set.
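For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal numpy version (a sketch, not the MLPerf reference implementation) is:

```python
import numpy as np

def auc(labels, scores):
    # Mann-Whitney formulation: fraction of (positive, negative) pairs
    # ranked correctly, with ties counting half.
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (correct + 0.5 * ties) / (pos.size * neg.size)

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.8, 0.7, 0.6]
print(auc(labels, scores))  # 0.8333...: 5 of the 6 pairs are ordered correctly
```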

NOTE: Despite containing ~1TB of data and ~100GB of embedding features, this dataset still does not represent a real large-scale production workload. Deployed systems are at least 5x more demanding in terms of both data and embedding sizes. But rest assured—SambaNova Systems Reconfigurable Dataflow Unit (RDU) and the SambaNova DataScale system are built to scale and are well-equipped to tackle those gigantic use cases too.

Unleashing the Power of Embeddings
It’s known that increasing embedding dimensions improves recommender model accuracy at the cost of model size. Many recent studies have been devoted to sharding the model or reducing the embedding dimensions to fit in GPU memory. SambaNova Systems researchers have pioneered superior methods for solving this problem via vertical engineering through our integrated software and hardware stack. We demonstrate this by exceeding state-of-the-art accuracy on the DLRM model by significantly increasing the embedding dimensionality. In an ablation study where everything else is held constant, we find that the model’s accuracy strictly increases with embedding dimensions when trained on a single SambaNova Systems RDU. Meanwhile, on a single GPU, attempts to execute the model at these larger embedding dimensions fail outright.

Fig 1: Effects of Embeddings dimensions on single RDU and single GPU

Exploring New Batch Sizes and Breaking the GPU Mold
Popular training techniques place a large focus on increasing mini-batch size to saturate GPU computation. For example, Nvidia’s demo implementation of DLRM uses batch sizes of 32768 and higher.

From a statistical standpoint, this isn’t always the preferred decision. As prior studies have shown, decreasing the batch size can actually have strong benefits, helping a model avoid sharp minima so it can generalize more effectively. When training DLRM on the SambaNova Systems RDU, we observed noticeable improvements in validation performance when decreasing the batch size.

Fig 2: Enhanced RDU performance with batch size reduction

In reality, machine learning researchers and engineers choose these giant, suboptimal batch sizes because their current infrastructure leaves them no alternative. The GPU’s kernel-oriented execution suffers significantly when batch size decreases. On the other hand, with the SambaNova Systems RDU’s Dataflow architecture and intelligent software stack, system resources can still be fully utilized and achieve strong throughput regardless of batch size.

Fig 3: Negligible throughput degradation on RDU compared to GPU with smaller batch size

A New State of the Art
By combining our findings from above, we can use the SambaNova Systems RDU to train a new variant of DLRM that achieves a validation AUC of 0.8046 on the Criteo Terabyte dataset. In comparison, the best AUC reported by NVIDIA in their MLPerf submission is 0.8027. This unique large-embedding, small-batch model would be impossible to run on a GPU, and impractical to run on a CPU.

Fig 4: RDU exceeds MLPerf and GPU thresholds when training a new DLRM variant

In addition to having a noticeably higher peak AUC, the new and improved DLRM also converges much faster.

Powering Next Generation of Recommender Models
SambaNova Systems’ robust yet performant RDU technology enables machine learning engineers to explore an entirely new world of models, unlocking results that surpass the current state of the art. When applied to business-critical recommender models, this leads to significant enhancements in business outcomes and huge boosts in revenue. In Tencent’s words, “The reason we care about small amount AUC increase is that in several real-world applications we run internally, even 0.1% increase in AUC will have a 5x amplification (0.5% increase) when transferred to final CTR”.

Breakthrough Efficiency in NLP Model Deployment

Throughout their lifecycles, modern industrial NLP models follow a cadence. They start from one-time task-agnostic pre-training and then go through task-specific training on quickly changing user data. These periodically updated models are eventually deployed to serve massive online inference requests from applications.

A current active research trend is deploying state-of-the-art NLP models, like BERT, for online inference. As models grow larger each year, there is growing debate on how to deploy these models in real-time pipelines. To enable practical deployment, various techniques have been developed to distill large models down to compact variants. In applications such as digital assistants and search engines, these compact models are the key to attaining low-latency, high-accuracy models that satisfy service level requirements.

SambaNova Systems provides a solution for exploring and deploying these compact models—from a single SambaNova Systems Reconfigurable Dataflow Unit (RDU) to multiple SambaNova DataScale systems—delivering unprecedented advantages over conventional accelerators for low-latency, high-accuracy online inference.

The Proven Power of Dataflow Execution on RDU

The latency of compact models on GPUs is fundamentally limited by their kernel-based execution model. For online inference with batch size 1, the overhead of context switching and off-chip weight memory access for operation kernels can dominate latency on traditional architectures. The SambaNova RDU is built on the SambaNova Systems Reconfigurable Dataflow Architecture (RDA) to remove this barrier. Specifically, on a recently proposed compact BERT model, TinyBERT, the RDU can attain a 5.8X latency speedup over a V100 GPU for MNLI, a popular text classification task.

Fig 1: Latency comparison for online inference


In applications such as a digital assistant or a search engine, the input data are natural language tokens with short sequence length, e.g., smartphone assistant queries such as “What is the weather in San Francisco?”. For these types of scenarios, reduced sequence length typically has a negligible impact on the accuracy attained by compact models. This is another characteristic that is deeply coupled with the latency advantage of RDUs. While a GPU’s latency saturates with reduced sequence length for compact models, the RDU’s latency improves with reduced sequence length.

As shown in Figure 2, the TinyBERT model can match state-of-the-art model accuracies across sequence lengths from 64 to 256 on the MNLI benchmark task that we use as a proxy. In Figure 3, we can see that the GPU demonstrates the same latency across sequence lengths. However, the speedup of the RDU over the GPU is boosted to 8.7X at a reduced sequence length of 64.

Fig 2: RDU and GPU model accuracy for different sequence length


Fig 3: Bar chart for RDU and GPU latency for different sequence length


Amplifying Accuracy With SambaNova Systems DataScale

Our dataflow-optimized chip demonstrates unprecedented capability for low-latency online inference on compact models. Utilizing these capabilities, our research labs have also shown that the full SambaNova DataScale system (8 sockets) can be used to attain bleeding-edge accuracy while performing low-latency inference on compact NLP models.

The study from the SambaNova Systems research lab shows that majority voting across multiple model instances can significantly boost the accuracy attained by TinyBERT (Fig. 4). The SambaNova DataScale system is perfectly designed to exploit these accuracy gains efficiently. We show that we can deploy multiple TinyBERT models onto all eight sockets of the SambaNova DataScale system. As shown in Fig 5, when ensembling TinyBERT models, classification accuracy is boosted by 0.4% at a negligible latency cost compared to a single TinyBERT model on an RDU.
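The voting scheme itself is simple; here is a sketch with made-up expert outputs (the ensemble in the study runs one model instance per socket):

```python
from collections import Counter

def majority_vote(predictions):
    # Return the label predicted by the most experts for one input.
    return Counter(predictions).most_common(1)[0][0]

# Eight experts classifying one MNLI example:
experts = ["entailment", "entailment", "neutral", "entailment",
           "contradiction", "entailment", "entailment", "neutral"]
print(majority_vote(experts))  # entailment
```

Because each expert runs on its own socket in parallel, the ensemble’s end-to-end latency stays close to that of a single model.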

Fig 4. Model accuracy with different numbers of experts for ensemble


Fig 5. Comparison of latency for a single TinyBERT on one RDU and 8 experts on an 8-socket system

The compact BERT model is just one important case where our SambaNova Systems DataScale provides a tailored solution for low-latency, high-accuracy online inference.

Pushing Computer Vision Boundaries Beyond 4K

In machine learning image processing and analysis, resolution is everything. Higher resolution enables a more detailed, meaningful analysis that results in greater understanding. To this end, a high-resolution image will contain more information and detail than a low-resolution image of the very same subject.

Today, higher-resolution image processing requires significant computational capabilities. So much so, in fact, that training models on these high-resolution images has rendered current state-of-the-art technologies unusable.

When it comes to high-resolution processing, legacy architecture gridlock is holding back research and technology advances across numerous use cases, including in areas such as autonomous driving, oil and gas exploration, medical imaging, anti-viral research, astronomy, and more.

Surpass the Limits of the GPU
SambaNova Systems has been working with industry partners to develop an optimized solution for training computer vision models at ever-growing levels of resolution—without compromising high accuracy. We take a “clean sheet” complete-systems approach to enable native support for high-resolution images. Co-designing across our complete stack of software and hardware provides freedom and flexibility from legacy GPU architecture constraints and legacy spatial partitioning methods.


Adding More GPUs Isn’t the Answer
If you consider images in the context of AI/ML training data, the richer and more expansive your training information (i.e., images), the more accurate your results can be.

Using a single GPU to train high-resolution computer vision models predictably results in "Out of Memory" errors. Clustering multiple GPUs to aggregate their memory, on the other hand, brings all the challenges of disaggregating the computational workload across each individual GPU in the cluster.

And this is not merely a matter of clustering a few GPUs in a single system, but of aggregating hundreds, if not thousands, of GPU devices. In addition, conventional data parallel techniques that slice the input image into independent tiles deliver less accurate results than training on the original image.

Train Large Computer Vision Models with High-Resolution Images



Massive Data: A single SambaNova DataScale™ system—with petaflops of performance and terabytes of memory—is built on a Dataflow architecture. This co-design of software and hardware enables high-performance processing of a range of complex structures such as high-resolution images, pushing computer vision boundaries far beyond 4K.

SambaFlow™ Software: The SambaFlow software stack transforms deep learning operations to run seamlessly on tiled inputs. Native support for tiling input images and intermediate tensors, and for handling convolution overlap, is fully automated. The results are equivalent to the non-tiled version and require no changes to the application or programming model.
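The key idea behind overlap handling can be demonstrated with a toy convolution: if each tile carries a "halo" of neighboring pixels, convolving the tiles and stitching the outputs reproduces the full-image result exactly. This is a generic sketch of the technique, not SambaFlow internals; all sizes and names are illustrative.

```python
import numpy as np

def conv2d_valid(img, k):
    """Naive 'valid'-mode 2D convolution (cross-correlation), for illustration."""
    kh, kw = k.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

full = conv2d_valid(img, k)  # (6, 6) reference result on the whole image

# Split into top/bottom tiles with a (kernel-1)-row halo so the kernel
# sees the same context at the seam as it would in the full image.
halo = k.shape[0] - 1
top = conv2d_valid(img[:4 + halo], k)   # produces output rows 0..3
bottom = conv2d_valid(img[4:], k)       # produces output rows 4..5
stitched = np.vstack([top, bottom])

assert np.allclose(stitched, full)
```

Without the halo, pixels near the seam would lack their neighbors and the tiled result would diverge from the non-tiled one—which is exactly the accuracy loss that independent-tile slicing suffers.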

Robust Architecture: SambaNova’s Reconfigurable Dataflow Architecture is critical for efficient processing of input image tiles, which are fully materialized in device memory, unlike with non-Dataflow devices such as GPUs.

Re-Think What’s Possible
The resulting solution is not limited by memory capacity and can process images of any size on a single DataScale system. End users then have the option of scaling up additional DataScale compute resources to further reduce training times while maintaining high levels of utilization and accuracy.

The high-resolution image processing breakthroughs achieved on SambaNova DataScale allow organizations to cut years of development time, significantly simplify architecture, and ease programmability. All this while yielding state-of-the-art results and capabilities.

A New State of the Art in NLP: Beyond GPUs

As Natural Language Processing (NLP) models grow ever larger, GPU performance and capability degrade rapidly. We’ve been talking to a number of organizations in a range of industries that need higher quality language processing but are constrained by today’s solutions.


Groundbreaking Results, Validated in Our Research Labs

SambaNova has been working closely with many organizations over the past few months and has established a new state of the art in NLP. This advancement in NLP deep learning is illustrated by a GPU-crushing, world record performance result achieved on SambaNova Systems’ Dataflow-optimized system. We used a new method to train multi-billion parameter models that we call ONE (Optimized Neural network Execution). This result highlights orders-of-magnitude performance and efficiency improvements, achieved by using significantly fewer, more powerful systems compared to existing solutions.

Break Free of GPU Handcuffs
SambaNova Systems’ Reconfigurable Dataflow Architecture™ (RDA) enables massive models that previously required 1,000+ GPUs to run on a single system, while utilizing the same programming model as on a single SambaNova Systems Reconfigurable Dataflow Unit™ (RDU).

SambaNova RDA is designed to efficiently execute a broad range of applications. RDA eliminates the deficiencies caused by the instruction sets that bottleneck conventional hardware today.



Run Large Model Architectures with a Single SambaNova Systems DataScale™ System
With GPU-based systems, developers have been forced to do complicated cluster programming for multiple racks of systems and to manually program data parallelism and workload orchestration.
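To make the bookkeeping concrete, here is a toy sketch of the kind of manual data parallelism the paragraph above describes: shard a batch across simulated devices, compute per-shard gradients, then all-reduce (average) them. This is a generic illustration of the pattern, not any vendor's API; the model is a simple least-squares objective.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))   # toy dataset: 16 examples, 4 features
y = rng.standard_normal(16)
w = np.zeros(4)                    # current model parameters

def grad(Xs, ys, w):
    # Gradient of 0.5 * mean((Xs @ w - ys)**2) on one shard.
    return Xs.T @ (Xs @ w - ys) / len(ys)

# Shard the batch across 4 simulated "devices" and all-reduce the gradients --
# the orchestration a developer must hand-code on a GPU cluster.
shards = np.array_split(np.arange(16), 4)
grads = [grad(X[idx], y[idx], w) for idx in shards]
g = np.mean(grads, axis=0)         # the "all-reduce" (average) step

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
assert np.allclose(g, grad(X, y, w))
```

Every step here—sharding, per-device launch, reduction—must be scheduled, synchronized, and debugged by hand at cluster scale, which is the burden a single large-memory system avoids.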

A single SambaNova DataScale System with petaflops of performance and terabytes of memory ran the 100-billion parameter ONE model with ease and efficiency, and with plenty of usable headroom. Based on our preliminary work and the results we achieved, we believe running a trillion-parameter model is quite conceivable.
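Some quick arithmetic shows why terabytes of memory matter at this scale. The per-parameter byte counts below are common rules of thumb (fp32 weights; roughly 16 bytes per parameter for weights plus Adam optimizer state in mixed-precision training), not SambaNova-reported figures.

```python
# Rough memory footprint of a 100-billion-parameter model.
# Byte-per-parameter figures are standard rules of thumb, assumed here.
params = 100e9
weights_fp32_tb = params * 4 / 1e12    # weights alone, fp32
adam_mixed_tb = params * 16 / 1e12     # weights + Adam state, mixed precision
print(f"{weights_fp32_tb:.1f} TB weights, ~{adam_mixed_tb:.1f} TB training state")
```

Even before activations, the training state alone runs to over a terabyte—far beyond any single GPU, but within reach of a system holding terabytes of memory.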

The proliferation of Transformer-based NLP models continues to stress the boundaries of GPU utility. Researchers are continuing to develop bigger models, and as a result the stress fractures on GPU-based deployments are also getting bigger. By maintaining the same simple programming model from one to many RDUs, organizations of all sizes can now run big models with ease and simplicity.

The sophistication of SambaNova Systems’ SambaFlow™ software stack paired with our Dataflow-optimized hardware eliminates overhead and maximizes performance to yield unprecedented results and new capabilities.


No Boundaries, Only New Possibilities for NLP
Three trends have emerged in NLP that are pushing infrastructure requirements far beyond the capabilities of current GPU architecture. These trends, below, highlight attributes that enhance SambaNova Systems DataScale’s ability to deliver world record throughput performance and unlock capabilities that were previously unattainable.

Kunle Olukotun, one of SambaNova Systems’ esteemed co-founders and the company’s chief technologist, describes our systems best: “SambaNova engineered a purpose-built Reconfigurable Dataflow Architecture that expands the horizons of capability for the future of machine learning. Users, developers, and applications are now liberated from the constraints of legacy architectures.”
