(Compute) Power to the People: Democratizing AI: A Conversation with AI Visionaries from SambaNova Systems

 

With the rapid growth in artificial intelligence (AI) and machine learning (ML) applications, the demand for compute power is expanding at an exponential pace. Even graphics processing units (GPUs), widely adopted as AI/ML compute workhorses, can struggle to keep up with increasing processing demands.

Fortunately, ground-breaking solutions are just around the corner. At Samsung, we’re developing the next generation of DRAM memory, DDR5, as a key technology for the next generation of data processing.

SambaNova Systems, a leader in the AI/ML space, provides vitally needed AI/ML-centric computing solutions for companies and organizations of all sizes. Their platform offerings (DataScale and Dataflow-as-a-Service) are expected to leverage the power of Samsung's DDR5 to provide the performance required for AI/ML applications.

I recently had the unique opportunity to speak with four key players at SambaNova Systems. SambaNova Co-founder and CEO Rodrigo Liang, Co-founder and Chief Technologist Kunle Olukotun, VP of Product Management Marshall Choy, and Chief Architect Sumti Jairath are true visionaries in the field.

I asked them how SambaNova’s innovative systems will leverage the power of DDR5 to allow AI and ML to reach their full potential.

 

Sarah Peach (SP):
What’s going on in the world of big data computing today that makes SambaNova’s products necessary?

Rodrigo Liang (RL):
Turning data into something truly valuable—that’s our generation’s official Gold Rush. But in order to truly realize the value of their data, companies and organizations need a computing system that is easy to use, and that allows them to efficiently access the data being collected. That’s where SambaNova comes in; we help to connect the demand for data to the supply of data.

Kunle Olukotun (KO):
We have developed next-generation computation based on dataflow architecture. Dataflow computing is well-suited to tackle problems in the AI/ML space, so we set out to create the world’s first scalable dataflow system.

RL:
We saw a need for a computing platform for AI that’s easy to use and provides capabilities not covered by existing solutions. From natural language processing (NLP) to high resolution video, there is a growing need for a high-capacity AI/ML platform. We figured out how to do that type of high-capacity compute more efficiently.

SP:
Many AI/ML platforms rely on GPUs for efficient compute. What’s different about SambaNova’s approach?

Sumti Jairath (SJ):
GPUs are essentially hardware built for video games, and AI/ML clearly needs more powerful platforms. Our higher-capacity platform starts to unlock the possibility of higher-resolution images, particularly in areas like medical imaging: we can push far beyond 4K in MRIs, for instance. Our platforms also enable state-of-the-art NLP models and recommendation engines.

RL:
In the past, in order to maximize the performance of GPUs, manufacturers would use HBM [High Bandwidth Memory] rather than DDR. While HBM provided greater bandwidth than DDR, the tradeoff was that it had a lower memory capacity. But our solutions combine high-capacity DDR with high bandwidth.

SP:
Samsung’s memory and storage technologies are helping companies like SambaNova make AI/ML a reality today. With DDR5 DRAM becoming available, how do you see this technology improving SambaNova’s offerings to its current and potential customer base?

RL:
Demand for compute power is outpacing supply. At SambaNova, we're always looking for new technology solutions to bridge this gap between demand and supply. DDR5 is one such solution. DDR5 allows machine learning models to run at higher accuracy, while easing the challenge of labeling image data for ML and retaining the information in high-resolution images. And with NLP, the greater memory capacity of DDR5 means more parameters and vastly improved performance.

SJ:
DDR5 influences two key factors: dataflow architecture and memory capacity. Higher-resolution models require much more memory capacity, and DDR5 provides it.

 

SP:
What benefits do your high-capacity AI platforms deliver to end users, and to society?

RL:
For the everyday mobile or home user, more memory capacity from DDR means better recommendation engines on apps and websites. More memory gives developers the ability to do larger embeddings. In retail, for instance, this will allow ecommerce sites to more accurately recommend related items, leading to better consumer experiences, while increasing revenue for retailers.

Marshall Choy (MC):
The power that AI and ML bring to medical research is accelerating drug research and drug trial processes, which could lead to the discovery of new drugs and antiviral compounds. AI is already helping to accelerate Covid-19 research by speeding up data analysis and modeling. The Lawrence Livermore Cancer Moonshot project is another example of AI being leveraged in medical research.

SJ:
Higher resolution image analysis will lead to improved cancer detection. Reading scan imagery, for example, might require high resolution over a wide area. By enabling this type of higher resolution, SambaNova’s efforts could lead to earlier detection and better patient outcomes.

SP:
Do SambaNova’s two product offerings make AI easier to use?

RL:
Yes, by increasing efficiency and ease of use for this kind of platform, and by opening access to the technology to smaller firms in addition to the big players.

Our first product, DataScale, is an integrated software and hardware systems platform that allows people to run cutting-edge AI applications at scale. DataScale's software-defined-hardware approach delivers a high degree of efficiency across applications, from training and inference to data analytics and high-performance computing.

Our second offering is how we've democratized management of AI/ML modeling. Dataflow-as-a-Service is a subscription service that we developed based on customer requests. AI and ML are challenging, and most companies don't have the expertise or manpower to manage AI/ML models at scale. Dataflow-as-a-Service takes SambaNova's expertise and learning and adds that to the client's team. The service allows our clients to run cutting-edge AI/ML applications without a large in-house team.

 

SP:
What makes you excited about the future with regards to AI? Next year, five years out, ten years out?

RL:
The transition to AI will be bigger than the advent of the Internet. It will touch every company in every industry, and create all sorts of opportunities. Yet when it comes to adopting AI, we’re basically in just the first third of an inning of an extra inning game. We can only imagine what the tech will look like in 5-10 years. By then, the majority of computers in datacenters will be running AI/ML tech like our products.

KO:
The center of gravity of AI/ML today is the big companies. But our technology will allow much smaller companies—and smaller systems—to run AI/ML.

RL:
Widespread technological adoption only happens when you make the tech easy for everyone to use. So this is our way of democratizing technology like AI/ML. Rather than exposing all of the complex technology to the end user, SambaNova delivers technology that the end user can plug in and learn within an hour.

SambaNova’s DataScale and Dataflow-as-a-Service provide vitally needed AI/ML-centric computing solutions for companies and organizations of all sizes. In addition, they make the technology easy to use, particularly for firms without large in-house teams. Their systems will leverage the power of DDR5 to provide the compute power required to realize the full potential of AI/ML. In so doing they will enable amazing new technologies that will positively impact our lives and help shape our future.

Learn more about Samsung DDR5 technology at www.samsungsemiconductor-us.com/ddr5.

Accelerating Scientific Applications With SambaNova Reconfigurable Dataflow Architecture

Artificial intelligence (AI)-driven science is an integral component of several science domains, such as materials, biology, high energy physics, and smart energy. Science workflows can span one or more computational, observational, and experimental systems. The AI for Science report,¹ put forth by a wide community of stakeholders from national laboratories, academia, and industry, stresses the need for tighter integration of the AI infrastructure ecosystem with experimental and leadership computing facilities. The AI components of science applications, which generally deploy deep learning (DL) models, are unique and exhibit characteristics different from those of traditional industrial workloads. They implement complex models and typically incorporate hundreds of millions of model parameters. Data from simulations are usually sparse, multimodal, and multidimensional, and exhibit temporal and spatial correlations. Moreover, AI-driven science applications benefit from flexible coupling of simulations with DL training or inference.

The complexity of AI for science workloads, with their increasingly large DL models, is poorly served by traditional computing architectures. Adopting novel AI architectures and systems designed to accelerate machine learning models is critical to reducing the time-to-discovery for science.

The Argonne Leadership Computing Facility (ALCF), a US Department of Energy Office of Science user facility, provides supercomputing resources to power scientific breakthroughs. Applications with significant DL components are increasingly being run on existing supercomputers at the facility. Scientists at ALCF are exploring novel AI-hardware systems, such as SambaNova, in an attempt to address the challenges in scaling the performance of AI models.

KEY ATTRIBUTES FOR A NEXT-GENERATION ARCHITECTURE

Through academic research, analysis of technology trends, and knowledge developed in the design process, SambaNova identified the following key attributes to enable highly efficient dataflow processing.

  • Native dataflow—Commonly occurring operators in machine learning frameworks and domain-specific languages (DSLs) can be described in terms of parallel patterns that capture parallelizable computation on both dense and sparse data collections, along with the corresponding memory access patterns (a minimal sketch of this pattern view follows this list). This enables exploitation and high utilization of the underlying platform while allowing a diverse set of models to be easily written in any machine learning framework of choice.
  • Support for terabyte-sized models—A key trend in DL model development is the use of increasingly large models to gain higher accuracy and deliver more sophisticated functionality. For example, leveraging billions of model parameters enables more accurate natural language generation. In the life sciences field, analyzing tissue samples requires the processing of large, high-resolution images to identify subtle features. Providing much larger on-chip and off-chip memory stores than those available on core-based architectures will accelerate DL innovation.
  • Efficient processing of sparse data and graph-based networks—Recommender systems, friend-of-friends problems, knowledge graphs, some life science domains, and more involve large sparse data structures that consist of mostly zero values. Moving around and processing large, mostly empty matrices is inefficient and degrades performance. A next-generation architecture must intelligently avoid unnecessary processing.
  • Flexible model mapping—Currently, data-parallel and model-parallel techniques are used to scale workloads across the infrastructure. However, the programming cost and complexity are often prohibitive for new DL approaches. A new architecture should automatically enable scaling across infrastructure without this added development and orchestration complexity, avoiding the need for model developers to become experts in system architecture and parallel computing.
  • Incorporate SQL and other pre- and post-processing of data—As DL models grow and incorporate a wider variety of data types, the dependence on preprocessing and postprocessing of data becomes dominant. Additionally, the time lag and cost of extract, transform, and load (ETL) operations impact real-time system goals. A new architecture should allow the unification of these processing tasks on a single platform.
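As a minimal illustration of the parallel-pattern view mentioned in the first bullet, the computation y = sum(xᵢ²) can be expressed as a map followed by a reduce; the pattern, rather than an instruction stream, becomes the unit a dataflow compiler reasons about. This is a generic Python sketch, not SambaNova's API:

```python
from functools import reduce

# y = sum(x_i^2) expressed as two composable parallel patterns rather than a loop:
xs = [1.0, 2.0, 3.0, 4.0]
squared = map(lambda x: x * x, xs)           # map: elementwise, trivially parallel
total = reduce(lambda a, b: a + b, squared)  # reduce: a tree-parallelizable accumulation
print(total)  # 30.0
```

Because each pattern declares both its computation and its data access shape, a compiler can map it to hardware without analyzing arbitrary control flow.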

New Approach: SambaNova Reconfigurable Dataflow Architecture

The SambaNova Reconfigurable Dataflow Architecture (RDA) is a computing architecture designed to enable the next generation of machine learning and high-performance computing applications. The RDA is a complete, full-stack solution that incorporates innovations at all layers, including algorithms, compilers, system architecture, and state-of-the-art silicon.

The RDA provides a flexible, dataflow execution model that pipelines operations, enables programmable data access patterns, and minimizes excess data movement found in fixed, core-based, instruction set architectures. It does not have a fixed instruction set architecture (ISA) like traditional architectures, but instead is programmed specifically for each model resulting in a highly optimized, application-specific accelerator.

The RDA is composed of the following.

SambaNova Reconfigurable Dataflow Unit (RDU) is a next-generation processor designed to provide native dataflow processing and programmable acceleration. It has a tiled architecture that comprises a network of reconfigurable functional units. The architecture enables a broad set of highly parallelizable patterns contained within dataflow graphs to be efficiently programmed as a combination of compute, memory, and communication networks.

The RDU is the engine that efficiently executes dataflow graphs. It consists of a tiled array of reconfigurable processing and memory units connected through a high-speed, 3-D on-chip switching fabric. When an application is started, SambaNova Systems SambaFlow software configures the RDU elements to execute an optimized dataflow graph for that specific application. Figure 1 shows a small portion of an RDU with its components described below.

Pattern compute unit (PCU)—The PCU is designed to execute a single, innermost-parallel operation in an application. The PCU datapath is organized as a multistage, reconfigurable SIMD pipeline. This design enables each PCU to achieve high compute density and exploit both loop-level parallelism across lanes and pipeline parallelism across stages.

Pattern memory unit (PMU)—PMUs are highly specialized scratchpads that provide on-chip memory capacity and perform a number of specialized intelligent functions. The high PMU capacity and its distribution throughout the PCUs minimize data movement, reduce latency, increase bandwidth, and avoid off-chip memory accesses.

Figure 1. A small portion of an RDU and its component units.

Switching fabric—The high-speed switching fabric that connects PCUs and PMUs is composed of three switching networks: scalar, vector, and control. These switches form a 3-D network that runs in parallel to the rest of the units within an RDU. The networks differ in granularity of data being transferred; scalar networks operate at word-level granularity, vector networks at multiple word-level granularity, and control at bit-level granularity.

Address generator units (AGU) and coalescing units (CU)—AGUs and CUs provide the interconnect between RDUs and the rest of the system, including off-chip DRAM, other RDUs and the host processor. RDU-Connect provides a high-speed path between RDUs for efficient processing of problems that are larger than a single RDU. The AGUs and CUs working together with the PMUs enable RDA to efficiently process sparse and graph-based datasets.

Reconfigurability, the exploitation of parallelism at multiple levels, and the elimination of instruction processing overhead give RDUs their significant performance advantages over traditional architectures.

SambaFlow is a complete software stack designed to take input from standard machine learning frameworks such as PyTorch and TensorFlow. SambaFlow automatically extracts, optimizes, and maps dataflow graphs onto RDUs, allowing high performance to be obtained without the need for low-level kernel tuning. SambaFlow also provides an API for expert users and those who are interested in leveraging the RDA for workloads beyond machine learning. Figure 2 shows the components of SambaFlow, which are described below.

Figure 2. Components of SambaFlow.

User entry points—SambaFlow supports the common open-source machine learning frameworks PyTorch and TensorFlow. Serialized graphs from other frameworks and tools are also imported here.

Dataflow graph analyzer and dataflow graphs—Accepts models from the frameworks, then analyzes each model to extract its dataflow graph. For each operator, the computation and communication requirements are determined so that the appropriate RDU resources can be allocated later. The analyzer determines the most efficient mappings of the operators and communication patterns to the RDU using the spatial programming model. With knowledge of both the model architecture and the RDU architecture, the analyzer can also perform high-level, domain-specific optimizations like node fusion. The output of the Dataflow Graph Analyzer is an annotated Dataflow Graph that serves as the first intermediate representation (IR) passed to the Dataflow Compiler.

Template Compiler and Spatial Templates—For cases where operators are required but not available in the existing frameworks, new operators can be described via a high-level, tensor index notation API. The Template Compiler will then analyze the operator and generate an optimized dataflow implementation for the RDU, called a Spatial Template. The generated template includes bindings that enable the new operator to be used directly from the application code in the same way as built-in framework operators.
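SambaFlow's template API is not reproduced here, but tensor index notation itself is easy to picture: the index expression is the entire operator specification, from which a compiler can derive both the compute and the memory access pattern. NumPy's einsum gives the flavor (an analogy only, not SambaNova's interface):

```python
import numpy as np

# C[i, j] = sum_k A[i, k] * B[k, j] -- the index string fully specifies the operator.
A = np.random.rand(4, 5)
B = np.random.rand(5, 3)
C = np.einsum("ik,kj->ij", A, B)  # compute and data movement are both implied by "ik,kj->ij"
print(C.shape)  # (4, 3)
```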

Dataflow Compiler, Optimizer, and Assembler—This layer receives annotated Dataflow Graphs and performs high-level transformations like meta-pipelining, multisection support, and parallelization. It also understands the RDU hardware attributes and performs low-level transforms, primarily placing and routing by mapping the graph onto the physical RDU hardware and then outputting an executable file. As before, a spatial programming approach is used to determine the most efficient location of RDU resources.

SambaNova Systems DataScale is a complete, rack-level, data-center-ready accelerated computing system. Each DataScale system configuration consists of one or more DataScale nodes, integrated networking, and management infrastructure in a standards-compliant data center rack, referred to as the DataScale SN10-8R. Additionally, SambaNova DataScale leverages open standards and common form factors to ease adoption and streamline deployment.

Deployment at Argonne

The SambaNova DataScale system deployed at ALCF is a DataScale SN10-8R system consisting of two SN10-8 systems. Each SN10-8 system consists of a host module and 8 RDUs. The RDUs on a system are interconnected via RDU-Connect, and the systems are interconnected using an InfiniBand-based interconnect. Together, these enable both model parallelism and data parallelism across the RDUs in the system.

We evaluated the SambaNova system with a diverse range of DL application models of interest to science. These application models also exhibit diverse characteristics in terms of model architectures and parameters. Additionally, a Bidirectional Encoder Representations from Transformers (BERT) model was evaluated.


CANDLE Uno: The Exascale DL and Simulation Enabled Precision Medicine for Cancer project (CANDLE)² implements DL architectures that are relevant to problems in cancer. These architectures address problems at three biological scales: cellular, molecular, and population. The goal of the Uno model, part of the CANDLE project, is to build neural network-based models to predict tumor response to single and paired drugs, based on molecular features of tumor cells. It implements a DL architecture with 21 million parameters.

The Uno model performs well on the RDU for a variety of reasons. First, the model has a large number of parameters, which can be served directly from on-chip SRAM. The RDU has 100s of TB/s of bandwidth for repeated use in the network, which is much higher bandwidth than what is provided in other architectures. Second, the model has a reasonable number of nonsystolic operations. SambaFlow constructs a dataflow graph from these operations and schedules sections of the computational graph, providing very high efficiency in executing these operations without requiring any manual kernel fusion.

UNet: UNet⁶ is a modified convolutional network architecture for fast and precise segmentation of images with fewer training samples. The upsampling operators in the model layers increase the resolution of the output. This model is commonly used for segmentation in imaging science applications, such as in accelerators and connectomics.

Similar to the Uno model, UNet also performs better on the RDU than on traditional architectures. The large memory capacity of the RDU, which starts at 3 TB and goes to 12 TB per 8 RDUs, enables the RDU to handle high-resolution images natively, without any compromise on image quality or batch size. Additionally, the dataflow architecture of the RDU provides for computation over overlapping pixels on the same device, without introducing any communication latency.

CosmicTagger: The CosmicTagger application³ in the high energy particle physics domain deals with detecting neutrino interactions in a detector overwhelmed by cosmic particles. The goal is to classify each pixel to separate cosmic pixels, background pixels, and neutrino pixels in a neutrino dataset. The application uses multiple 2-D projections of the same 3-D particle tracks, and the raw data are three images per event. The training model is a modified UResNet architecture for multiplane semantic segmentation and is available in both single-node and distributed-memory multinode implementations.

Due to the high-resolution images in the neutrino detectors, the memory requirements for GPU training of the CosmicTagger application are high enough to exceed Nvidia V100 memory sizes in most configurations at full image resolution. During this evaluation, we demonstrated that the spatial partition of SambaFlow allows training at full resolution (as opposed to downsampled images on GPUs), leading to an improvement of state-of-the-art accuracy (mIoU) as seen in Figure 3.

Figure 3. CosmicTagger accuracy (mIoU) results.

Gravwaves: The multimessenger astrophysics project aims to observe astrophysical phenomena using gravitational waves and requires large-scale computing.⁴ This is achieved by developing algorithms that significantly increase the depth and speed of gravitational wave searches and that can process terabyte-sized datasets of telescope images in real time. The model has a root-leaf architecture. The shared root component of the model is composed of seven convolutional layers, and its output is shared by the leaves. There are multiple leaf parts in the model, one per individual physical parameter. Each leaf part consists of multiple sequential fully connected layers with ReLU, identity, and TanH activations. The neural network structure thus includes a general feature extractor in the first seven layers; the subnetwork leaves learn specialized features for the different physical parameters.
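The root-leaf shape is easy to express in any modern framework. Here is a toy PyTorch sketch of the pattern described above (layer sizes and counts are illustrative, not the project's actual configuration):

```python
import torch
import torch.nn as nn

class RootLeafNet(nn.Module):
    """Toy root-leaf model: a shared convolutional root feeding one leaf head per parameter."""
    def __init__(self, n_leaves: int = 3):
        super().__init__()
        self.root = nn.Sequential(                       # shared feature extractor
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.leaves = nn.ModuleList(                     # one small regressor per physical parameter
            nn.Sequential(nn.Linear(8 * 64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
            for _ in range(n_leaves)
        )

    def forward(self, x):
        shared = self.root(x)                            # the root output is shared by all leaves
        return [leaf(shared) for leaf in self.leaves]

outs = RootLeafNet()(torch.randn(2, 1, 64))              # batch of 2 length-64 signals
print([o.shape for o in outs])                           # three heads, each (2, 1)
```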

Gravwaves is another network that performs well on the RDU, since its compute-to-communication ratio is low. On traditional architectures, the kernel-by-kernel execution method loses a lot of efficiency in scheduling the kernels on the device. Since the RDU schedules the whole graph on the device, a much higher execution efficiency is achieved. Additionally, the RDU dataflow architecture implements convolutions from various building blocks, which allows for the execution of more exotic convolution operations with the same high efficiency as standard convolution operations.

BERT: BERT⁵ is a neural network-based technique for natural language processing (NLP) pretraining. BERT makes use of a Transformer mechanism with attention that learns contextual relations between words. These architectures are being pursued to mine scientific literature in domains including the biosciences and materials science, among others. BERT involves various DL kernels and has 110–340 million parameters. Similar to the Uno model, BERT realizes benefits on the RDU due to its large number of parameters, which are served from on-chip SRAM, and the way the RDU constructs and schedules the graph at a high level of efficiency.

CONCLUSION

Our exploratory work finds that the SambaNova RDA, along with the SambaFlow software stack, provides an attractive system and solution for accelerating AI for science workloads. We have observed the efficacy of using the system with a diverse set of science applications and have reasoned about their suitability for performance gains over traditional hardware. Because the DataScale system provides a very large memory capacity, it can be used to train models that typically do not fit on a GPU. The architecture also provides for deeper integration with upcoming supercomputers at the ALCF to help advance science insights.

Originally Published March 26, 2021 by the IEEE COMPUTER SOCIETY

ACKNOWLEDGEMENTS

This work was supported in part and used resources of the Argonne Leadership Computing Facility (ALCF), which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

REFERENCES

  1. R. Stevens, V. Taylor, J. Nichols, A. Maccabe, K. Yelick, and D. Brown, “AI for Science,” 2020. [Online]. Available: https://www.osti.gov/biblio/1604756
  2. Exascale deep learning and simulation enabled precision medicine for cancer. 2019. [Online]. Available: https://candle.cels.anl.gov/
  3. Neutrino and cosmic tagging with UNet. 2020. [Online]. Available: https://github.com/coreyjadams/CosmicTagger/
  4. Deep learning at scale for multimessenger astrophysics. 2019. [Online]. Available: https://www.alcf.anl.gov/science/projects/deep-learning-scale-multimessengerastrophysics-through-ncsa-argonne-collaboration/
  5. J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. 2019 Conf. North American Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Volume 1, 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423
  6. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervention, 2015, pp. 234–241.

MURALI EMANI is currently a computer scientist with the Datascience Group, Argonne Leadership Computing Facility, Argonne, IL, USA. He received the Ph.D. degree in informatics from the University of Edinburgh, U.K. Contact him at memani@anl.gov.

VENKATRAM VISHWANATH is currently a computer scientist and head of the Datascience Group, Argonne Leadership Computing Facility, Argonne, IL, USA. He received the Ph.D. degree in computer science from the University of Illinois, Chicago, IL, USA. Contact him at venkat@anl.gov.

COREY ADAMS is currently a computer scientist with the Datascience group, Argonne Leadership Computing Facility. He received the Ph.D. degree from Yale University, New Haven, CT, USA. Contact him at corey.adams@anl.gov.

MICHAEL E. PAPKA is a senior scientist and division director of the Argonne Leadership Computing Facility and the PRSA Professor of Computer Science at Northern Illinois University. He received a Ph.D. degree in computer science from the University of Chicago. Contact him at papka@anl.gov.

RICK STEVENS is the Associate Laboratory Director for Computing, Environment and Life Sciences directorate at Argonne National Laboratory and Professor of Computer Science at the University of Chicago. Contact him at stevens@anl.gov.

LAURA FLORESCU is currently a principal engineer with SambaNova Systems, Palo Alto, CA, USA. She received the Ph.D. degree in computer science from New York University, New York, NY, USA. Contact her at laura.florescu@sambanova.ai.

SUMTI JAIRATH is currently a chief architect with SambaNova Systems, Palo Alto, CA, USA. Contact him at sumti.jairath@sambanova.ai.

WILLIAM LIU is currently a software engineer with SambaNova Systems, Palo Alto, CA, USA. He received the bachelor’s degree in cognitive science from Carnegie Mellon University, Pittsburgh, PA, USA. Contact him at william.liu@sambanova.ai.

TEJAS NAMA is currently a senior machine learning engineer with SambaNova Systems, Palo Alto, CA, USA. He received the master’s degree in computational data science from Carnegie Mellon University, Pittsburgh, PA, USA. Contact him at tejas.nama@sambanova.ai.

ARVIND SUJEETH is currently the Senior Director of Software Engineering with SambaNova Systems, Palo Alto, CA, USA. He received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA. Contact him at arvind.sujeeth@sambanova.ai

Start Spreading the News… New York, New York!

What a beautiful view from Times Square today. Thanks for the shout out Nasdaq!

Image of Times Square with SambaNova announcement

New Capabilities Lead to a Model Quality Breakthrough

A Partnership with Argonne National Laboratory

Using the capabilities of SambaNova's DataScale system, researchers at the U.S. Department of Energy's Argonne National Laboratory and SambaNova have together advanced state-of-the-art accuracy for an important neutrino physics image segmentation problem (see image below). The SambaNova DataScale system was recently deployed as part of the Argonne Leadership Computing Facility's (ALCF) AI-Testbed – an infrastructure of next-generation AI-accelerator machines to help evaluate the usability and performance of machine learning-based high-performance computing applications. While Argonne researchers previously trained their model on graphics processing unit (GPU) based platforms, they were fundamentally restricted by the image size that those platforms could train on. In contrast, the reconfigurable dataflow architecture of SambaNova's DataScale system seamlessly enables new capabilities for training on massive image sizes. In partnership with Argonne, we are using this capability to advance model quality on many important and challenging image processing problems.

In this blog post, we detail how the SambaNova DataScale system enabled Argonne to improve their model quality for the task of tagging cosmic pixels. While this blog post details a case study on a specific (neutrino physics) image processing problem, the techniques we use are generalizable to any convolutional neural network (CNN) on a SambaNova DataScale system. With high-resolution cameras and datasets becoming increasingly common, this removal of legacy barriers to high-resolution image processing is crucial.

Beyond State of the Art – Cosmic Tagger

“Cosmic Background Removal with Deep Neural Networks in SBND” introduces a modified UResNet architecture optimized for removing cosmic backgrounds from liquid argon time projection chamber (LArTPC) images. It is a classic image segmentation task: classify each input pixel into one of three classes – Cosmic, Muon, or Background. The original input images are 1280 pixels tall and 2048 pixels wide with 3 channels. Because the images to segment are so large, processing even a single batch runs out of memory on the GPU (V100).
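A quick back-of-envelope calculation shows why memory becomes the bottleneck (FP32 storage is an assumption here; activations and gradients at every layer multiply this figure many times over):

```python
# One full-resolution input sample: 3 channels x 1280 x 2048 pixels.
c, h, w = 3, 1280, 2048
bytes_per_value = 4                        # assuming FP32
mib = c * h * w * bytes_per_value / 2**20
print(f"{mib:.0f} MiB per input sample")   # ~30 MiB before a single activation is computed
```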

To overcome this issue when training on GPUs, the authors had previously downsampled their input images to 50% resolution and trained the model with inputs containing 3x640x1024 pixels. However, this results in a loss of information that is crucial to this problem and to many other sensitive domains, such as medical imaging and astronomy (see the accuracy drop in the figure).

Image showing Multi-Plane UResNet

In contrast, the reconfigurable dataflow architecture of SambaNova's DataScale system does not have these problems. The Argonne and SambaNova team is able to seamlessly train CNNs with images beyond 50k x 50k resolution. We use the same model, configuration, and hyperparameters, except that we are able to use images at their original sizes without downsampling. To compare the performance of different models, we use the Mean Intersection over Union (MIoU) of only non-background pixels as the evaluation metric. From the results shown below, using larger images clearly outperforms the existing state-of-the-art model by close to 6% MIoU.
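For readers unfamiliar with the metric, here is a minimal NumPy implementation of MIoU restricted to non-background classes, matching the evaluation described above (the class indices are assumptions for illustration):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, classes=(1, 2)) -> float:
    """Mean Intersection-over-Union over the given non-background classes.

    pred and target are integer class maps of identical shape; class 0 is
    treated as background and excluded, as in the evaluation described above.
    """
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: a 2x2 prediction against ground truth.
print(mean_iou(np.array([[1, 2], [0, 2]]), np.array([[1, 2], [0, 1]])))  # 0.5
```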

Even though the model on DataScale's Reconfigurable DataFlow Unit (RDU) is trained at a lower precision (bfloat16) than the GPU's FP32, we are able to ensure stable convergence and achieve better results. Certain loss functions, such as focal loss, degrade when using a lower batch size per replica. While GPUs (A100) can fit only one image per replica at full image sizes, RDUs let you train with up to 32 samples per replica and further improve accuracy.

Graphs showing Cosmic Testing results

Conclusion

With the advancement of technology, we now have access to datasets with images consisting of billions of pixels. This introduces new challenges in using deep learning and computer vision to process and utilize such abundant information. With minimal changes to the original code, SambaNova's DataScale system provides a way to train deep CNN models on gigapixel images efficiently. Other computer vision tasks, such as classification and image super-resolution, would benefit greatly from the ability to train models without losing any information. This work is only a sneak peek at what is possible with high-resolution image training.

Acknowledgements:

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Introducing SambaNova Systems DataScale: A New Era of Computing

With each generation, we have pushed the limits of innovation and discovery to make the incredible happen with technology. Some technology is truly revolutionary and paves the way for transformations we have yet to imagine.

On a global scale, the innovations materializing today from machine learning and AI will change the way we work and live forever. And within the tech sector, the need to pioneer these innovations has forced the entire industry to re-examine how we design next-generation infrastructure for complex machine learning workloads that require more than just transactional processing and raw performance.

When SambaNova Systems first set out, we had one goal: To make AI accessible to organizations of all sizes across all industries. And we are delivering on that promise. Today, with our technology, we are on the brink of one of the biggest transformations in computing history since the advent of the Internet.

We are proud and excited to introduce the world’s next-generation computing infrastructure—SambaNova Systems DataScale™. DataScale is ushering in a new era of computing and is giving us a clearer view of what the future of computing will look like.

What makes DataScale so special?

Unlike conventional hardware architectures, which present a fixed set of instructions for developers to piece together, software-defined hardware enables developers to think from a software-first perspective. This empowerment results in orders-of-magnitude improvements in efficiency and unlocks greater compute power to meet the rigorous demands of AI application development.

The SambaNova Systems Reconfigurable Dataflow Architecture™ (RDA) is the answer to the industry’s needs for a software-first approach and is the blueprint for DataScale. RDA is a spatially reconfigurable architecture designed to efficiently execute a broad range of AI applications and models of all sizes and forms.

While other AI infrastructure companies are focusing on just one technology component—the chip—DataScale is a complete, integrated software and hardware systems platform optimized for dataflow from algorithms to silicon.

With DataScale, rather than being constrained by the limitations of traditional hardware infrastructure, developers can focus on discovering new opportunities to innovate and accomplish what they once thought impossible.

Powered by SambaNova’s Reconfigurable Dataflow Unit™ (RDU), a next-generation processor built from the ground up to offer native dataflow processing, DataScale helps to future-proof your data center.

The reconfigurable and flexible characteristics of RDUs, and the high-speed fabric that connects them, mean maximum system throughput and performance no matter what is thrown at them. Most importantly, they mean a stack that can be optimized to meet the changing AI demands of the near future.

Our incredible customers are already doing great work with DataScale. Lawrence Livermore National Laboratory, for example, is coupling DataScale into its Corona supercomputing system, which is being used for COVID-19 drug discovery. And Los Alamos National Laboratory uses DataScale in modeling extremely complex quantum chemistry.

Achieving world record-breaking performance metrics

DataScale achieves record-breaking performance metrics, from the system level to multi-rack scale, when compared with the latest, most advanced platforms in four key areas: performance, accuracy, scale, and ease of use. I invite you to read the press release for details.

With our launch, we are also introducing an industry first: a subscription-based offering called Dataflow-as-a-Service (DaaS). DaaS is available in three monthly subscription types customized for natural language processing, high-res computer vision, or recommender systems. They are accessible in both cost and configuration, and deliver on SambaNova's promise to make AI more accessible to organizations of all sizes across all industries.

In addition, we are also granting easy, powerful cloud access to both academics and researchers. SambaNova AI Cloud Platform for universities and research laboratories gives users access to all the power of DataScale without the physical hardware. We’re accepting research proposals now.

At SambaNova, we are proud of what our customers are accomplishing with DataScale. And we look forward to what we will continue to accomplish together as organizations all over the world are empowered by a new, better way of computing.

Accelerating the Modern Machine Learning Workhorse: Recommendation Inference

Updated January 31, 2021

Inference for recommender systems is perhaps the single most widespread machine learning workload in the world. Here, we demonstrate that using the SambaNova DataScale system, we can perform recommendation inference over 20x faster than the leading GPU on an industry-standard benchmark model. Keep watching this space: our software is rapidly evolving to deliver continuous improvements, and we will keep you updated.

The impact of this is massive from both technology and business standpoints. According to Facebook, 79% of AI inference cycles in their production data centers are devoted to recommendation (source). These engines serve as the primary drivers for user engagement and profit across numerous other Fortune 100 companies, with 35% of Amazon purchases and 75% of watched Netflix shows coming from recommendations (source).

Record-breaking Recommendation Speed

To measure the performance of the SambaNova DataScale system, we use the recommendation model from MLPerf, the authoritative benchmark for machine learning researchers and practitioners. Its task for measuring recommendation performance uses the DLRM model on the Terabyte Clickthrough dataset. Since Nvidia has not reported A100 numbers, we measure an Nvidia-optimized version of this model (source) running on a single A100, deployed using a Triton Server (version 20.06) with FP16 precision. We run this at a variety of batch sizes, as this simulates a realistic deployed inference scenario. For V100 numbers, we use the FP16 performance results reported by Nvidia (source).

Low batch sizes are often needed in deployment scenarios, as queries are streamed in real time and latency is critical. At these low batch sizes, the benefit of the dataflow architecture is clear: the SambaNova DataScale system commands 20x faster performance than a single A100 at batch size 1.
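To reproduce the shape of such a comparison on your own models, online (batch-1) latency is typically measured with a warm-up phase followed by timed single-query calls. The sketch below is a generic framework-level approach, not the MLPerf harness or SambaNova tooling:

```python
import time
import torch

@torch.no_grad()
def median_latency_ms(model: torch.nn.Module, example: torch.Tensor, iters: int = 100) -> float:
    """Median wall-clock latency of single-query (batch size 1) inference."""
    model.eval()
    for _ in range(10):                     # warm-up runs, excluded from timing
        model(example)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(example)                      # on a GPU, also call torch.cuda.synchronize() here
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]
```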

While online inference at batch size 1 is a common use case in deployed systems, customers also often want to batch some of their data to improve the overall throughput of the system. To demonstrate the benefits of the SambaNova DataScale system, we also show the same DLRM benchmark at a batch size of 4k. At this higher batch size, DataScale achieves over 2x faster performance than an A100 in both throughput and latency.

The Combined Solution: Training and Inference Together
While many of these measurements are geared towards MLPerf's inference task, the DataScale system excels at both inference and training. By retraining the same DLRM model from scratch, and exploring variations that aren't possible at all on GPU hardware, the RDU handily exceeds the state of the art. Check out this article to find out more.

Beyond the Benchmark: Recommendation Models in Production
The MLPerf DLRM benchmark simulates a realistic recommendation task, but it cannot capture the scale of a real deployed workload. In an analysis of these recommendation systems, Facebook writes that “production-scale recommendation models have orders of magnitude more embeddings” compared to benchmarks (source). As these models grow, CPUs and GPUs start to falter. Yet the DataScale system has no problem handling these larger compute and memory requirements, and continues to be a long-term solution that’s built to scale.

Premier Research Labs Push AI to Fight Disease—and Improve Lives

U.S. Department of Energy Accelerates AI With SambaNova Systems

Scientific researchers are exploring ways to combine artificial intelligence (AI) and machine learning (ML) with complex scientific workloads to gain better performance and efficiency. To accelerate this work, the United States Department of Energy's National Nuclear Security Administration (DOE/NNSA), Lawrence Livermore National Laboratory (LLNL), and Los Alamos National Laboratory (LANL) announced a strategic partnership. The cornerstone of this partnership agreement is multiple installations of SambaNova Systems DataScale™.

SambaNova DataScale is a complete, integrated software and hardware systems platform optimized for dataflow from algorithms to silicon. LLNL is coupling DataScale into its Corona supercomputing system. Initial focus has been on using DataScale for National Ignition Facility applications. Corona is primarily being used for COVID-19 drug discovery and LLNL plans to apply DataScale to this workload.

Improved Performance, Accuracy, and Productivity With SambaNova DataScale

SambaNova DataScale is improving overall performance, accuracy, and productivity for these demanding research institutions.

It’s no surprise, as SambaNova DataScale is designed for both efficient deep-learning inference and training calculations. It features the SambaFlow™ software stack and the world’s first Reconfigurable Dataflow Unit, the Cardinal SN10™ RDU. The system contains eight RDUs—each one capable of supporting multiple simultaneous jobs or working seamlessly together to execute large-scale models.

 

Image of SambaNova DataScale

 

Ian Karlin is the principal HPC strategist at LLNL. After bringing SambaNova DataScale on-site in September, he reported that early tests have shown DataScale performance to be 5X or better when normalized against GPUs.

Karlin says DataScale was the right choice for LLNL for several reasons; chief among them was the integrated software and hardware systems and the ability to do both training and inference on one platform.

Computer scientist and LLNL Informatics Group Leader Brian Van Essen explains, “We selected SambaNova for this procurement because one of the key features they have is the ability to do training and inference on small batch sizes. Inference at small scales is key; training on small batches is important for retraining and fine-tuning models. That’s something we’ll be doing.” He also cites “maturity of the programming model and the team’s expertise with the software stack” as a crucial aspect of LLNL’s two-year engagement with SambaNova.

Over at LANL, the first application targeted for acceleration with DataScale is modeling quantum chemistry with density-functional theory (DFT)-level accuracy. LANL has developed a workflow for building machine learning models of interatomic energies and forces to enable molecular dynamics (MD) simulations with high accuracy in a computationally efficient manner. These ML models are very faithful to DFT reference calculations and enable reactive chemistry from first principles in support of materials science, chemistry, molecular biology, and drug design.

As reported, these calculations currently run on GPU hardware and are showing further promise of acceleration with the SambaNova DataScale system. An ongoing collaboration between SambaNova Systems and LANL scientists suggests the possibility of up to 5X speedup compared to the existing GPU implementation.

Exploring Breakthrough Advances

LLNL researchers are using SambaNova DataScale to continue exploring the combination of high-performance computing (HPC) and AI, an innovative effort LLNL calls “cognitive simulation” (CogSim). Researchers said the two systems working in tandem will enable more streamlined computation and allow them to move applications into this new computing model.

SambaNova DataScale’s ability to run dozens of inference models at once while performing scientific calculations on the Corona system will aid in their quest to use machine learning to accelerate key applications.

According to LLNL researchers, SambaNova DataScale will be used in the small molecule drug design work being applied to COVID-19 at LLNL, as well as to cancer through the ATOM (Accelerating Therapeutics for Opportunities in Medicine) project. Recent work has produced a machine learning model for improving COVID-19 drug design that relies on small-batch training, which matters because this type of model converges best at small batches. SambaNova DataScale has the capability for efficient small-batch training—a key differentiating feature that sets it apart from GPUs. This work will be integrated into drug design loops that generate new potential compounds, which are then evaluated for safety and efficacy using HPC simulations on the Corona system.

LLNL’s COVID-19 machine learning model is a finalist for the Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research, which will be announced on Nov. 19.

The AI research taking place at labs such as LLNL and LANL is not unique to the public sector. Using similar techniques, forward-thinking enterprises are advancing their own AI initiatives and making significant progress.

Here at SambaNova Systems, we’re excited about the collaboration with DOE/NNSA, LLNL, and LANL. “SambaNova Systems is providing the platform for innovation to enable visionaries to achieve breakthrough advancements in their domains,” says Rodrigo Liang, our co-founder and CEO.

Our partnership with the U.S. Department of Energy is just one example of how we are enabling this.

Repeatable Machine Learning by Design With SambaNova Systems DataScale

If you run the same machine learning application twice with the same inputs, initializations, and random seeds, you often will not get the same result. While randomness or stochasticity in machine learning applications is a desired quality, this non-repeatability is not. For example, in mission-critical applications such as autonomous driving, this non-repeatable behavior can have disastrous implications for model explainability, especially if an audit is required to analyze certain important decisions post hoc. As a result, there is, unsurprisingly, a repeatability crisis occurring in many machine learning domains. At SambaNova Systems, we believe that your architecture should not be subtly adding to this problem in the pursuit of peak performance; hardware architectures should never trade off repeatable computation for performance.

In the following sections, we define repeatability and why the problem arises. After that, we dive into single- and multi-socket cases that highlight the repeatability problem. In each case we show how the SambaNova Systems DataScale SN10-8R with the world’s first Reconfigurable DataFlow Unit (RDU) provides repeatability as an artifact of its design.

Repeatability: A repeatable machine learning program is one where the exact same behavior is observed when user-controlled variables are fixed (same random seeds, same initializations, same inputs, same machine). Think of this as test-retest reliability. This is different from stochasticity or randomness—which are important features that an RDU also enables in a repeatable manner for machine learning applications.

The Problem: Floating-point arithmetic is not associative; that is, the order in which you add floating-point numbers can change the final output.

(A + B) + C ≠ A + (B + C)

This fundamental property of floating-point arithmetic often leads to non-repeatability, especially when complex parallelization primitives or out-of-order execution are inherent in a hardware architecture. Although there are software solutions to ensure repeatability on traditional hardware architectures, users must comb through large amounts of documentation to figure out which operations might cause the problem, and/or incur a slight performance degradation to ensure repeatability. Even worse, this problem manifests in different ways depending on whether one is running single- or multi-socket machine learning applications. In the next sections, we highlight the problem and the RDU's solution in each case.
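The non-associativity is easy to see in plain Python (IEEE-754 doubles); the classic example below uses identical operands and differs only in grouping:

```python
x = (0.1 + 0.2) + 0.3   # 0.6000000000000001
y = 0.1 + (0.2 + 0.3)   # 0.6
print(x == y)           # False: same operands, different summation order
```

A parallel reduction that accumulates partial sums in a nondeterministic order performs exactly this kind of regrouping on every run, which is why the hardware's scheduling choices leak into the numerical result.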

Dataflow is the Key to Single Socket Repeatability on an RDU

Due to complex parallelization primitives and out-of-order execution, high-performance kernels are often non-repeatable on GPU-based architectures. On an RDU, our computation is pure dataflow and, as such, our kernels are always repeatable. To highlight the repeatability of an RDU-based architecture, we compare the behavior of both an RDU and a GPU on a popular machine learning kernel seen in many customer applications. At the core of this kernel is the index add operator, an essential, fundamental indexing operation used in many popular models. In Figure 1, we show the repeatability measured when running these popular tensor operations 5 times with the same input on both a GPU and an RDU. To measure repeatability, we use the Frobenius norm (a popular tensor metric) of the input gradients as a proxy and compare this norm across multiple runs. As Fig. 1 shows, on the RDU's dataflow architecture, computational parallelism is always handled in a repeatable way. There is zero difference among the runs, and repeatable results are achieved during every iteration. This is one of many examples highlighting that the dataflow architecture employed by an RDU is repeatable, while traditional hardware architectures are not necessarily so.


Fig. 1. Repeatability in single-socket computation between a V100 GPU running with CUDA 10.1 and PyTorch 1.6.0 and an RDU with the latest release of SambaFlow.
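This experiment is easy to approximate in PyTorch: index_add on CUDA tensors is documented as a nondeterministic operation because its scatter-add relies on atomics. The sketch below fingerprints the Frobenius norm of the forward result; the post measures input-gradient norms instead, but the underlying mechanism is the same:

```python
import torch

def index_add_fingerprint(device: str = "cuda") -> float:
    torch.manual_seed(0)                                      # identical inputs on every call
    base = torch.zeros(1000, 64, device=device)
    idx = torch.randint(0, 1000, (100_000,), device=device)   # many repeated indices
    src = torch.randn(100_000, 64, device=device)
    out = base.index_add(0, idx, src)    # scatter-add: accumulation order is unspecified on GPU
    return out.norm().item()             # Frobenius norm as a scalar fingerprint of the result

runs = [index_add_fingerprint() for _ in range(5)]
print(runs)  # on a GPU these may differ in low-order bits; a repeatable system prints 5 identical values
```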

 

RDU Model Parallelism for Repeatable Multiple Socket Training

While data parallelism remains the most popular form of multi-socket parallelism for machine learning training, model parallelism is emerging as a popular new form of parallelism for multi-socket training. Among the many inherent benefits of model parallelism, one rarely discussed is that it is naturally more repeatable than data parallelism. This is because data parallelism must synchronize and add gradients across multiple sockets, which can create non-repeatable results (as is true in popular frameworks such as Horovod). Like traditional architectures, RDUs seamlessly support both data and model parallelism. Unlike in traditional architectures, however, model parallelism has always been a core tenet of the RDU's design.

In Fig. 2, we show what can happen due to this non-repeatability of data-parallel training. We plot the embeddings (the learned meanings) of the top 200 words in a BERT language model after fine-tuning the model using 8-socket training. Under the exact same conditions (same data, same machine, etc.), the meanings of these words noticeably change (or drift) across runs in the data-parallel case, a very undesirable behavior for a production model. In contrast, RDU model-parallel training retains fidelity of each word's meaning across runs, which is what one would expect and desire. This is not only desirable behavior for our customers, but necessary behavior for those running mission-critical workloads.

Data Parallel vs. Model Parallel Repeatability


Fig. 2. Comparison of word embedding training repeatability (semantic drift) for (a) 8-socket GPU and (b) 8-socket RDU

 

The Solution

Based on our own first-hand experience working in collaboration with partners in our research laboratory, we made repeatability a core feature when designing the SambaNova Systems Reconfigurable Dataflow Unit (RDU) architecture. Our belief is that repeatable, performant hardware is paramount for organizations focused on fast-paced innovation. Therefore, when using RDUs, users do not need to worry about repeatability issues: not only are RDUs more performant than many traditional architectures, but repeatability is a core tenet of our design, achieved without compromise.

Surpassing State-of-the-Art Accuracy in Recommendation Models

Recommender systems are a ubiquitous part of many common and broadly used internet services. They are utilized in retail and e-commerce applications to cross-sell and up-sell products and services. Online consumer services for ridesharing, peer reviews, and banking rely heavily on recommendation models to deliver fast and efficient customer experiences. Everyday examples of recommender systems offering users hit-or-miss advice on social media, news sites, and elsewhere are abundant. That is because providing richer, more meaningful recommendations requires incorporating many more attributes into a recommendation system than just a user's browsing or purchase history. This seems simple and intuitive enough; however, real-world implementations built on legacy technology components can diminish efforts to achieve state-of-the-art accuracy.

Recommendation Tasks Place Huge Demands on Both Memory and Computation
The backbone that enables recommendation models to encode such massive volumes of data is the embedding. Embedding tables are large numerical tables that contain encodings of every feature in the data – every user, product, region, etc. It is well known that larger embedding tables lead to better model quality by making models more expressive and accurate. To fully capture all of the information in their data, SambaNova's industry partners routinely use embeddings that are hundreds of gigabytes in size—often terabytes!
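A rough size estimate shows how quickly these tables grow; the row count and dimensionality below are illustrative assumptions, not any partner's actual configuration:

```python
# Size of one embedding table: rows x dimension x bytes per value.
rows, dim, bytes_per_value = 500_000_000, 128, 4   # 500M ids, 128-dim vectors, FP32
gib = rows * dim * bytes_per_value / 2**30
print(f"{gib:.0f} GiB")  # ~238 GiB for a single table, before optimizer state
```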

These embeddings are attached to deep neural networks which perform a large number of calculations in order to generate the final recommendation result.
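The structure itself is simple; the difficulty is the scale. Below is a toy PyTorch sketch of the embedding-plus-MLP pattern that DLRM-style models follow (all names and sizes are hypothetical and tiny):

```python
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    """Toy embedding-plus-MLP recommender, loosely in the spirit of DLRM."""
    def __init__(self, n_users: int, n_items: int, dim: int = 16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)   # one learned row per user id
        self.item_emb = nn.Embedding(n_items, dim)   # one learned row per item id
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # predicted click probability

model = TinyRecModel(n_users=1000, n_items=5000)
print(model(torch.tensor([3, 7]), torch.tensor([42, 9])))  # scores for two (user, item) pairs
```

In production the nn.Embedding rows run into the billions, which is exactly where memory capacity, not arithmetic, becomes the constraint.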

The Benchmark
As a demonstration, we used the SambaNova DataScale system, a complete integrated software and hardware system, to train the Deep Learning Recommendation Model (DLRM) on the Criteo Terabyte Clicklogs dataset. This is the MLPerf standard benchmark for recommendation, where the performance metric is AUC on a test set.
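For context, AUC measures how often the model ranks a clicked example above an unclicked one, so even small gains are hard-won. A quick scikit-learn illustration with toy labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                  # toy click labels
y_score = [0.1, 0.4, 0.35, 0.8]        # model scores
print(roc_auc_score(y_true, y_score))  # 0.75: 3 of the 4 positive/negative pairs ranked correctly
```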

NOTE: Despite containing ~1TB of data and ~100GB of embedding features, this dataset still does not represent a real large-scale production workload. Deployed systems are at least 5x more demanding in terms of both data and embedding sizes. But rest assured—the SambaNova Systems Reconfigurable Dataflow Unit (RDU) and the SambaNova DataScale system are built to scale and are well equipped to tackle those gigantic use cases too.

Unleashing the Power of Embeddings
It's known that increasing embedding dimensions improves recommender model accuracy at the cost of model size. Many recent studies have been devoted to sharding the model or reducing the embedding dimensions to fit in GPU memory. SambaNova Systems researchers have pioneered superior methods for solving this problem through vertical engineering of our integrated software and hardware stack. We demonstrate this by exceeding state-of-the-art accuracy on the DLRM model while significantly increasing the embedding dimensionality. In an ablation study where everything else is held constant, we find that the model's accuracy strictly increases with embedding dimensions when trained on a single SambaNova Systems RDU. Meanwhile, on a single GPU, attempts to execute the model fail catastrophically.


Fig 1: Effects of embedding dimensions on a single RDU and a single GPU

Exploring New Batch Sizes and Breaking the GPU Mold
Popular training techniques place a large focus on increasing mini-batch size to saturate GPU computation. For example, Nvidia’s demo implementation of DLRM uses batch sizes of 32768 and higher.

From a statistical standpoint, this isn't always the preferred choice. As prior studies have shown, decreasing the batch size can actually have strong benefits, helping a model avoid sharp minima so that it generalizes more effectively. When training DLRM on the SambaNova Systems RDU, we observed noticeable improvements in validation performance when decreasing the batch size.


Fig 2: Enhanced RDU performance with batch size reduction

In reality, machine learning researchers and engineers choose these giant, suboptimal batch sizes because their current infrastructure leaves them no alternative. The GPU’s kernel-oriented execution suffers significantly when batch size decreases. On the other hand, with the SambaNova Systems RDU’s Dataflow architecture and intelligent software stack, system resources can still be fully utilized and achieve strong throughput regardless of batch size.


Fig 3: Negligible throughput degradation on RDU compared to GPU with smaller batch size

A New State of the Art
By combining our findings from above, we can use the SambaNova Systems RDU to train a new variant of DLRM that achieves a validation AUC of 0.8046 on the Criteo Terabyte dataset. In comparison, the best AUC reported by NVIDIA in their MLPerf submission is 0.8027. This unique large-embedding, small-batch model would be impossible to run on a GPU, and impractical to run on a CPU.


Fig 4: RDU exceeds MLPerf and GPU thresholds when training a new DLRM variant

In addition to having a noticeably higher peak AUC, the new and improved DLRM also converges much faster.

Powering the Next Generation of Recommender Models
SambaNova Systems' robust yet performant RDU technology enables machine learning engineers to explore an entirely new world of models, unlocking results that surpass the current state of the art. When applied to business-critical recommender models, this leads to significant enhancements in business outcomes and huge boosts in revenue. In Tencent's words, “The reason we care about small amount AUC increase is that in several real-world applications we run internally, even 0.1% increase in AUC will have a 5x amplification (0.5% increase) when transferred to final CTR”.
