Accelerating the Modern Machine Learning Workhorse: Recommendation Inference
Updated January 31, 2021
Inference for recommender systems is perhaps the single most widespread machine learning workload in the world. Here, we demonstrate that the SambaNova DataScale™ system can perform recommendation inference over 20x faster than the leading GPU on an industry-standard benchmark model. Our software is evolving rapidly to deliver continuous improvements, and we will keep you updated in this space.
The impact of this is massive from both technology and business standpoints. According to Facebook, 79% of AI inference cycles in their production data centers are devoted to recommendation (source). These engines serve as the primary drivers for user engagement and profit across numerous other Fortune 100 companies, with 35% of Amazon purchases and 75% of watched Netflix shows coming from recommendations (source).
Record-breaking Recommendation Speed
To measure the performance of the SambaNova DataScale system, we use the recommendation model from the MLPerf benchmark, the authoritative benchmark for machine learning researchers and practitioners. Its task for measuring recommendation performance uses the DLRM model on the Terabyte Clickthrough dataset. Since Nvidia has not reported A100 numbers, we measure an Nvidia-optimized version of this model (source) running on a single A100, deployed using a Triton Server (version 20.06) with FP16 precision. We run the model at a variety of batch sizes to simulate a realistic deployed inference scenario. For V100 numbers, we use the FP16 performance results reported by Nvidia (source).
Low batch sizes are often needed in deployment scenarios, where queries are streamed in real time and latency is critical. At these low batch sizes, the benefit of the dataflow architecture is clear: the SambaNova DataScale system is 20x faster than a single A100 at batch size 1.
While online inference at batch size 1 is a common use case in deployed systems, customers also often want to batch some of their data to improve the overall throughput of the system. To demonstrate the benefits of the SambaNova DataScale system in this setting, we also show the same DLRM benchmark at a batch size of 4k. At this higher batch size, the DataScale system achieves over 2x better performance than an A100 in both throughput and latency.
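The trade-off between latency and throughput described above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not part of the benchmark harness. It assumes batches are processed one after another with no pipelining or overlap, so throughput is simply batch size divided by per-batch latency. The function names and the latency values are hypothetical placeholders for illustration, not measured results.

```python
def throughput_qps(batch_size: int, latency_s: float) -> float:
    """Queries per second when one batch of `batch_size` queries
    completes every `latency_s` seconds (assumes no pipelining)."""
    if batch_size <= 0 or latency_s <= 0:
        raise ValueError("batch_size and latency_s must be positive")
    return batch_size / latency_s


def speedup(baseline_latency_s: float, candidate_latency_s: float) -> float:
    """How many times faster the candidate is than the baseline
    at the same batch size."""
    if baseline_latency_s <= 0 or candidate_latency_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_latency_s / candidate_latency_s


if __name__ == "__main__":
    # Placeholder latencies in seconds, chosen only for illustration.
    # They are NOT measurements from any system discussed here.
    placeholder_latency_s = {1: 0.002, 4096: 0.02}
    for batch_size, latency_s in placeholder_latency_s.items():
        qps = throughput_qps(batch_size, latency_s)
        print(f"batch={batch_size:>5}  latency={latency_s * 1e3:.1f} ms  "
              f"throughput={qps:,.0f} queries/s")
```

Under this simple model, a larger batch raises throughput as long as latency grows more slowly than the batch size. That is why both metrics are reported at batch size 1 and at 4k.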
The Combined Solution: Training and Inference Together
While these measurements are geared towards MLPerf’s inference task, the DataScale system excels at both inference and training. By retraining the same DLRM model from scratch, and exploring variations that aren’t possible at all on GPU hardware, the RDU handily exceeds the state of the art. Check out this article to find out more.
Beyond the Benchmark: Recommendation Models in Production
The MLPerf DLRM benchmark simulates a realistic recommendation task, but it cannot capture the scale of a real deployed workload. In an analysis of these recommendation systems, Facebook writes that “production-scale recommendation models have orders of magnitude more embeddings” compared to benchmarks (source). As these models grow, CPUs and GPUs start to falter. Yet the DataScale system has no problem handling these larger compute and memory requirements, and continues to be a long-term solution that’s built to scale.
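To see why embedding count drives memory requirements, a rough back-of-the-envelope estimate helps. The sketch below assumes each embedding table is a dense matrix of `num_rows` by `embedding_dim` values stored at a fixed width (2 bytes for FP16). The function names and the table sizes are hypothetical and do not describe any particular production model.

```python
from typing import Iterable


def embedding_table_bytes(num_rows: int, embedding_dim: int,
                          bytes_per_value: int = 2) -> int:
    """Memory for one dense embedding table (default 2 bytes = FP16)."""
    return num_rows * embedding_dim * bytes_per_value


def total_embedding_gib(rows_per_table: Iterable[int], embedding_dim: int,
                        bytes_per_value: int = 2) -> float:
    """Total memory in GiB across all embedding tables of a model."""
    total = sum(embedding_table_bytes(rows, embedding_dim, bytes_per_value)
                for rows in rows_per_table)
    return total / 2**30


if __name__ == "__main__":
    # Hypothetical table sizes, for illustration only.
    small_model = [1_000_000] * 26
    large_model = [100_000_000] * 26  # 100x more rows per table
    for name, tables in [("small", small_model), ("large", large_model)]:
        print(f"{name}: {total_embedding_gib(tables, embedding_dim=128):.1f} GiB")
```

Because the footprint scales linearly with the number of embedding rows, a model with orders of magnitude more embeddings needs orders of magnitude more memory, which can quickly exceed the memory of a single accelerator.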