Accelerating the Modern Machine Learning Workhorse: Recommendation Inference
Inference for recommender systems is perhaps the single most widespread machine learning workload in the world. Here, we demonstrate that using the SambaNova DataScale system, we can perform recommendation inference 6.9x faster than the leading GPU on an industry-standard benchmark model.
The impact of this is massive from both technology and business standpoints. According to Facebook, 79% of AI inference cycles in their production data centers are devoted to recommendation (source). These engines serve as the primary drivers for user engagement and profit across numerous other Fortune 100 companies, with 35% of Amazon purchases and 75% of watched Netflix shows coming from recommendations (source).
Record-breaking Recommendation Speed at Batch Size 1
To measure the performance of the SambaNova DataScale system, we use the recommendation model from MLPerf, the authoritative benchmark for machine learning researchers and practitioners. The MLPerf recommendation task uses the DLRM model on the Terabyte Clickthrough dataset. We measure an NVIDIA-optimized version of this model (source) running on a single A100 deployed using Triton Server (version 20.06). We run this at batch size 1 because it simulates a realistic deployed inference scenario, where queries are streamed in real time and latency is critical. Compared with a single A100 GPU, the DataScale system achieves 6.9x faster performance for both throughput and latency.
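For readers who want to reproduce this style of measurement on their own hardware, the following is a minimal sketch of timing latency and throughput at a fixed batch size. It is not the harness used for the numbers above: `benchmark` and `dummy_predict` are hypothetical names, and `dummy_predict` is a stand-in for a real DLRM forward pass. It assumes queries are processed one after another, in which case throughput is simply batch size divided by latency, so a speedup in one shows up as the same speedup in the other.

```python
import time
from statistics import mean


def benchmark(predict, batch, n_iters=100, warmup=10):
    """Return (mean latency in seconds, throughput in samples per second)."""
    # Warm-up calls are excluded so one-time setup costs do not skew timing.
    for _ in range(warmup):
        predict(batch)

    latencies = []
    for _ in range(n_iters):
        start = time.perf_counter()
        predict(batch)
        latencies.append(time.perf_counter() - start)

    mean_latency = mean(latencies)
    # With sequential queries, throughput is batch size over latency.
    throughput = len(batch) / mean_latency
    return mean_latency, throughput


def dummy_predict(batch):
    # Hypothetical placeholder for a recommendation model's forward pass.
    return [sum(sample) for sample in batch]


if __name__ == "__main__":
    # Batch size 1: a single query with 10,000 made-up feature values.
    single_query = [[1.0] * 10_000]
    latency, throughput = benchmark(dummy_predict, single_query)
    print(f"latency: {latency * 1e6:.1f} us, throughput: {throughput:.0f} samples/s")
```

Swapping `dummy_predict` for a call into a served model, and `single_query` for a batch of 4,096 samples, would give the batched variant of the same measurement.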
Record-breaking Recommendation Speed at Batch Size 4k
While online inference at batch size 1 is a common use case in deployed systems, customers also sometimes want to batch some of their data to improve the overall throughput of the system. To demonstrate the benefits of the SambaNova DataScale system regardless of batch size, we also measured the same DLRM benchmark at a batch size of 4k. At this higher batch size, the DataScale system achieves 2.7x faster performance than an A100 for both throughput and latency.
The Combined Solution: Training and Inference Together
While many of these measurements are geared towards MLPerf’s inference task, the DataScale system excels at both inference and training. By retraining the same DLRM model from scratch, and exploring variations which aren’t possible at all on GPU hardware, the RDU handily exceeds the state of the art. Check out this article to find out more.
Beyond the Benchmark: Recommendation Models in Production
The MLPerf DLRM benchmark simulates a realistic recommendation task, but it cannot capture the scale of a real deployed workload. In an analysis of these recommendation systems, Facebook writes that “production-scale recommendation models have orders of magnitude more embeddings” compared to benchmarks (source). As these models grow, CPUs and GPUs start to falter. Yet the DataScale system has no problem handling these larger compute and memory requirements, and continues to be a long-term solution that’s built to scale.
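To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of how embedding table size grows with the number of embeddings. All row counts, the embedding dimension, and the 100x scale factor are made-up illustrative values, not figures from MLPerf, Facebook, or SambaNova; the function names are hypothetical as well.

```python
def embedding_table_bytes(num_rows, embedding_dim, bytes_per_value=4):
    """Memory for one embedding table: rows x dimension x bytes per value."""
    return num_rows * embedding_dim * bytes_per_value


def total_embedding_bytes(rows_per_table, embedding_dim, bytes_per_value=4):
    """Sum the memory of every embedding table in a model."""
    return sum(
        embedding_table_bytes(rows, embedding_dim, bytes_per_value)
        for rows in rows_per_table
    )


if __name__ == "__main__":
    # Hypothetical row counts for three categorical features (assumption).
    benchmark_like = [10_000_000, 1_000_000, 40_000_000]
    # "Orders of magnitude more embeddings": scaled 100x purely for illustration.
    production_like = [rows * 100 for rows in benchmark_like]

    for name, tables in [("benchmark-like", benchmark_like),
                         ("production-like", production_like)]:
        gigabytes = total_embedding_bytes(tables, embedding_dim=128) / 1e9
        print(f"{name}: {gigabytes:.1f} GB of fp32 embeddings")
```

Because the footprint is linear in the number of rows, a model with 100x more embeddings needs 100x more memory for its tables alone, which is why capacity, and not only compute, becomes the limiting factor as these models grow.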