Inference Performance for Llama 3.1 405B in Function Calling & Agentic Workflows
Today, we launched SambaNova Cloud for developers, delivering the fastest inference on Llama 3.1 405B. If you are an AI developer looking to integrate the latest LLMs like 405B into your advanced applications, sign up for SambaNova’s free service at cloud.sambanova.ai.
Large language models (LLMs) have revolutionized the field of natural language processing (NLP) with their impressive capabilities in language understanding, generation, and reasoning. However, as these models become increasingly complex and computationally expensive, their inference performance has become a critical bottleneck in many applications. In this blog post, we will explore why inference performance is crucial for LLMs in advanced applications like function calling and agentic workflows, and we will discuss the implications of slow inference performance.
Function Calling Needs Fast Inference
Function calling is a fundamental concept in programming where a program invokes a separate block of code to perform a specific task. Some examples of this include:
- Programmatically reading a spreadsheet via API
- Executing code in a Python shell or REPL (Read-Eval-Print Loop) tool
- Running a SQL query tool
In the context of LLMs, the model returns structured output that the application uses to call external functions, which retrieve information, perform computations, or interact with the environment. This capability is essential for many applications, such as question answering, text generation, and dialogue systems.
Function calling in LLMs is only useful if the model can infer the correct function to call and the relevant arguments to pass. This requires fast and accurate inference performance, as the model needs to quickly process the input and generate the correct output. Slow inference performance can lead to delayed responses, which can be frustrating for users and limit the overall effectiveness of the system.
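To make the mechanics concrete, here is a minimal sketch of that loop in Python, assuming an OpenAI-compatible chat completions endpoint that supports the `tools` parameter. The endpoint URL, model ID, and `run_sql` helper are illustrative assumptions, not taken from the starter kit; check the SambaNova Cloud docs for the actual values.

```python
import json
from openai import OpenAI

# Assumed endpoint and key placeholder for illustration.
client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

# Describe one callable tool to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query and return the rows.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_sql(query: str) -> str:
    # Hypothetical stub: a real app would execute the query against a database.
    return json.dumps([{"region": "EMEA", "revenue": 1_200_000}])

response = client.chat.completions.create(
    model="Meta-Llama-3.1-405B-Instruct",  # assumed model ID
    messages=[{"role": "user", "content": "What was EMEA revenue last quarter?"}],
    tools=tools,
)

# The model returns structured output naming the function and its arguments;
# the application, not the model, actually executes the call.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = run_sql(**args)
```

The model never runs code itself; it only emits the function name and a JSON argument string, and every round trip through the model sits on the critical path of the user's request.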
Moreover, function calling often involves recursive or iterative processes, where the model calls multiple functions in sequence or repeatedly invokes the same function with different arguments. Consider an application that iterates 5 times, with each function call taking 500 tokens of input and producing 500 tokens of output, or 1,000 tokens per iteration in total. On SambaNova Cloud, each request completes in about 5 seconds, for an overall time of 25 seconds. Other solutions offering Llama 3.1 405B today average 28 tokens per second, which works out to over 35 seconds per iteration and an overall time of nearly 3 minutes. In such cases, slow inference performance compounds with every iteration, rendering the system unusable.
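The arithmetic behind those numbers is simple; the back-of-the-envelope sketch below reproduces it, assuming the quoted throughput applies to the full 1,000 tokens (input plus output) of each iteration.

```python
ITERATIONS = 5
TOKENS_PER_ITERATION = 500 + 500  # input + output tokens per function call

def total_seconds(tokens_per_second: float) -> float:
    """End-to-end latency if every token is processed at the given rate."""
    return ITERATIONS * TOKENS_PER_ITERATION / tokens_per_second

print(total_seconds(200))  # SambaNova Cloud at 200 tok/s -> 25.0 seconds
print(total_seconds(28))   # 28 tok/s average -> ~178.6 seconds, nearly 3 minutes
```

Because the delay is multiplied by the number of iterations, any agent that loops through tool calls feels a throughput gap far more sharply than a single-shot chat does.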
SambaNova Cloud delivers Llama 3.1 405B at up to 200 tokens per second. Try our AI Starter Kit Function Calling example with a Streamlit UI or Jupyter notebook.
Function Calling Starter Kit Streamlit UI
Agentic Workflows Need Near Real-Time Inference
Agentic workflows refer to the ability of LLMs to interact with external agents, such as humans, other models, or physical devices, to achieve a common goal. This requires the model to understand the context, reason about the situation, and generate responses in real-time. Agentic workflows are critical in applications such as customer service, autonomous vehicles, and smart homes.
In agentic workflows, inference performance is crucial, as the model needs to respond quickly to changing situations and adapt to new information. Slow inference performance can lead to delayed responses, which can have serious consequences, such as accidents in autonomous vehicles or frustrated customers in customer service.
Moreover, agentic workflows often involve multiple agents interacting with each other, which requires fast and accurate inference performance to ensure seamless communication and coordination. Slow inference performance can lead to misunderstandings, miscommunications, or even system failures.
An example of an agentic workflow using LangGraph is available in SambaNova’s GitHub.
Hierarchical Corrective RAG and Search
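To give a flavor of what such a workflow looks like in code, here is a minimal LangGraph sketch of an agent loop, not the Hierarchical Corrective RAG kit itself: a model node that either answers or requests a search, and a search node that feeds results back. The `call_model` and `web_search` bodies are hypothetical stubs for illustration; a real graph would call the LLM and a retriever.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    context: str
    answer: str

def call_model(state: AgentState) -> AgentState:
    # Stub: a real node would call the LLM with the question plus any context.
    if not state["context"]:
        return {**state, "answer": ""}  # no context yet, so request a search
    return {**state, "answer": f"Answer based on: {state['context']}"}

def web_search(state: AgentState) -> AgentState:
    # Stub: a real node would call a search tool or retriever.
    return {**state, "context": f"search results for {state['question']!r}"}

def route(state: AgentState) -> str:
    # Loop back to search until the model has produced an answer.
    return END if state["answer"] else "search"

graph = StateGraph(AgentState)
graph.add_node("model", call_model)
graph.add_node("search", web_search)
graph.set_entry_point("model")
graph.add_conditional_edges("model", route)
graph.add_edge("search", "model")
app = graph.compile()

print(app.invoke({"question": "What is corrective RAG?", "context": "", "answer": ""}))
```

Every edge in this graph is another pass through the model, which is why per-request latency dominates the responsiveness of the whole workflow.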
Implications of Slow Inference Performance
Slow inference performance in LLMs can have significant implications for function calling and agentic workflows. Some of the potential consequences include:
- Delayed responses: frustrating for users and limiting to the overall effectiveness of the system.
- System failures: particularly in agentic workflows, where multiple agents must stay coordinated.
- Accidents: in autonomous vehicles or other safety-critical applications.
- Loss of trust: users who experience repeated delays or failures stop relying on the system.
Conclusion
Inference performance is a critical aspect of large language models, particularly in function calling and agentic workflows. Overcome the challenges of slow inference performance by using SambaNova Cloud at cloud.sambanova.ai to reduce delayed responses, system failures, accidents, and loss of trust. By improving inference performance, SambaNova has unlocked the full potential of Llama 3.1 405B and enabled effective and efficient advanced applications.