Deploying a Large Language Model (LLM) is not necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing, at a high level, is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.
In the past we’ve discussed load testing traditional ML models using open source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles on a per-request basis. While this is effective with more traditional APIs and ML models, it doesn’t capture the full story for LLMs.
LLMs traditionally have a much lower RPS and higher latency than traditional ML models due to their size and larger compute requirements. In general, the RPS metric does not really provide the most accurate picture either, as requests can vary drastically depending on the input to the LLM. For instance, one query might ask to summarize a large chunk of text while another might require only a one-word response.
This is why tokens are seen as a much more accurate representation of an LLM’s performance. At a high level, a token is a chunk of text; whenever an LLM processes your input, it “tokenizes” the input. What exactly counts as a token differs depending on the LLM you are using, but you can think of it, for instance, as a word, a sequence of words, or a group of characters.
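As a quick illustration of the concept, here is a minimal sketch using the open source tiktoken library (an assumption purely for demonstration; every LLM ships its own tokenizer, so the exact boundaries and counts will differ):
import tiktoken

# Example tokenizer only -- the model you load test will use its own tokenizer
encoding = tiktoken.get_encoding("cl100k_base")
text = "Load testing LLMs is all about tokens."
token_ids = encoding.encode(text)

print(f"Number of tokens: {len(token_ids)}")
print([encoding.decode([t]) for t in token_ids])  # each token rendered back as text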

What we’ll do in this article is explore how we can generate token-based metrics so we can understand how your LLM is performing from a serving/deployment perspective. After this article you’ll have an idea of how you can set up a load-testing tool specifically to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.
Let’s get hands on! If you are more of a video-based learner, feel free to watch my corresponding YouTube video down below:
NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock please refer to my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments refer to the video here.
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Table of Contents
- LLM-Specific Metrics
- LLMPerf Intro
- Applying LLMPerf to Amazon Bedrock
- Additional Resources & Conclusion
LLM-Specific Metrics
As we briefly discussed in the introduction in regard to LLM hosting, token-based metrics generally provide a much better representation of how your LLM responds to different payload sizes or types of queries (summarization vs QnA).
Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:
- Time to First Token: This is the duration it takes for the first token to be generated. This is especially helpful when streaming. For instance, when using ChatGPT we start processing information as soon as the first piece of text (token) appears.
- Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of it as a more granular alternative to the requests per second we traditionally track.
These are the major metrics that we’ll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters that also influence these metrics include the expected input and output token size. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks; a quick sketch of how the two main metrics are measured follows below.
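To make these definitions concrete, here is a minimal, provider-agnostic sketch of how both metrics can be measured around a streaming response; the fake_stream iterator below is just a stand-in for whatever streaming client your model provider exposes:
import time

def measure_token_metrics(stream):
    # Record time to first token (TTFT) and overall output tokens per second
    # for any iterator that yields generated tokens.
    start = time.perf_counter()
    time_to_first_token = None
    num_tokens = 0
    for _ in stream:
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start
        num_tokens += 1
    total_time = time.perf_counter() - start
    tokens_per_second = num_tokens / total_time if total_time > 0 else 0.0
    return {"ttft_s": time_to_first_token, "output_tokens_per_s": tokens_per_second}

# Stand-in for a real streaming response
fake_stream = iter(["Roger", " Federer", " is", " a", " tennis", " player", "."])
print(measure_token_metrics(fake_stream))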
Now let’s take a look at a tool that allows us to toggle these parameters and display the relevant metrics we need.
LLMPerf Intro
LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time, production-level traffic.
Note that any load-testing tool is only going to be able to generate your expected amount of traffic if the client machine it runs on has enough compute power to match your expected load. For instance, as you scale the concurrency or throughput expected for your model, you would also want to scale the client machine(s) where you are running your load test.
Now, specifically within LLMPerf, there are a few exposed parameters that are tailored for LLM load testing, as we’ve discussed:
- Model: This is the model provider and the hosted model that you’re working with. For our use-case it’ll be Amazon Bedrock and Claude 3 Sonnet specifically.
- LLM API: This is the API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, simplifying the setup process for us, especially if we want to test different models hosted on different platforms.
- Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
- Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
- Concurrent Requests: The number of concurrent requests for the load test to simulate.
- Test Duration: You can control the duration of the test; this parameter is specified in seconds.
LLMPerf exposes all of these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let’s take a look now at how we can configure this specifically for Amazon Bedrock.
Applying LLMPerf to Amazon Bedrock
Setup
For this example we’ll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel on an ml.g5.12xlarge instance. Note that you want to select an instance that has enough compute to generate the traffic load that you want to simulate. Make sure that you also have your AWS credentials set up for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
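Before configuring anything, we also need LLMPerf itself available in the notebook. A minimal setup sketch, assuming you are cloning the open source ray-project/llmperf repository and installing it in editable mode, looks like the following:
%%sh
# Clone LLMPerf and install it along with its dependencies
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .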
LiteLLM Configuration
We first configure our LLM API structure of choice, which is LiteLLM in this case. LiteLLM provides support across various model providers, and here we configure its completion API to work with Amazon Bedrock:
import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# The bedrock/ prefix tells LiteLLM to route the request to Amazon Bedrock
response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)
To work with Bedrock we configure the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part with LiteLLM is that the messages key has a consistent format across model providers.
Post-execution, we can focus on configuring LLMPerf for Bedrock specifically.
LLMPerf Bedrock Integration
To execute a load test with LLMPerf we can simply use the provided token_benchmark_ray.py script and pass in the following parameters that we talked about earlier:
- Input Tokens Mean & Standard Deviation
- Output Tokens Mean & Standard Deviation
- Max number of completed requests for the test
- Duration of the test
- Concurrent requests
In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:
%%sh
python llmperf/token_benchmark_ray.py \
--model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
--mean-input-tokens 1024 \
--stddev-input-tokens 200 \
--mean-output-tokens 1024 \
--stddev-output-tokens 200 \
--max-num-completed-requests 30 \
--num-concurrent-requests 1 \
--timeout 300 \
--llm-api litellm \
--results-dir bedrock-outputs
In this case we keep the concurrency low, but feel free to toggle this number depending on what you’re expecting in production. Our test will run for 300 seconds, and post-duration you should see an output directory with two files: one with statistics for each individual inference and one with the mean metrics across all requests for the duration of the test.
We can make this look a little neater by parsing the summary file with pandas:
import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)
with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Print summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}

print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
The final load test results will look something like the following:

As we can see, the output displays the input parameters that we configured, and then the corresponding results, with time to first token (in seconds) and throughput in terms of mean output tokens per second.
In a real-world use case you might use LLMPerf across many different model providers and run tests across these platforms. Used this way, the tool helps you holistically identify the right model and deployment stack for your use-case at scale.
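As a rough sketch of that kind of comparison, assuming you also have access to a second Bedrock model (Claude 3 Haiku is used here purely as a hypothetical example), you could loop the same script over multiple model IDs and write each run into its own results directory:
%%sh
# Hypothetical comparison run -- substitute the model IDs you actually have access to
for model in \
  bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
  bedrock/anthropic.claude-3-haiku-20240307-v1:0
do
  python llmperf/token_benchmark_ray.py \
  --model "$model" \
  --mean-input-tokens 1024 \
  --stddev-input-tokens 200 \
  --mean-output-tokens 1024 \
  --stddev-output-tokens 200 \
  --max-num-completed-requests 30 \
  --num-concurrent-requests 1 \
  --timeout 300 \
  --llm-api litellm \
  --results-dir "outputs-$(basename $model)"
done
Each results directory can then be parsed with the same pandas snippet above to compare TTFT and throughput side by side.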
Additional Resources & Conclusion
The entire code for the sample can be found in this associated GitHub repository. If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load testing sample here.
All in all, load testing and evaluation are both crucial to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we’ll cover not just the evaluation portion, but how we can create a holistic test with both components.
As always, thank you for reading and feel free to leave any feedback and connect with me on LinkedIn and X.