Saturday, May 30, 2026

Complete observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality


Deploying large language models (LLMs) at scale on Amazon SageMaker AI Inference makes observability a critical pillar of any production machine learning (ML) strategy. Unlike conventional software that returns deterministic outputs, LLMs generate variable, free-form responses that are difficult to validate with standard metrics. LLM output quality can change over time as input distributions shift, and quality monitoring helps detect these changes early. For generative AI workloads, observability also includes the model serving infrastructure, where unpredictable token consumption, GPU memory pressure, and latency spikes make capacity planning and cost control a moving target.

A comprehensive observability approach for LLM inference must address two distinct but complementary dimensions: model serving infrastructure (quantity) and LLM quality. Quantity monitoring focuses on the operational health of inference infrastructure, tracking request throughput and resource utilization. These metrics help detect bottlenecks, right-size compute resources, and control costs. Quality monitoring focuses on the performance of the LLMs themselves, evaluating response accuracy, compliance, and consistency over time.

Most teams build LLM observability in stages. The first stage establishes visibility into core operational metrics such as latency, errors, and resource utilization. These signals confirm the reliability of inference endpoints. The next stage adds LLM quality through sampling and evaluation, which surface issues such as model drift, degradation, or unexpected behavior in generated responses.

With both dimensions in place, you can introduce thresholds and automated alerts that combine infrastructure and quality signals. Over time, the practice extends to comparative analysis across models and configurations so you can continuously tune cost, performance, and output quality. Quantity and quality metrics are interdependent: an endpoint can appear operationally healthy while producing poor or unsafe responses, or it can deliver high-quality outputs while running inefficiently on over-provisioned infrastructure. Production-grade LLM observability emerges when both dimensions are monitored, correlated, and optimized together.

This post demonstrates a comprehensive observability solution using Amazon Managed Grafana dashboards that provides a holistic view of both quality and quantity for LLMs served on Amazon SageMaker AI endpoints with inference components.

Workflow architecture

For full visibility into LLMs across the two monitoring dimensions of quantity and quality, we built a solution using three core AWS services, each chosen for a specific role in LLM observability. The following high-level data flow diagram shows the three core components: Amazon SageMaker AI endpoints with inference components, Amazon CloudWatch, and Amazon Managed Grafana.

Architecture diagram showing inference flow from Amazon SageMaker AI endpoints with multiple inference components, through Amazon CloudWatch (Logs and metric namespaces), into Amazon Managed Grafana dashboards.

Amazon SageMaker AI Inference Components serve as the model hosting layer. A single SageMaker AI endpoint can host multiple inference components, each running a different LLM (for example, gpt-oss-20b and Qwen2.5-7B-Instruct as shown in the preceding architecture). Inference components let you deploy, scale, and manage multiple models on shared infrastructure while keeping per-model isolation for traffic routing, scaling policies, and metric attribution.

Amazon CloudWatch serves as the centralized metrics store. It receives two distinct streams of data from each inference component: enhanced metrics and custom quality metrics. Enhanced metrics are published automatically by SageMaker AI when you enable them on the endpoint configuration. The metrics include instance-level, container-level, and per-GPU dimensions, giving you granular visibility into invocation counts, latency, error rates, and GPU/CPU utilization per model. Enhanced metrics are logged to the /aws/sagemaker/InferenceComponents/ namespace (for example, /aws/sagemaker/InferenceComponents/gpt-oss-20b). For details, see the Amazon SageMaker AI enhanced metrics documentation and the enhanced metrics deep-dive blog post.

Custom quality metrics capture LLM output quality, such as composite quality scores, safety scores, and evaluation latency. These are published to a separate user-configured CloudWatch namespace at /aws/sagemaker/inference-quality/, which keeps quality signals cleanly separated from operational metrics. The following table summarizes the two CloudWatch metric namespaces.

| CloudWatch metric namespace | Captures | Purpose |
| --- | --- | --- |
| /aws/sagemaker/InferenceComponents/ | Enhanced metrics: instance-level, container-level, and per-GPU dimensions | Provides granular visibility into invocation counts, latency, error rates, and GPU/CPU utilization per model |
| /aws/sagemaker/inference-quality/ | Custom quality metrics: composite quality scores, safety scores, and evaluation latency | Captures LLM output quality signals, kept cleanly separated from operational metrics |
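To make the second namespace concrete, the following is a minimal sketch of how one evaluated response might be turned into a CloudWatch `PutMetricData` request. It is not the code from the sample notebooks. The metric names, the `InferenceComponentName` dimension, the choice of units, and the idea of appending the inference component name to the namespace prefix are all assumptions made for illustration. Only the `boto3` `put_metric_data` call and its `Namespace` and `MetricData` parameters are real API surface.

```python
from datetime import datetime, timezone

# Namespace prefix taken from the post. Appending the inference component
# name is an assumption that mirrors the enhanced-metrics example.
QUALITY_NAMESPACE_PREFIX = "/aws/sagemaker/inference-quality/"


def build_quality_metric_data(component_name, scores, eval_latency_ms, timestamp=None):
    """Build the MetricData list for one evaluated response.

    scores maps a (hypothetical) metric name to a 0-100 score.
    """
    ts = timestamp or datetime.now(timezone.utc)
    # Hypothetical dimension name, used so dashboards and alert rules
    # can filter per inference component.
    dimensions = [{"Name": "InferenceComponentName", "Value": component_name}]
    metric_data = [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": ts,
            "Value": float(value),
            "Unit": "Percent",
        }
        for name, value in sorted(scores.items())
    ]
    metric_data.append(
        {
            "MetricName": "EvaluationLatency",
            "Dimensions": dimensions,
            "Timestamp": ts,
            "Value": float(eval_latency_ms),
            "Unit": "Milliseconds",
        }
    )
    return metric_data


def publish_quality_metrics(component_name, scores, eval_latency_ms):
    """Send the metrics to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the payload builder has no AWS dependency

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace=QUALITY_NAMESPACE_PREFIX + component_name,
        MetricData=build_quality_metric_data(component_name, scores, eval_latency_ms),
    )
```

Keeping the payload builder separate from the publishing call means the metric shape can be unit tested without AWS access, and the same builder can be reused if the scores are later batched.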

Amazon Managed Grafana provides the visualization layer, using CloudWatch as its native data source. In this post, we describe two dedicated dashboards that surface SageMaker AI endpoint LLM quantity and quality metrics, as shown in the following screenshot.

Amazon Managed Grafana Dashboard page snippet showing the list of dashboards available (LLM Quantity monitoring and LLM Quality monitoring).

The Grafana quantity-based dashboard displays GPU memory utilization, CPU utilization, and invocation metrics per inference component. The quality-based Grafana dashboard displays composite quality scores, safety scores, and quality evaluation latency, compared across models, as shown in the following image. You can extend the Grafana dashboard by creating new dashboards based on your business or application use cases.

Amazon Managed Grafana Dashboard page showing the list of dashboards available (LLM Quantity monitoring and LLM Quality monitoring).

Monitoring quantity

Quantity monitoring gives you operational visibility into LLMs served on SageMaker AI endpoints. Without it, you can lose track of traffic patterns, resource saturation, cost attribution, and scaling behavior, all of which directly impact availability and spend. For multi-model endpoints using inference components, quantity monitoring answers critical operational questions: How many requests is each model serving? Are GPUs right-sized or over-provisioned? Which model is driving cost?

Beyond infrastructure metrics, quantity monitoring helps you assess the operational health and business impact of your LLM inference components across performance and reliability, resource utilization, and any business metrics specific to your organization. Together, these views show where latency is occurring, whether cost increases are driven by traffic growth or inefficient GPU allocation, and whether scaling policies are responding appropriately to demand.

The following Amazon Managed Grafana dashboard samples put these quantity monitoring dimensions into practice across three key areas. The first group of panels covers LLM invocations and latency. As shown in the following sample Grafana dashboard output, panels display Model Latency as a time-series trend, Total Invocations comparing models (for example, gpt-oss versus Qwen), and Per-Copy Invocations broken down for each model. These panels help operators understand request throughput patterns, identify latency spikes, and compare invocation distribution across model copies.

Amazon Managed Grafana panels showing Model Latency, Total invocations per model, and Per-Copy Invocations for each model.

The next group of panels focuses on GPU compute and memory utilization. The following Grafana dashboard samples present GPU Compute percentage and GPU Memory percentage panels for both models (for example, Qwen and gpt-oss). This cross-model comparison helps ML engineers and site reliability engineers (SREs) quickly determine whether a performance issue is GPU-compute-bound or memory-limited, and whether one model is consuming disproportionate resources on shared infrastructure.

Amazon Managed Grafana panels showing GPU Compute utilization per model, and GPU Memory utilization per model.

The third set of panels provides endpoint utilization and cost details. The following Cluster Overview and Cost Grafana dashboard sample shows Used GPUs versus Free GPUs and Total Instances to visualize cluster capacity, alongside per-model Cost/hour panels (for example, gpt-oss and Qwen). This view shows which model is driving cost, whether GPUs are over-provisioned or saturated, and whether auto scaling policies are responding to demand.

Amazon Managed Grafana panels showing Cost per Hour for each model, and the number of GPUs free and in use per instance.
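The post does not spell out how the per-model Cost/hour panels are derived. One plausible approach, sketched below under stated assumptions, is to split the instance price evenly across its GPUs and charge each model for the GPUs its copies reserve. The function names, the even per-GPU split, and the example price are hypothetical and are not taken from the sample repository.

```python
def gpu_capacity(instance_count, gpus_per_instance, gpus_by_model):
    """Return used and free GPU counts for the endpoint's cluster."""
    total = instance_count * gpus_per_instance
    used = sum(gpus_by_model.values())
    if used > total:
        raise ValueError("models reserve more GPUs than the cluster has")
    return {"total": total, "used": used, "free": total - used}


def cost_per_hour_by_model(instance_price_per_hour, gpus_per_instance, gpus_by_model):
    """Attribute hourly cost to each model in proportion to reserved GPUs.

    Assumes the instance price is shared evenly across its GPUs.
    """
    price_per_gpu_hour = instance_price_per_hour / gpus_per_instance
    return {model: gpus * price_per_gpu_hour for model, gpus in gpus_by_model.items()}


# Example with made-up numbers: 2 instances, 4 GPUs each, 8.00 per instance-hour.
reserved = {"gpt-oss-20b": 4, "Qwen2.5-7B-Instruct": 2}
capacity = gpu_capacity(instance_count=2, gpus_per_instance=4, gpus_by_model=reserved)
costs = cost_per_hour_by_model(8.00, 4, reserved)
# Free GPUs still cost money, so idle capacity is worth tracking separately.
idle_cost_per_hour = capacity["free"] * (8.00 / 4)
```

Under this scheme the per-model costs and the idle cost always add up to the full cluster price, which makes over-provisioning visible as its own line item instead of being hidden inside a model's bill.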

The following table summarizes the three quantity monitoring areas covered in the Grafana dashboard, along with their associated metrics and purpose:

| Metric type | Dashboard metric names | Captures | Purpose |
| --- | --- | --- | --- |
| Model Invocations & Latency | Model Latency, Total Invocations (gpt-oss vs Qwen), Per-Copy Invocations (gpt-oss), Per-Copy Invocations (Qwen) | Request throughput, response time, and per-copy invocation distribution | Identify latency spikes, compare model throughput, and understand invocation load balancing across copies |
| GPU Compute & Memory Utilization | GPU Compute % (Qwen), GPU Compute % (gpt-oss), GPU Memory % (Qwen), GPU Memory % (gpt-oss) | Per-model GPU compute and memory utilization percentages | Determine if issues are GPU-compute-bound or memory-limited, and detect disproportionate resource consumption across models |
| Endpoint Utilization & Cost | Used GPUs / Free GPUs / Instances, Cost/hour (gpt-oss), Cost/hour (Qwen) | Cluster capacity, GPU allocation status, and per-model hourly cost attribution | Identify cost drivers, detect over-provisioned or saturated GPUs, and validate auto scaling responsiveness |

Together, these dashboards give operators a single pane of glass to correlate cost, capacity, and utilization across models served on the endpoint. To set up these dashboards in your environment, follow the AWS samples GitHub repository sample notebook and extend the solution to create dashboards tailored to your organization's requirements.

Monitoring quality

While quantity metrics tell you whether the LLM serving infrastructure is healthy, quality metrics tell you whether LLMs are still performing as expected. LLM performance can degrade silently over time because of changes in input prompt distributions, concept drift, or shifts in real-world conditions. Unlike a latency spike or a 500 error, quality degradation rarely triggers traditional alerts.

Quality monitoring addresses this by evaluating model outputs across dimensions that matter to the business: response quality (relevance to user queries, factual accuracy, completeness, and consistency), safety and compliance (harmful content detection, bias monitoring, privacy compliance, and regulatory adherence), user experience quality (helpfulness, clarity, appropriate tone, and multi-turn conversation coherence), and domain-specific quality (technical accuracy for specialized domains, citation quality for Retrieval Augmented Generation (RAG) applications, and code correctness for programming assistants). Together, these dimensions help governance teams enforce guardrails, product owners track user-facing quality over time, and data scientists pinpoint whether a quality drop is caused by a specific prompt pattern, a model update, or a data distribution shift.

The following Amazon Managed Grafana dashboard sample output demonstrates quality monitoring across the SageMaker AI endpoint inference components (for example, LLMs gpt-oss-20b and Qwen2.5-7B-Instruct). The example dashboard tracks four quality scores, each displayed as a time-series line chart with configurable alert thresholds (shown as dashed lines at approximately 85% and 95%). The first panel shows the Composite Quality Score, an aggregate health indicator that combines quality dimensions. This metric displays the overall quality trend over time, making it straightforward to distinguish sustained degradation from intermittent quality drops that may correlate with specific prompt types.

Amazon Managed Grafana panels showing Composite Quality Score per model and Quality Evaluation Latency per model.
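The post describes the Composite Quality Score only as an aggregate that combines quality dimensions and does not give the formula. A weighted average is one common way to build such an aggregate, and the sketch below assumes that approach. The dimension names and weights are illustrative, not the values used in the sample notebooks.

```python
def composite_quality_score(scores, weights):
    """Weighted average of per-dimension scores (all on the same 0-100 scale).

    scores and weights are dicts keyed by dimension name. Every weighted
    dimension must have a score.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Hypothetical weights that emphasize safety over tone.
weights = {"safety": 5, "relevance": 3, "professional_tone": 2}
scores = {"safety": 98, "relevance": 90, "professional_tone": 86}
composite = composite_quality_score(scores, weights)  # (490 + 270 + 172) / 10
```

Because the weights are normalized inside the function, they can be expressed as any positive numbers. A weighted average can hide a single failing dimension behind strong scores elsewhere, which is one reason to keep the per-dimension panels and alerts alongside the composite.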

The second group of panels tracks specific LLM response quality metrics: Safety Score, Relevance Score, and Professional Tone Score. Safety Score monitors harmful or non-compliant content detection. On the dashboard output, this score remains the most stable of all four metrics, consistently hovering within the target threshold band, which indicates reliable safety guardrails across both models. Relevance Score measures how well LLM responses address user intent, helping teams identify prompt categories that may challenge an LLM's comprehension. Professional Tone Score evaluates whether outputs maintain an appropriate tone for the deployment context.

Amazon Managed Grafana panels showing Professional Tone Score per model, Safety Score per model, and Relevance Score per model.

These quality scores are computed using evaluation techniques such as an LLM-as-judge pattern with configurable evaluation rubrics. In these examples, we use Anthropic Claude Sonnet 4.6 served through Amazon Bedrock as the evaluator model, which is permitted under standard Amazon Bedrock service terms for LLM-as-judge use cases. You can substitute your own evaluation system, provided you confirm the chosen model's terms permit evaluating outputs from other models, you verify the data-residency requirements are met, and you pin the evaluator model to a specific version so quality scores remain comparable over time.
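As a rough illustration of the LLM-as-judge pattern with a configurable rubric, the sketch below builds the judge prompt and validates the judge's reply. The rubric wording, the JSON reply format, and the function names are assumptions for this example and do not come from the sample notebooks. The call to the evaluator model itself is deliberately left out.

```python
import json

# Hypothetical rubric: dimension name -> what the judge should assess.
RUBRIC = {
    "safety": "The response contains no harmful or non-compliant content.",
    "relevance": "The response addresses what the user actually asked.",
    "professional_tone": "The tone suits a professional deployment context.",
}


def build_judge_prompt(user_prompt, model_response, rubric=RUBRIC):
    """Render the evaluation prompt sent to the evaluator model."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    keys = ", ".join(rubric)
    return (
        "Score the RESPONSE to the PROMPT on each criterion from 0 to 100.\n"
        f"Criteria:\n{criteria}\n\n"
        f"PROMPT:\n{user_prompt}\n\n"
        f"RESPONSE:\n{model_response}\n\n"
        f"Reply with a single JSON object with exactly these keys: {keys}."
    )


def parse_judge_scores(raw_text, rubric=RUBRIC):
    """Extract and validate scores from the judge's reply.

    Tolerates text around the JSON object, but rejects missing keys and
    values outside the 0-100 range.
    """
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    payload = json.loads(raw_text[start : end + 1])
    scores = {}
    for name in rubric:
        if name not in payload:
            raise ValueError(f"judge reply is missing '{name}'")
        value = float(payload[name])
        if not 0 <= value <= 100:
            raise ValueError(f"score for '{name}' is out of range: {value}")
        scores[name] = value
    return scores
```

Validating the reply strictly matters here because a malformed or out-of-range score that reaches the metrics store looks like a genuine quality change on the dashboard.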

At a glance, you can compare quality across LLMs side by side, identifying which LLM is more stable, which quality dimension is the primary risk driver, and whether quality issues are intermittent (suggesting sensitivity to specific prompt types) or sustained (suggesting model degradation). Beyond visualization, threshold-based alert rules are deployed automatically through Grafana Alerting, dimensioned by the inference component so that alerts fire per inference component. When a quality score breaches its configured threshold, you can receive notifications through Amazon Simple Notification Service (Amazon SNS), enabling rapid SRE triage. Modern SRE teams use their existing automated triage processes, for example by integrating these alerts with Slack, PagerDuty, or OpsGenie to cut response times to seconds by automatically correlating logs, classifying alert severity, and prioritizing incidents for mitigation.

The following Grafana Alerting dashboard sample output shows threshold-based alert rules firing per inference component, with notifications routed to configured channels for immediate SRE triage.

Amazon Managed Grafana alert page snippet showing Low Safety Score Alert Firing, and Low Relevance Score Alert and Low Composite Quality Score Alert as normal.
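The distinction drawn earlier between intermittent and sustained quality issues can be expressed as a simple rule over a window of recent scores. The sketch below is an illustration of that idea in plain Python and is not how Grafana Alerting evaluates rules internally. The function name and the default run length of three consecutive breaches are assumptions.

```python
def classify_breaches(scores, threshold, sustained_run=3):
    """Classify a window of scores for one inference component.

    Returns "normal" if no score is below the threshold, "sustained" if at
    least `sustained_run` consecutive scores are below it, and
    "intermittent" otherwise.
    """
    longest_run = 0
    current_run = 0
    for score in scores:
        # Extend the current run on a breach, reset it on a healthy score.
        current_run = current_run + 1 if score < threshold else 0
        longest_run = max(longest_run, current_run)
    if longest_run == 0:
        return "normal"
    return "sustained" if longest_run >= sustained_run else "intermittent"


# Illustrative windows of Safety Score samples against an 85% threshold.
steady = classify_breaches([96, 95, 97], threshold=85)
spiky = classify_breaches([96, 80, 95, 82, 97], threshold=85)
degrading = classify_breaches([90, 84, 83, 80, 79], threshold=85)
```

A rule like this maps naturally onto alert severity: intermittent breaches can prompt a review of the prompts involved, while a sustained run is a stronger signal for paging or for considering a model swap.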

This view gives governance and product teams the evidence needed to make decisions about engineering adjustments, remediation actions, root cause analysis, model swapping, or other refinements. To set up this dashboard in your environment and learn more about the quality metrics, follow the AWS samples GitHub repository notebook.

Conclusion

Observability of LLM inference stacks in production requires more than monitoring uptime and error rates. As this post demonstrated, a comprehensive strategy must address two complementary dimensions: quantity and quality. Quantity covers the operational health of your infrastructure, including GPU utilization, cost attribution, scaling behavior, and request throughput. Quality covers the ongoing performance of your models, including response relevance, safety compliance, factual accuracy, and professional tone.

By combining Amazon SageMaker AI endpoint enhanced metrics, Amazon CloudWatch, and Amazon Managed Grafana, you can build a unified observability layer without custom instrumentation. Enhanced metrics give you per-model, per-GPU granularity on shared infrastructure. CloudWatch provides a single metrics store for both operational and quality signals. Grafana brings it together in dashboards that serve different stakeholders: SREs monitoring resource saturation and scaling, governance teams tracking safety and compliance thresholds, and product owners comparing model quality side by side.

To get started, check out the AWS samples GitHub repository, which includes sample notebooks to configure enhanced metrics, publish custom quality metrics and alerts, and set up the Grafana dashboards shown in this post.


About the authors

Sandeep Raveesh-Babu

Sandeep is a GenAI GTM Specialist Solutions Architect at AWS. He works with customers on LLM training, LLM inference, and GenAI observability. He focuses on product development, helping AWS build solutions to industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about generative AI solutions.

Jonathan Kola

Jonathan is a Senior Specialist Solutions Architect, GenAI/ML at AWS.


