This post was written with Dominic Catalano from Anyscale.
Organizations building and deploying large-scale AI models often face significant infrastructure challenges that can directly impact their bottom line: unstable training clusters that fail mid-job, inefficient resource utilization driving up costs, and complex distributed computing frameworks requiring specialized expertise. These factors can lead to wasted GPU hours, delayed projects, and frustrated data science teams. This post demonstrates how to address these challenges with a resilient, efficient infrastructure for distributed AI workloads.
Amazon SageMaker HyperPod is a purpose-built persistent generative AI infrastructure optimized for machine learning (ML) workloads. It provides robust infrastructure for large-scale ML workloads with high-performance hardware, so organizations can build heterogeneous clusters using tens to thousands of GPU accelerators. With nodes optimally co-located on a single spine, SageMaker HyperPod reduces networking overhead for distributed training. It maintains operational stability through continuous monitoring of node health, automatically swapping faulty nodes with healthy ones and resuming training from the most recently saved checkpoint, all of which can help save up to 40% of training time. For advanced ML users, SageMaker HyperPod allows SSH access to the nodes in the cluster, enabling deep infrastructure control, and provides access to SageMaker tooling, including Amazon SageMaker Studio, MLflow, and SageMaker distributed training libraries, along with support for various open source training libraries and frameworks. SageMaker Flexible Training Plans complement this by enabling GPU capacity reservation up to 8 weeks in advance for durations of up to 6 months.
The Anyscale platform integrates seamlessly with SageMaker HyperPod when using Amazon Elastic Kubernetes Service (Amazon EKS) as the cluster orchestrator. Ray is a leading AI compute engine, offering Python-based distributed computing capabilities to handle AI workloads ranging from multimodal AI and data processing to model training and model serving. Anyscale unlocks the power of Ray with comprehensive tooling for developer agility, critical fault tolerance, and an optimized version called RayTurbo, designed to deliver leading cost-efficiency. Through a unified control plane, organizations benefit from simplified management of complex distributed AI use cases with fine-grained control across hardware.
The combined solution provides extensive monitoring through SageMaker HyperPod real-time dashboards tracking node health, GPU utilization, and network traffic. Integration with Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana delivers deep visibility into cluster performance, complemented by Anyscale's monitoring framework, which provides built-in metrics for monitoring Ray clusters and the workloads that run on them.
This post demonstrates how to integrate the Anyscale platform with SageMaker HyperPod. This combination can deliver tangible business outcomes: reduced time-to-market for AI initiatives, lower total cost of ownership through optimized resource utilization, and increased data science productivity by minimizing infrastructure management overhead. It's ideal for Amazon EKS and Kubernetes-focused organizations, teams with large-scale distributed training needs, and those invested in the Ray ecosystem or SageMaker.
Solution overview
The following architecture diagram illustrates SageMaker HyperPod with Amazon EKS orchestration and Anyscale.

The sequence of events in this architecture is as follows:
- A user submits a job to the Anyscale Control Plane, which is the main user-facing endpoint.
- The Anyscale Control Plane communicates this job to the Anyscale Operator within the SageMaker HyperPod cluster in the SageMaker HyperPod virtual private cloud (VPC).
- The Anyscale Operator, upon receiving the job, initiates the process of creating the necessary pods by reaching out to the EKS control plane.
- The EKS control plane orchestrates the creation of a Ray head pod and worker pods. These pods represent a Ray cluster, running on SageMaker HyperPod with Amazon EKS.
- The Anyscale Operator submits the job through the head pod, which serves as the primary coordinator for the distributed workload.
- The head pod distributes the workload across multiple worker pods, as shown in the hierarchical structure in the SageMaker HyperPod EKS cluster.
- Worker pods execute their assigned tasks, potentially accessing required data from the storage services in the user VPC, such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), or Amazon FSx for Lustre.
- Throughout the job execution, metrics and logs are published to Amazon CloudWatch and Amazon Managed Service for Prometheus or Amazon Managed Grafana for observability.
- When the Ray job is complete, the job artifacts (final model weights, inference results, and so on) are saved to the designated storage service.
- Job results (status, metrics, logs) are sent through the Anyscale Operator back to the Anyscale Control Plane.
This flow shows the distribution and execution of user-submitted jobs across the available compute resources, while maintaining monitoring and data accessibility throughout the process.
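Conceptually, the head and worker pods in this flow follow a scatter/gather pattern: the head partitions the workload, workers compute in parallel, and the head aggregates the results. The following standard-library sketch illustrates only that coordination pattern; it is not Ray's API (in a real cluster, Ray tasks and actors play the worker role across pods):

```python
# Illustration of the head/worker scatter-gather pattern a Ray cluster uses.
# Standard library only -- this shows the coordination pattern, not Ray's API.
from concurrent.futures import ThreadPoolExecutor

def worker_task(shard):
    # Each "worker pod" processes its shard of the workload.
    return sum(x * x for x in shard)

def head_coordinator(data, num_workers=4):
    # The "head pod" partitions the job, scatters shards, and gathers results.
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(worker_task, shards)
    return sum(partials)

if __name__ == "__main__":
    print(head_coordinator(list(range(10))))  # sum of squares 0..9 -> 285
```

The aggregation step at the end mirrors step 9 of the flow, where final artifacts are collected and persisted.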
Prerequisites
Before you begin, make sure you have the following resources:
Set up the Anyscale Operator
Complete the following steps to set up the Anyscale Operator:
- In your workspace, download the aws-do-ray repository:
This repository has the commands needed to deploy the Anyscale Operator on a SageMaker HyperPod cluster. The aws-do-ray project aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. The aws-do-ray container shell is equipped with intuitive action scripts and comes preconfigured with convenient shortcuts, which save extensive typing and improve productivity. You can optionally use these features by building and opening a bash shell in the container following the instructions in the aws-do-ray README, or you can proceed with the following steps.
- If you proceed with these steps, make sure your environment is properly set up:
- Verify your connection to the HyperPod cluster:
- Obtain the name of the EKS cluster on the SageMaker HyperPod console. In your cluster details, you will see your EKS cluster orchestrator.

- Update kubeconfig to connect to the EKS cluster. The following screenshot shows an example output.

If the output indicates InProgress instead of Passed, wait for the deep health checks to finish.

- Review the env_vars file. Update the variable AWS_EKS_HYPERPOD_CLUSTER. You can leave the other values as default or make desired changes.
- Deploy your requirements:
This creates the anyscale namespace, installs Anyscale dependencies, configures login to your Anyscale account (this step will prompt you for additional verification, as shown in the following screenshot), adds the anyscale Helm chart, installs the ingress-nginx controller, and finally labels and taints SageMaker HyperPod nodes for the Anyscale worker pods.
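For reference, the env_vars file reviewed earlier is a plain shell fragment of variable assignments. The following is a hypothetical sketch; variable names other than AWS_EKS_HYPERPOD_CLUSTER and ANYSCALE_CLOUD_NAME may differ in the repository, and the values are placeholders:

```shell
# env_vars -- hypothetical values; replace with those for your account
export AWS_EKS_HYPERPOD_CLUSTER=my-hyperpod-cluster   # from the HyperPod console
export AWS_REGION=us-west-2                           # Region of the cluster
export ANYSCALE_CLOUD_NAME=my-anyscale-cloud          # used when registering the cloud
```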
- Create an EFS file system:
Amazon EFS serves as the shared cluster storage for the Anyscale pods.
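As an illustration of how shared EFS storage is typically surfaced to Kubernetes, it is exposed as a ReadWriteMany PersistentVolume through the EFS CSI driver. The following is a hypothetical sketch only; the deployment scripts handle this wiring for you, and the file system ID is a placeholder:

```yaml
# Hypothetical EFS-backed PersistentVolume; assumes the EFS CSI driver is installed.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: anyscale-shared-storage
spec:
  capacity:
    storage: 100Gi          # EFS is elastic; this value is nominal
  accessModes:
    - ReadWriteMany         # shared across head and worker pods
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0  # placeholder EFS file system ID
```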
At the time of writing, Amazon EFS and S3FS are the supported file system options when using Anyscale and SageMaker HyperPod setups with Ray on AWS. Although FSx for Lustre is not supported with this setup, you can use it with KubeRay on SageMaker HyperPod EKS.
- Register an Anyscale Cloud:
This registers a self-hosted Anyscale Cloud into your SageMaker HyperPod cluster. By default, it uses the value of ANYSCALE_CLOUD_NAME in the env_vars file. You can modify this field as needed. At this point, you will be able to see your registered cloud on the Anyscale console.
- Deploy the Kubernetes Anyscale Operator:
This command installs the Anyscale Operator in the anyscale namespace. The Operator will start posting health checks to the Anyscale Control Plane. To see the Anyscale Operator pod, run the following command:
kubectl get pods -n anyscale
Submit a training job
This section walks through a simple training job submission. The example implements distributed training of a neural network for Fashion MNIST classification using the Ray Train framework on SageMaker HyperPod with Amazon EKS orchestration, demonstrating how to use AWS managed ML infrastructure combined with Ray's distributed computing capabilities for scalable model training. Complete the following steps:
- Navigate to the jobs directory. This contains folders for the available example jobs you can run. For this walkthrough, go to the dt-pytorch directory containing the training job.
- Configure the required environment variables:
- Create the Anyscale compute configuration:
./1.create-compute-config.sh
- Submit the training job:
./2.submit-dt-pytorch.sh
This uses the job configuration specified in job_config.yaml. For more information on the job config, refer to JobConfig.
- Monitor the deployment. You will see the newly created head and worker pods in the anyscale namespace:
kubectl get pods -n anyscale
- View the job status and logs on the Anyscale console to monitor your submitted job's progress and output.
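For reference, job_config.yaml follows the Anyscale job configuration schema. The following is a hypothetical minimal sketch; the field values are placeholders, and the JobConfig reference linked above is the authoritative schema:

```yaml
# Hypothetical minimal Anyscale job configuration; values are placeholders.
name: dt-pytorch                   # job name shown on the Anyscale console
entrypoint: python train.py        # command the head pod runs to start training
working_dir: .                     # local directory uploaded with the job
compute_config: my-compute-config  # created by 1.create-compute-config.sh
max_retries: 1                     # retries before the job is marked failed
```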

Clean up
To clean up your Anyscale cloud, run the following command:
To delete your SageMaker HyperPod cluster and associated resources, delete the CloudFormation stack, if that is how you created the cluster and its resources.
Conclusion
This post demonstrated how to set up and deploy the Anyscale Operator on SageMaker HyperPod using Amazon EKS for orchestration. SageMaker HyperPod and Anyscale RayTurbo provide a highly efficient, resilient solution for large-scale distributed AI workloads: SageMaker HyperPod delivers robust, automated infrastructure management and fault recovery for GPU clusters, and RayTurbo accelerates distributed computing and optimizes resource utilization with no code changes required. By combining the high-throughput, fault-tolerant environment of SageMaker HyperPod with RayTurbo's faster data processing and smarter scheduling, organizations can train and serve models at scale with improved reliability and significant cost savings, making this stack ideal for demanding tasks like large language model pre-training and batch inference.
For more examples of using SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. For information on how customers are using RayTurbo, refer to RayTurbo.
About the authors
Sindhura Palakodety is a Senior Solutions Architect at AWS and Single-Threaded Leader (STL) for ISV Generative AI, where she is dedicated to empowering customers in developing enterprise-scale, Well-Architected solutions. She specializes in the generative AI and data analytics domains, helping organizations use innovative technologies for transformative business outcomes.
Mark Vinciguerra is an Associate Specialist Solutions Architect at AWS based in New York. He focuses on generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he attended Boston University and graduated with a degree in Computer Engineering.
Florian Gauter is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in AI/ML and generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian brings deep technical expertise to help organizations design and implement sophisticated ML solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their ML investments on AWS.
Alex Iankoulski is a Principal Solutions Architect in the Worldwide Specialist Organization at AWS. He focuses on orchestration of AI/ML workloads using containers. Alex is the author of the do-framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. Over the past 10 years, Alex has worked on helping customers do more on AWS, democratizing AI and ML, fighting climate change, and making travel safer, healthcare better, and energy smarter.
Anoop Saha is a Senior GTM Specialist at AWS focusing on generative AI model training and inference. He partners with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large companies, primarily focusing on silicon and system architecture of AI infrastructure.
Dominic Catalano is a Group Product Manager at Anyscale, where he leads product development across AI/ML infrastructure, developer productivity, and enterprise security. His work focuses on distributed systems, Kubernetes, and helping teams run AI workloads at scale.

