Wednesday, June 12, 2024

How LotteON constructed a personalised advice system utilizing Amazon SageMaker and MLOps


This publish is co-written with HyeKyung Yang, Jieun Lim, and SeungBum Shim from LotteON.

LotteON goals to be a platform that not solely sells merchandise, but in addition offers a personalised advice expertise tailor-made to your most well-liked life-style. LotteON operates varied specialty shops, together with vogue, magnificence, luxurious, and youngsters, and strives to offer a personalised purchasing expertise throughout all elements of shoppers’ existence.

To reinforce the purchasing expertise of LotteON’s prospects, the advice service improvement crew is constantly enhancing the advice service to offer prospects with the merchandise they’re on the lookout for or could also be taken with on the proper time.

On this publish, we share how LotteON improved their advice service utilizing Amazon SageMaker and machine studying operations (MLOps).

Drawback definition

Historically, the advice service was primarily offered by figuring out the connection between merchandise and offering merchandise that had been extremely related to the product chosen by the shopper. Nonetheless, it was essential to improve the advice service to research every buyer’s style and meet their wants. Due to this fact, we determined to introduce a deep learning-based advice algorithm that may determine not solely linear relationships within the information, but in addition extra complicated relationships. For that reason, we constructed the MLOps structure to handle the created fashions and supply real-time companies.

One other requirement was to construct a steady integration and steady supply (CI/CD) pipeline that may be built-in with GitLab, a code repository utilized by current advice platforms, so as to add newly developed advice fashions and create a construction that may constantly enhance the standard of advice companies via periodic retraining and redistribution of fashions.

Within the following sections, we introduce the MLOps platform that we constructed to offer high-quality suggestions to our prospects and the general means of inferring a deep learning-based advice algorithm (Neural Collaborative Filtering) in actual time and introducing it to LotteON.

Resolution structure

The next diagram illustrates the answer structure for serving Neural Collaborative Filtering (NCF) algorithm-based advice fashions as MLOps. The primary AWS companies used are SageMaker, Amazon EMR, AWS CodeBuild, Amazon Simple Storage Service (Amazon S3), Amazon EventBridge, AWS Lambda, and Amazon API Gateway. We’ve mixed a number of AWS companies utilizing Amazon SageMaker Pipelines and designed the structure with the next elements in thoughts:

  • Knowledge preprocessing
  • Automated mannequin coaching and deployment
  • Actual-time inference via mannequin serving
  • CI/CD construction

MLOps Architecture

The previous structure exhibits the MLOps information move, which consists of three decoupled passes:

  • Code preparation and information preprocessing (blue)
  • Coaching pipeline and mannequin deployment (inexperienced)
  • Actual-time advice inference (brown)

Code preparation and information preprocessing

The preparation and preprocessing section consists of the next steps:

  1. The info scientist publishes the deployment code containing the mannequin and the coaching pipeline to GitLab, which is utilized by LotteON, and Jenkins uploads the code to Amazon S3.
  2. The EMR preprocessing batch runs via Airflow in accordance with the required schedule. The preprocessing information is loaded into MongoDB, which is used as a function retailer together with Amazon S3.

Coaching pipeline and mannequin deployment

The mannequin coaching and deployment section consists of the next steps:

  1. After the coaching information is uploaded to Amazon S3, CodeBuild runs based mostly on the foundations laid out in EventBridge.
  2. The SageMaker pipeline predefined in CodeBuild runs, and sequentially runs steps corresponding to preprocessing together with provisioning, mannequin coaching, and mannequin registration.
  3. When coaching is full (via the Lambda step), the deployed mannequin is up to date to the SageMaker endpoint.

Actual-time advice inference

The inference section consists of the next steps:

  1. The consumer software makes an inference request to the API gateway.
  2. The API gateway sends the request to Lambda, which makes an inference request to the mannequin within the SageMaker endpoint to request a listing of suggestions.
  3. Lambda receives the checklist of suggestions and offers them to the API gateway.
  4. The API gateway offers the checklist of suggestions to the consumer software utilizing the Suggestion API.

Suggestion mannequin utilizing NCF

NCF is an algorithm based mostly on a paper offered on the Worldwide World Broad Net Convention in 2017. It’s an algorithm that covers the restrictions of linear matrix factorization, which is commonly utilized in current advice methods, with collaborative filtering based on the neural net. By including non-linearity via the neural web, the authors had been in a position to mannequin a extra complicated relationship between customers and objects. The info for NCF is interplay information the place customers react to objects, and the general construction of the mannequin is proven within the following determine (supply:

NCF Model

Though NCF has a easy mannequin structure, it has proven an excellent efficiency, which is why we selected it to be the prototype for our MLOps platform. For extra details about the mannequin, discuss with the paper Neural Collaborative Filtering.

Within the following sections, we talk about how this resolution helped us construct the aforementioned MLOps elements:

  • Knowledge preprocessing
  • Automating mannequin coaching and deployment
  • Actual-time inference via mannequin serving
  • CI/CD construction

MLOps element 1: Knowledge preprocessing

For NCF, we used user-item interplay information, which requires important assets to course of the uncooked information collected on the software and remodel it right into a type appropriate for studying. With Amazon EMR, which offers totally managed environments like Apache Hadoop and Spark, we had been in a position to course of information quicker.

The info preprocessing batches had been created by writing a shell script to run Amazon EMR via AWS Command Line Interface (AWS CLI) instructions, which we registered to Airflow to run at particular intervals. When the preprocessing batch was full, the coaching/take a look at information wanted for coaching was partitioned based mostly on runtime and saved in Amazon S3. The next is an instance of the AWS CLI command to run Amazon EMR:

aws emr create-cluster --release-label emr-6.0.0 
    --name "CLUSTER_NAME" 
    --applications Title=Hadoop Title=Hive Title=Spark 
    --tags 'Title=EMR-DATA-PREP' 'Proprietor=MODEL' 'Service=LOTTEON' 
    --ec2-attributes '{"KeyName":"keyname","InstanceProfile":"DefaultRole","ServiceAccessSecurityGroup":"sg-xxxxxxxxxxxxxx","SubnetId":"subnet- xxxxxxxxxxxxxx ","EmrManagedSlaveSecurityGroup":"sg- xxxxxxxxxxxxxx ","EmrManagedMasterSecurityGroup":"sg-xxxxxxxxxxxxxx "}' 
--instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"r5.xlarge","Name":"Master Instance Group"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"r5.xlarge","Name":"Core Instance Group"},{"InstanceCount":2,"BidPrice":"OnDemandPrice","InstanceGroupType":"TASK","InstanceType":"r5.xlarge","Name":"Task Instance Group"}]' 
    --service-role EMR_DefaultRole 
    --region ap-northeast-2 
    --steps Kind=CUSTOM_JAR,Title=DATA_PREP,ActionOnFailure=CONTINUE,Jar=s3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://bucket/prefix/"]  

MLOps element 2: Automated coaching and deployment of fashions

On this part, we talk about the elements of the mannequin coaching and deployment pipeline.

Occasion-based pipeline automation

After the preprocessing batch was full and the coaching/take a look at information was saved in Amazon S3, this occasion invoked CodeBuild and ran the coaching pipeline in SageMaker. Within the course of, the model of the outcome file of the preprocessing batch was recorded, enabling dynamic management of the model and administration of the pipeline run historical past. We used EventBridge, Lambda, and CodeBuild to attach the info preprocessing steps run by Amazon EMR and the SageMaker studying pipeline on an event-based foundation.

EventBridge is a serverless service that implements guidelines to obtain occasions and direct them to locations, based mostly on the occasion patterns and locations you determine. The preliminary position of EventBridge in our configuration was to invoke a Lambda perform on the S3 object creation occasion when the preprocessing batch saved the coaching dataset in Amazon S3. The Lambda perform dynamically modified the buildspec.yml file, which is indispensable when CodeBuild runs. These modifications encompassed the trail, model, and partition info of the info that wanted coaching, which is essential for finishing up the coaching pipeline. The following position of EventBridge was to dispatch occasions, instigated by the alteration of the buildspec.yml file, resulting in operating CodeBuild.

CodeBuild was accountable for constructing the supply code the place the SageMaker pipeline was outlined. All through this course of, it referred to the buildspec.yml file and ran processes corresponding to cloning the supply code and putting in the libraries wanted to construct from the trail outlined within the file. The Undertaking Construct tab on the CodeBuild console allowed us to evaluate the construct’s success and failure historical past, together with a real-time log of the SageMaker pipeline’s efficiency.

SageMaker pipeline for coaching

SageMaker Pipelines helps you outline the steps required for ML companies, corresponding to preprocessing, coaching, and deployment, utilizing the SDK. Every step is visualized inside SageMaker Studio, which may be very useful for managing fashions, and you may as well handle the historical past of skilled fashions and endpoints that may serve the fashions. You may also arrange steps by attaching conditional statements to the outcomes of the steps, so you’ll be able to undertake solely fashions with good retraining outcomes or put together for studying failures. Our pipeline contained the next high-level steps:

  • Mannequin coaching
  • Mannequin registration
  • Mannequin creation
  • Mannequin deployment

Every step is visualized within the pipeline in Amazon SageMaker Studio, and you may as well see the outcomes or progress of every step in actual time, as proven within the following screenshot.

SageMaker Pipeline

Let’s stroll via the steps from mannequin coaching to deployment, utilizing some code examples.

Prepare the mannequin

First, you outline a PyTorch Estimator to make use of for coaching and a coaching step. This requires you to have the coaching code (for instance, prepared prematurely and go the situation of the code as an argument of the source_dir. The coaching step runs the coaching code you go as an argument of the entry_point. By default, the coaching is finished by launching the container within the occasion you specify, so that you’ll have to go within the path to the coaching Docker picture for the coaching setting you’ve developed. Nonetheless, if you happen to specify the framework on your estimator right here, you’ll be able to go within the model of the framework and Python model to make use of, and it’ll robotically fetch the version-appropriate container picture from Amazon ECR.

If you’re finished defining your PyTorch Estimator, it is advisable to outline the steps concerned in coaching it. You are able to do this by passing the PyTorch Estimator you outlined earlier as an argument and the situation of the enter information. If you go within the location of the enter information, the SageMaker coaching job will obtain the practice and take a look at information to a selected path within the container utilizing the format /decide/ml/enter/information/ (for instance, /decide/ml/enter/information/practice).

As well as, when defining a PyTorch Estimator, you should utilize metric definitions to watch the training metrics generated whereas the mannequin is being skilled with Amazon CloudWatch. You may also specify the trail the place the outcomes of the mannequin artifacts after coaching are saved by specifying estimator_output_path, and you should utilize the parameters required for mannequin coaching by specifying model_hyperparameters. See the next code:

from sagemaker.pytorch import PyTorch
        {'Name': 'HR', 'Regex': 'HR=(.*?);'},
        {'Name': 'NDCG', 'Regex': 'NDCG=(.*?);'},
        {'Name': 'Loss', 'Regex': 'Loss=(.*?);'}
estimator_output_path = f's3://{bucket}/{prefix}'
model_hyperparameter = {'epochs': 10, 
                    'lr': 0.001,
                    'batch_size': 256,
                    'top_k' : 10,
                    'dropout' : 0.3,
                    'factor_num' : 32,
                    'num_layers' : 3
s3_code_uri = 's3://code_location/supply.tar.gz'

host_estimator = PyTorch(
    source_dir = s3_code_uri, 
    output_path = estimator_output_path, 
    session = pipeline_session,
    metric_definitions = metric_definitions

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
data_loc = f's3://{bucket}/{prefix}'
step_train = TrainingStep(
    title= "NCF-Coaching",
        "practice": TrainingInput(s3_data=data_loc),
        "take a look at": TrainingInput(s3_data=data_loc),        

Create a mannequin bundle group

The subsequent step is to create a mannequin bundle group to handle your skilled fashions. By registering skilled fashions in mannequin packages, you’ll be able to handle them by model, as proven within the following screenshot. This info permits you to reference earlier variations of your fashions at any time. This course of solely must be finished one time once you first practice a mannequin, and you may proceed so as to add and replace fashions so long as they declare the identical group title.

Model Packages

See the next code:

import boto3
sm_client = boto3.consumer("sagemaker")
model_package_group_input_dict = {
    "ModelPackageGroupName" : model_package_group_name,
    "ModelPackageGroupDescription" : "Mannequin Package deal Group"
response = sm_client.list_model_package_groups(NameContains=model_package_group_name)
if len(response['ModelPackageGroupSummaryList']) == 0:
create_model_pacakge_group_response = sm_client.create_model_package_group(**model_package_group_input_dict)

Add a skilled mannequin to a mannequin bundle group

The subsequent step is so as to add a skilled mannequin to the mannequin bundle group you created. Within the following code, once you declare the Mannequin class, you get the results of the earlier mannequin coaching step, which creates a dependency between the steps. A step with a declared dependency can solely be run if the earlier step succeeds. Nonetheless, you should utilize the DependsOn choice to declare a dependency between steps even when the info is just not causally associated.

After the skilled mannequin is registered within the mannequin bundle group, you should utilize this info to handle and observe future mannequin variations, create a real-time SageMaker endpoint, run a batch transform job, and extra.

from sagemaker.workflow.model_step import ModelStep
from sagemaker.mannequin import Mannequin

inference_image_uri = ''
mannequin = Mannequin(
    model_data =,

register_model_step_args = mannequin.register(

step_model_registration = ModelStep(

Create a SageMaker mannequin

To create a real-time endpoint, an endpoint configuration and model is required. To create a model, you want two fundamental components: an S3 deal with the place the mannequin’s artifacts are saved, and the trail to the inference Docker picture that can run the mannequin’s artifacts.

When making a SageMaker mannequin, you will need to take note of the next steps:

  • Present the results of the mannequin coaching step,, which might be transformed to the S3 path the place the mannequin artifact is saved, as an argument of the model_data.
  • Since you specified the PyTorchModel class, framework_version, and py_version, you utilize this info to get the trail to the inference Docker picture via Amazon ECR. That is the inference Docker picture that’s used for mannequin deployment. Ensure that to enter the identical PyTorch framework, Python model, and different particulars that you just used to coach the mannequin. This implies holding the identical PyTorch and Python variations for coaching and inference.
  • Present the because the entry level script to deal with invocations.

This step will set a dependency on the mannequin bundle registration step you outlined through the DependsOn possibility.

from sagemaker.pytorch.mannequin import PyTorchModel
from sagemaker.workflow.model_step import ModelStep

s3_code_uri = 's3://code_location/supply.tar.gz'

model_inference = PyTorchModel(
        title = model_name,
        model_data =, 
image_uri= image_uri,
        source_dir = s3_code_uri,
step_model_create = ModelStep(

Create a SageMaker endpoint

Now it is advisable to outline an endpoint configuration based mostly on the created mannequin, which is able to create an endpoint when deployed. As a result of the SageMaker Python SDK doesn’t assist the step associated to deployment (as of this writing), you should utilize Lambda to register that step. Cross the required arguments to Lambda, corresponding to instance_type, and use that info to create the endpoint configuration first. Since you’re calling the endpoint based mostly on endpoint_name, it is advisable to guarantee that variable is outlined with a novel title. Within the following Lambda perform code, based mostly on the endpoint_name, you replace the mannequin if the endpoint exists, and deploy a brand new one if it doesn’t:

import json
import boto3
def lambda_handler(occasion, context):
    sm_client = boto3.consumer("sagemaker")
    model_name = occasion["model_name"]
    endpoint_config_name = occasion["endpoint_config_name"]
    endpoint_name = occasion["endpoint_name"]
    instance_type = occasion["instance_type"]
    create_endpoint_config_response = sm_client.create_endpoint_config(
                "InstanceType": instance_type,
                "InitialVariantWeight": 1,
                "InitialInstanceCount": 1,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
    print(f"create_endpoint_config_response: {create_endpoint_config_response}")
    existing_endpoints = sm_client.list_endpoints(NameContains=endpoint_name)['Endpoints']
    if len(existing_endpoints["Endpoints"]) > 0:
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
            EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
    return {"statusCode": 200, "physique": json.dumps("Endpoint Created Efficiently")}

To get the Lambda perform right into a step within the SageMaker pipeline, you should utilize the SDK related to the Lambda perform. By passing the situation of the Lambda perform supply as an argument of the perform, you’ll be able to robotically register and use the perform. Together with this, you’ll be able to outline LambdaStep and go it the required arguments. See the next code:

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import (LambdaStep, LambdaOutput, LambdaOutputTypeEnum)
deploy_model_func = Lambda(
output_param_1 = LambdaOutput(output_name="statusCode", output_type=LambdaOutputTypeEnum.String)
output_param_2 = LambdaOutput(output_name="physique", output_type=LambdaOutputTypeEnum.String)

step_deploy_lambda = LambdaStep(
        "endpoint_config_name": endpoint_config_name,
        "endpoint_name": endpoint_name,
        "instance_type": 'ml.p3.2xlarge',       
    outputs=[output_param_1, output_param_2]

Create a SageMaker pipeline

Now you’ll be able to create a pipeline utilizing the steps you outlined. You are able to do this by defining a reputation for the pipeline and passing within the steps for use within the pipeline as arguments. After that, you’ll be able to run the outlined pipeline via the beginning perform. See the next code:

from sagemaker.workflow.pipeline import Pipeline
pipeline = Pipeline(
    steps=[step_train, step_model_registration, step_model_create, step_deploy_lambda],


After this course of is full, an endpoint is created with the skilled mannequin and is prepared to be used based mostly on the deep learning-based mannequin.

MLOps element 3: Actual-time inference with mannequin serving

Now let’s see invoke the mannequin in actual time from the created endpoint, which may also be accessed utilizing the SageMaker SDK. The next code is an instance of getting real-time inference values for enter values from an endpoint deployed through the invoke_endpoint perform. The options you go as arguments to the physique are handed as enter to the endpoint, which returns the inference leads to actual time.

import boto3
sagemaker_runtime = boto3.consumer("sagemaker-runtime")
response = sagemaker_runtime.invoke_endpoint(
                    Physique=bytes("'options': '{"person": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], "merchandise": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]}'}")

After we configured the inference perform, we had it return the objects within the order that the person is almost certainly to love among the many objects handed in. The previous instance returns objects from 1–25 so as of probability of being favored by the person at index 0.

We added enterprise logic to the function, configured it in Lambda, and linked it with an API gateway to implement the API’s capacity to return really helpful objects in actual time. We then performed efficiency testing of the net service. We load examined it with Locust utilizing 5 g4dn.2xlarge situations and located that it could possibly be reliably served in an setting with 1,000 TPS.

MLOps element 4: CI/CD construction

A CI/CD construction is a basic a part of DevOps, and can also be an essential a part of organizing an MLOps setting. AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline collectively present all of the performance you want for CI/CD, from code shaping to deployment, construct, and batch administration. The companies aren’t solely linked to the identical code collection, but in addition to different companies corresponding to GitHub and Jenkins, so you probably have an current CI/CD construction, you should utilize them individually to fill within the gaps. Due to this fact, we expanded our CI/CD construction by linking solely the CodeBuild configuration described earlier to our current CI/CD pipeline.

We linked our SageMaker notebooks with GitLab for code administration, and once we had been finished, we replicated them to Amazon S3 through Jenkins. After that, we set the S3 path to the default repository path of the NCF CodeBuild undertaking as described earlier, in order that we may construct the undertaking with CodeBuild.


Up to now, we’ve seen the end-to-end means of configuring an MLOps setting utilizing AWS companies and offering real-time inference companies based mostly on deep studying fashions. By configuring an MLOps setting, we’ve created a basis for offering high-quality companies based mostly on varied algorithms to our prospects. We’ve additionally created an setting the place we will rapidly proceed with prototype improvement and deployment. The NCF we developed with the prototyping algorithm was additionally in a position to obtain good outcomes when it was put into service. Sooner or later, the MLOps platform may also help us rapidly develop and experiment with fashions that match LotteON information to offer our prospects with a progressively higher-quality advice expertise.

Utilizing SageMaker along side varied AWS companies has given us many benefits in creating and working our companies. As mannequin builders, we didn’t have to fret about configuring the setting settings for often used packages and deep learning-related frameworks as a result of the setting settings had been configured for every library, and we felt that the connectivity and scalability between AWS companies utilizing AWS CLI instructions and associated SDKs had been nice. Moreover, as a service operator, it was good to trace and monitor the companies we had been operating as a result of CloudWatch linked the logging and monitoring of every service.

You may also try the NCF and MLOps configuration for hands-on follow on our GitHub repo (Korean).

We hope this publish will assist you configure your MLOps setting and supply real-time companies utilizing AWS companies.

Concerning the Authors

SeungBum Shim is an information engineer within the Lotte E-commerce Suggestion Platform Growth Staff, accountable for discovering methods to make use of and enhance recommendation-related merchandise via LotteON information evaluation, and creating MLOps pipelines and ML/DL advice fashions.

HyeKyung Yang is a analysis engineer within the Lotte E-commerce Suggestion Platform Growth Staff and is answerable for creating ML/DL advice fashions by analyzing and using varied information and creating a dynamic A/B take a look at setting.

Jieun Lim is an information engineer within the Lotte E-commerce Suggestion Platform Growth Staff and is answerable for working LotteON’s personalised advice system and creating personalised advice fashions and dynamic A/B take a look at environments.

Jesam Kim is an AWS Options Architect and helps enterprise prospects undertake and troubleshoot cloud applied sciences and offers architectural design and technical assist to handle their enterprise wants and challenges, particularly in AIML areas corresponding to advice companies and generative AI.

Gonsoo Moon is an AWS AI/ML Specialist Options Architect and offers AI/ML technical assist. His fundamental position is to collaborate with prospects to resolve their AI/ML issues based mostly on varied use instances and manufacturing expertise in AI/ML.

Source link

Read more

Read More