When I started working with AWS SageMaker, one of the most common questions was: “Which inference type should I choose for my model?” SageMaker offers four different options, and at first glance, the differences between them aren’t always obvious. Let’s break down each approach and when to use it.

What is Payload and Why Does It Matter?

Before diving into inference types, it’s important to understand the term Payload — this is the size of data you send to the model for processing. For example:

  • For classifying a single 224x224 pixel image in JSON format — about 150 KB

  • For analyzing a 10-page text document — about 50–100 KB

  • For processing a video fragment — this could be tens or hundreds of megabytes

Different inference types have different payload limitations because architecturally they solve different problems: some are optimized for fast responses with small data, others for long-duration processing of large volumes.
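
A quick way to check where you stand, before any AWS calls, is to measure the serialized request body. A minimal sketch (the request body below is a made-up example):

```python
import json

# Hypothetical request body; measure its serialized size before choosing
# an inference type (e.g., 25 MB limit for real-time, 4 MB for serverless).
request_body = {"document": "full text of a 10-page contract..."}

payload_bytes = json.dumps(request_body).encode("utf-8")
print(f"Payload size: {len(payload_bytes) / 1024:.1f} KB")
```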

Real-Time Inference: For Instant Responses

Technical Specifications:

  • Payload: up to 25 MB

  • Processing time: up to 60 seconds (8 minutes for streaming)

  • Time to start processing: instant (endpoint is already running)

  • Instance management: you manually choose the type and number of instances (e.g., ml.m5.xlarge)

  • Pricing model: pay for instance uptime 24/7, regardless of request volume

  • Persistent endpoint: REST API is always available

When to use: Real-time inference is your choice for production applications where users expect immediate responses.

Practical example: E-commerce recommendation system. When a user views a product, you need to instantly show personalized recommendations. Even a few seconds of delay can lead to lost conversions. A real-time endpoint processes each request in milliseconds and can scale to handle thousands of concurrent users.

Important note: You pay for instances constantly, even when there are no requests. For example, if you launched ml.m5.xlarge ($0.23/hour), you’ll pay about $166 per month, even if there were no requests on weekends.
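
For reference, deploying such an endpoint with the SageMaker Python SDK looks roughly like this. This is a sketch: the container image, model artifact, IAM role, and payload are placeholders, not working values:

```python
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer

# Placeholders: substitute your own container image, artifact, and role.
model = Model(
    image_uri="<account>.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    model_data="s3://my-bucket/models/model.tar.gz",
    role="arn:aws:iam::<account>:role/SageMakerExecutionRole",
)

# A persistent endpoint: you pick the instance type and count,
# and billing runs 24/7 from this moment on.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=JSONSerializer(),
)

# Synchronous call; the response comes back in the same HTTP request.
result = predictor.predict({"user_id": 42})  # hypothetical payload
```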

Serverless Inference: Pay Only for Usage

Technical Specifications:

  • Payload: up to 4 MB (smaller due to serverless architecture)

  • Processing time: up to 60 seconds

  • Time to start processing:

      • Cold start (first request): 10–30 seconds (SageMaker spins up an instance)

      • Warm start (subsequent requests): instant (if the instance is still active)

      • Important: a warm instance falls back to a cold start after approximately 10–15 minutes of inactivity

  • Instance management: fully automatic, you don’t manage instances

  • Pricing model: pay only for request processing time (by millisecond) + memory volume

  • Auto-scaling: from 0 to the required number of instances

When to use: Ideal for unpredictable traffic or when your service is used irregularly.

Practical example: Internal tool for contract analysis at a law firm. Lawyers upload documents several times a day, but not constantly. With serverless inference, you don’t pay for idle instances at night or on weekends. When a request arrives, SageMaker automatically allocates resources, processes the document, and releases them.

Important note: The first request after idle time will take 10–30 seconds (cold start), which is critical to consider for user experience. If your service receives requests less frequently than once every 10–15 minutes, every request will experience a cold start. You pay for actual execution time — if processing took 2 seconds, you pay only for those 2 seconds.
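
Switching the same model to serverless is mostly a matter of swapping the deploy configuration. A sketch, assuming the `model` object from the previous example (the memory size and concurrency values are illustrative):

```python
from sagemaker.serverless import ServerlessInferenceConfig

# No instance type to choose: SageMaker provisions capacity per request
# and scales to zero on its own. Memory ranges from 1024 to 6144 MB.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,  # concurrent invocations before throttling
)

predictor = model.deploy(serverless_inference_config=serverless_config)
```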

Batch Transform: Bulk Data Processing

Technical Specifications:

  • Payload: datasets sized in gigabytes (virtually no limits)

  • Processing time: from minutes to days

  • Time to start processing: 5–10 minutes (time to launch instances and load model)

  • Instance management: you specify the type and number of instances for the job

  • Pricing model: pay only for batch job execution time (from start to completion)

  • No persistent endpoint: instances are created for the task and deleted after completion

When to use: When you need to process large volumes of data and execution time isn’t critical.

Practical example: Nightly image processing for content moderation. You have 100,000 images uploaded by users during the day that need to be checked for violations. Launch a batch transform job at 2 AM with 5 ml.p3.2xlarge instances, which processes all images in 3 hours, saves results to S3, and automatically terminates. You pay only for these 3 hours of work x 5 instances = 15 instance-hours.

Important note: No payload size limit, as data is read in batches from S3. Batch Transform is cost-optimal for large volumes because instances are automatically deleted after completion.
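
The nightly moderation job above could be sketched like this with the SageMaker Python SDK. Bucket paths and the content type are hypothetical, and `model` is the same object as in the earlier examples:

```python
# Instances exist only for the lifetime of this job.
transformer = model.transformer(
    instance_count=5,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/moderation-results/",  # hypothetical bucket
    strategy="SingleRecord",  # one image per request
)

transformer.transform(
    data="s3://my-bucket/uploaded-images/",  # input dataset in S3
    content_type="application/x-image",
    wait=True,  # block until the job finishes; instances are then deleted
)
```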

Asynchronous Inference: For Heavy Tasks

Technical Specifications:

  • Payload: up to 1 GB (large size because data is uploaded through S3)

  • Processing time: up to 1 hour per request

  • Time to start processing:

      • If the endpoint is active: instant (the request goes into the queue)

      • If the endpoint is scaled to 0: 2–5 minutes (time to scale up)

  • Instance management: you choose instance types and configure auto-scaling (including scale-to-zero)

  • Pricing model: pay for instance uptime, but can configure scale-to-zero when there are no requests

  • Request queue: built-in queuing system (Amazon SQS)

When to use: For tasks requiring significant computational resources and time.

Practical example: AI video generation. A user uploads parameters for creating a 3D animation or video presentation. The process can take 20–40 minutes. With asynchronous inference, the request is placed in a queue, the user receives a request ID, and when processing is complete, a webhook notifies the frontend. If there are no requests for more than 15 minutes, the endpoint automatically scales to 0, saving your budget.

Important note: You pay for active instance time. If you configured scale-to-zero with a 15-minute idle period, and requests arrive once per hour, out of 24 hours you’ll pay for approximately 2–3 hours of actual work + 15 minutes of idle time after each request.
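
Input and output travel through S3 rather than the HTTP body, which is what makes 1 GB payloads possible. A sketch with placeholder bucket names (scale-to-zero itself is configured separately through Application Auto Scaling with MinCapacity=0):

```python
import boto3
from sagemaker.async_inference import AsyncInferenceConfig

# Results are written to S3 instead of being returned in the HTTP response.
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-results/",
    max_concurrent_invocations_per_instance=2,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # illustrative GPU instance
    async_inference_config=async_config,
)

# The request body is just an S3 pointer; the payload itself lives in S3.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint_async(
    EndpointName=predictor.endpoint_name,
    InputLocation="s3://my-bucket/inputs/render-job-42.json",
)
print(response["OutputLocation"])  # poll this S3 key for the finished result
```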

Comparison Table: Management and Pricing

| | Real-Time | Serverless | Batch Transform | Asynchronous |
|---|---|---|---|---|
| Max payload | 25 MB | 4 MB | GB-scale datasets | 1 GB (via S3) |
| Max processing time | 60 s (8 min streaming) | 60 s | minutes to days | 1 hour |
| Time to start | instant | instant warm / 10–30 s cold | 5–10 min | instant, or 2–5 min from zero |
| Instance management | manual (type and count) | fully automatic | manual, per job | manual + auto-scaling (incl. scale-to-zero) |
| Pricing | instance uptime 24/7 | per-millisecond compute + memory | job duration only | instance uptime; scale-to-zero when idle |

How to Choose?

Ask yourself four questions (a small decision-helper sketch follows the list):

1. How long is the user willing to wait for a response?

  • Milliseconds/seconds → Real-Time

  • Several seconds (willing to wait) → Serverless

  • Minutes/hours → Asynchronous

  • Doesn’t matter → Batch Transform

2. What’s the traffic pattern?

  • Constant, predictable → Real-Time

  • Unpredictable/rare → Serverless

  • Periodic (once a day/week) → Batch Transform

  • Peak loads with long processing → Asynchronous

3. What’s the data size?

  • <4 MB → any option

  • 4–25 MB → Real-Time or Asynchronous

  • 25 MB — 1 GB → Asynchronous

  • >1 GB → Batch Transform

4. Are you willing to pay for idle time?

  • Yes, need maximum speed → Real-Time

  • No, irregular traffic → Serverless or Asynchronous (scale-to-zero)

  • No, periodic task → Batch Transform
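
If it helps, the four questions can be collapsed into a naive helper function. This is purely illustrative; the thresholds simply restate the limits from this article:

```python
def pick_inference_type(payload_mb: float, latency: str, traffic: str) -> str:
    """latency: 'ms' | 'seconds' | 'minutes' | 'any'; traffic: 'steady' | 'spiky' | 'periodic'."""
    if payload_mb > 1024 or traffic == "periodic" or latency == "any":
        return "Batch Transform"   # huge data or time-insensitive, periodic work
    if latency == "ms" or traffic == "steady":
        return "Real-Time"         # constant traffic, instant responses
    if latency == "seconds" and payload_mb <= 4:
        return "Serverless"        # irregular traffic, small payloads
    return "Asynchronous"          # heavy, long-running requests

print(pick_inference_type(0.5, "ms", "steady"))       # Real-Time
print(pick_inference_type(200, "minutes", "spiky"))   # Asynchronous
```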

Conclusion

Choosing the right inference type in SageMaker significantly affects both your ML service’s performance and your infrastructure costs. Understanding the differences in instance management and pricing models is key to cost optimization. There’s no universal solution: analyze your requirements for latency, data volume, and request frequency, and pick the option that fits. Importantly, some projects may need a combination of several types for different use cases.
