Today, Amazon Bedrock introduces new service tiers that give you more control over your AI workload costs while maintaining the performance levels your applications need.
I work with customers building AI applications, and I've seen firsthand how different workloads require different performance and cost trade-offs. Many organizations running AI workloads struggle to balance performance requirements with cost optimization. Some applications need rapid response times for real-time interactions, whereas others can process data more gradually. With these challenges in mind, today we're announcing additional pricing options that give you more flexibility in matching your workload requirements with cost optimization.
Amazon Bedrock now offers three service tiers for workloads: Priority, Standard, and Flex. Each tier is designed to match specific workload requirements. Applications have varying response time requirements based on the use case. Some applications, such as financial trading systems, demand the fastest response times. Others need rapid response times to support business processes like content generation, and applications such as content summarization can process data more gradually.
The Priority tier processes your requests ahead of other tiers, providing preferential compute allocation for mission-critical applications like customer-facing chat-based assistants and real-time language translation services, though at a premium price point. The Standard tier provides consistent performance at regular rates for everyday AI tasks, ideal for content generation, text analysis, and routine document processing. For workloads that can tolerate longer latency, the Flex tier offers a more cost-effective option with lower pricing, which is well suited for model evaluations, content summarization, and multistep analysis and agentic workflows.
You can now optimize your spending by matching each workload to the most appropriate tier. For example, if you're running a customer service chat-based assistant that needs quick responses, you can use the Priority tier to get the fastest processing times. For content summarization tasks that can tolerate longer processing times, you can use the Flex tier to reduce costs while maintaining reliable performance. For most models that support the Priority tier, customers can see up to 25% higher output tokens per second (OTPS) compared to the Standard tier.
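OTPS is output tokens divided by generation time, so a percentage gain is easy to check by hand. The sketch below uses made-up timings (the same 500-token response finishing in 10 s versus 8 s), not measured Amazon Bedrock numbers, to show what a 25% OTPS improvement looks like.

```python
def otps(output_tokens: int, seconds: float) -> float:
    """Output tokens per second for a single response."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return output_tokens / seconds


def percent_improvement(baseline: float, candidate: float) -> float:
    """Relative gain of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Hypothetical timings for the same 500-token response on two tiers.
standard = otps(500, 10.0)  # 50.0 tokens/s
priority = otps(500, 8.0)   # 62.5 tokens/s
print(f"{percent_improvement(standard, priority):.1f}% higher OTPS")
```

Note that a 25% OTPS gain shortens generation time by 20% (10 s to 8 s in this example), not by 25%.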
Choosing the right tier for your workload
Here's a mental model to help you choose the right tier for your workload.
| Category | Recommended service tier | Description |
|---|---|---|
| Mission-critical | Priority | Requests are handled ahead of other tiers. Lower-latency responses for user-facing apps (for example, customer service chat assistants, real-time language translation, interactive AI assistants) |
| Business-standard | Standard | Responsive performance for important workloads (for example, content generation, text analysis, routine document processing) |
| Business-noncritical | Flex | Cost-efficient for less urgent workloads (for example, model evaluations, content summarization, multistep agentic workflows) |
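The table can be encoded as a small lookup so each workload's tier is decided in one place. This is a minimal sketch: the enum and function names are hypothetical, and it assumes the `service_tier` values used in the API example in this post (`"priority"`, `"default"`, `"flex"`), where the Standard tier is requested as `"default"`.

```python
from enum import Enum


class WorkloadCategory(Enum):
    """Workload categories from the table (hypothetical names)."""
    MISSION_CRITICAL = "mission-critical"
    STANDARD = "standard"
    NONCRITICAL = "noncritical"


# Maps each category to the service_tier value sent with the request.
# Assumption: the Standard tier is requested with the value "default".
SERVICE_TIER_BY_CATEGORY = {
    WorkloadCategory.MISSION_CRITICAL: "priority",
    WorkloadCategory.STANDARD: "default",
    WorkloadCategory.NONCRITICAL: "flex",
}


def service_tier_for(category: WorkloadCategory) -> str:
    """Return the service_tier value for a workload category."""
    return SERVICE_TIER_BY_CATEGORY[category]


print(service_tier_for(WorkloadCategory.NONCRITICAL))  # flex
```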
Start by reviewing your current usage patterns with application owners. Next, identify which workloads need rapid responses and which can process data more gradually. You can then begin routing a small portion of your traffic through different tiers to test the performance and cost benefits.
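One way to send a small, stable share of traffic to a candidate tier is to hash a request or user identifier into 100 buckets. This is a minimal sketch of a percentage-based split done in your own application code, not an Amazon Bedrock feature; the `choose_tier` name and the 10% share are hypothetical choices.

```python
import hashlib


def choose_tier(request_id: str, candidate_tier: str, candidate_percent: int,
                baseline_tier: str = "default") -> str:
    """Route roughly candidate_percent% of IDs to candidate_tier.

    Hashing makes the choice deterministic, so the same ID always lands
    on the same tier across retries and runs.
    """
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return candidate_tier if bucket < candidate_percent else baseline_tier


# Try Flex on about 10% of (hypothetical) summarization requests.
ids = [f"req-{n}" for n in range(10_000)]
flex_share = sum(choose_tier(rid, "flex", 10) == "flex" for rid in ids) / len(ids)
print(f"Flex share: {flex_share:.1%}")
```

Because the split is keyed on the identifier rather than on a random draw, you can compare the two tiers on stable cohorts and widen the percentage gradually.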
The AWS Pricing Calculator helps you estimate costs for different service tiers when you enter your expected workload for each tier, so you can plan your budget based on your specific usage patterns.
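For a quick back-of-the-envelope comparison, cost is token volume multiplied by per-token price. In this sketch the prices are placeholders, not actual Amazon Bedrock rates, and the workload shape (1,000 requests per day, 500 input and 200 output tokens each) is hypothetical; substitute the published prices for your model and Region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPrice:
    """Price in USD per one million tokens (placeholder values below)."""
    input_per_million: float
    output_per_million: float


def monthly_cost(price: TierPrice, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimate monthly cost for a workload with fixed tokens per request."""
    per_request = (input_tokens * price.input_per_million
                   + output_tokens * price.output_per_million) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical prices for illustration only; these are NOT real Bedrock rates.
prices = {
    "priority": TierPrice(1.50, 6.00),
    "default": TierPrice(1.00, 4.00),
    "flex": TierPrice(0.50, 2.00),
}
for tier, price in prices.items():
    cost = monthly_cost(price, requests_per_day=1_000, input_tokens=500, output_tokens=200)
    print(f"{tier:>8}: ${cost:,.2f}/month")
```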
To monitor your usage and costs, you can use the AWS Service Quotas console, or turn on model invocation logging in Amazon Bedrock and track the metrics with Amazon CloudWatch. These tools give you visibility into your token usage and help you track performance across the different tiers.
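Once you have per-request timings for each tier, comparing tiers is a small aggregation. The `Measurement` record below (tier, output tokens, seconds) is a hypothetical simplification for illustration, not the schema of Bedrock invocation logs or CloudWatch metrics; you would first extract these fields from whichever source you use.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, NamedTuple


class Measurement(NamedTuple):
    """One response's timing; a hypothetical, simplified record."""
    tier: str
    output_tokens: int
    seconds: float


def median_otps_by_tier(records: Iterable[Measurement]) -> dict[str, float]:
    """Median output tokens per second for each tier."""
    rates: dict[str, list[float]] = defaultdict(list)
    for r in records:
        if r.seconds > 0:  # skip malformed or zero-duration records
            rates[r.tier].append(r.output_tokens / r.seconds)
    return {tier: median(values) for tier, values in rates.items()}


# Made-up sample data, not real measurements.
sample = [
    Measurement("default", 400, 8.0),
    Measurement("default", 300, 5.0),
    Measurement("priority", 400, 6.0),
    Measurement("flex", 400, 16.0),
]
print(median_otps_by_tier(sample))
```

The median is used rather than the mean so that a few unusually slow or fast responses do not dominate the comparison.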
You can start using the new service tiers today. You choose the tier on a per-API-call basis. Here is an example using the OpenAI Chat Completions API, but you can pass the same service_tier parameter in the body of the InvokeModel, InvokeModelWithResponseStream, Converse, and ConverseStream APIs (for supported models):
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1",
    # Amazon Bedrock API key, read here from the AWS_BEARER_TOKEN_BEDROCK environment variable
    api_key=os.environ["AWS_BEARER_TOKEN_BEDROCK"],
)

completion = client.chat.completions.create(
    model="openai.gpt-oss-20b-1:0",
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    service_tier="priority",  # options: "priority" | "default" | "flex"
)

print(completion.choices[0].message)
```
To learn more, check out the Amazon Bedrock User Guide or contact your AWS account team for detailed planning assistance.
I'm looking forward to hearing how you use these new pricing options to optimize your AI workloads. Share your experience with me online on social networks or connect with me at AWS events.


