AWS Machine Learning Blog — October 23, 03:03
Amazon Bedrock Cost Control: Proactively Managing Token Usage with a Sentinel Mechanism

This article presents a solution for proactively managing token usage costs for the Amazon Bedrock generative AI service. It stresses the importance of distinguishing leading from trailing indicators and details a cost control mechanism named "Sentinel." Built on AWS Step Functions, the mechanism prevents unnecessary spend by validating that token usage has not exceeded preset limits. The article examines the core components, including the rate limiter workflow, the model routing workflow, CloudWatch metric tracking, and DynamoDB configuration, and analyzes performance and cost-effectiveness, finding in particular that Express Step Functions have a cost advantage over Standard Step Functions. The solution aims to give Amazon Bedrock deployments predictable, cost-effective cost control.

💡 **The importance of proactive cost control**: The article notes that the token-based billing model of Amazon Bedrock can lead to unexpectedly high bills. Traditional cost monitoring methods, such as budget alerts, are reactive. Effective cost management requires combining leading indicators (predictive signals) with trailing indicators (retrospective measurements) to enable proactive intervention and strategic decision-making.

⚙️ **The Sentinel cost sentry mechanism**: At the core of the solution is a cost sentry mechanism named "Sentinel." It uses a serverless architecture in which AWS Step Functions orchestrates a workflow that validates token usage against configured limits before allowing Amazon Bedrock inference requests to proceed, providing a centralized, scalable framework for controlling generative AI spend.

📊 **Core components and workflows**: The solution includes a rate limiter workflow that retrieves token usage metrics from CloudWatch, compares them against preset limits stored in DynamoDB, and decides whether the request may proceed. A separate Amazon Bedrock model routing workflow invokes the actual Bedrock model and abstracts and normalizes the differing I/O parameters of each model.

📈 **Performance and cost analysis**: In testing, the solution showed consistent execution patterns and efficient resource utilization across responses of varying lengths, with minimal system overhead. The cost analysis shows that for high request volumes, Express Step Functions workflows are more economical than Standard Step Functions, saving up to 90% and giving Amazon Bedrock deployments a predictable, cost-effective solution.

As organizations embrace generative AI powered by Amazon Bedrock, they face the challenge of managing costs associated with the token-based pricing model. Amazon Bedrock offers a pay-as-you-go pricing structure that can potentially lead to unexpected and excessive bills if usage is not carefully monitored. Traditional methods of cost monitoring, such as budget alerts and cost anomaly detection, can help spot unexpectedly high usage but are reactive in nature. To address costs proactively, it is vital to use both leading and trailing indicators.

Leading indicators are predictive signals that help you anticipate future trends or potential issues before they fully materialize. These indicators provide proactive insights that allow for timely intervention. In contrast, trailing indicators are retrospective measurements that confirm what has already occurred. By understanding and tracking both types of indicators, organizations can develop more strategic and responsive decision-making processes.

In this two-part series, we introduce a comprehensive solution for proactively managing Amazon Bedrock inference costs. Our approach features a cost sentry mechanism designed to establish and enforce token usage limits, providing organizations with a robust framework for controlling generative AI expenses. In this post, we focus on core architecture, cost sentry design, token usage tracking, and initial budget enforcement strategies. In Part 2, we explore advanced monitoring techniques, custom tagging, reporting, and long-term cost optimization best practices. The goal is to deliver a predictable, cost-effective approach to Amazon Bedrock deployments that aligns with organizational financial constraints.

Solution overview

Amazon Bedrock is billed on token usage, with charges based on the input and output tokens consumed. The rate depends on the model used and the AWS Region where inference is performed. Developers must implement robust token management strategies in their applications to help prevent runaway costs, making sure generative AI applications include circuit breakers and consumption limits that align with budgetary constraints.
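For a sense of scale, the following is a minimal Python sketch of per-request cost arithmetic, assuming illustrative rates of $3 per million input tokens and $15 per million output tokens; consult Amazon Bedrock pricing for the actual rates of your model and Region:

# Illustrative rates only; real rates vary by model and Region.
INPUT_RATE_PER_M = 3.00
OUTPUT_RATE_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Cost scales linearly with both input and output token counts.
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# 1,000 requests averaging 2,000 input and 1,000 output tokens each:
print(f"${request_cost(2_000_000, 1_000_000):.2f}")  # $21.00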

To address this, you can configure Amazon CloudWatch alarms or monitor costs with billing alerts and budgets, but these mechanisms look at incurred costs or usage after the fact. Another option is the Generative AI Gateway Solution in the AWS Solutions Library, which uses LiteLLM to enforce budgetary limits for Amazon Bedrock and other model providers.

This solution was developed to provide a proactive, centralized mechanism that limits generative AI usage to a specific, adjustable budget. It uses serverless workflows and native Amazon Bedrock integration, which reduces operational complexity while providing large-scale performance and scaling.

When building applications with Amazon Bedrock, it is common practice to access the service through a developed API, either synchronously through a REST API or asynchronously through a queuing system. The following diagram compares these architectures.

For synchronous interactions, clients make direct REST API calls to Amazon Bedrock, passing in the necessary parameters. In an asynchronous architecture, clients submit inference requests to a queue or message broker, such as Amazon Simple Queue Service (Amazon SQS). A backend processing system, often implemented as a serverless function or a containerized application, continuously monitors the queue and processes incoming requests. This approach decouples the client from the inference processing, enabling scalability and resilience in handling bursts of requests.

This solution is a centralized mechanism that can be used to interact with Amazon Bedrock to serve as a proactive cost sentry. It is designed using a serverless architecture that uses AWS Step Functions to orchestrate a workflow that validates token usage against configured limits before allowing Amazon Bedrock inference requests to proceed. This solution makes sure that generative AI applications stay within predefined budgetary boundaries, providing cost predictability and control.

The following diagram illustrates the architecture we build in this post.

The core components of this solution include:

- A rate limiter Step Functions workflow that validates token usage against configured limits before allowing an inference request to proceed
- An Amazon Bedrock model router Step Functions workflow that invokes the requested model and normalizes its output
- CloudWatch metrics that track input and output token usage for each model
- A DynamoDB table that stores per-model and default token usage limits

The solution follows a serverless architecture approach, using managed AWS services like Step Functions, AWS Lambda, DynamoDB, and CloudWatch to provide a scalable, extensible, and cost-effective implementation.

The goal is to provide a proactive method of setting generative AI usage limits that operates as a leading indicator to limit usage.

Step Functions workflows

In this section, we explore how the solution uses Step Functions to implement rate limiting and model routing workflows.

Rate limiting workflow

The rate limiting workflow is designed to take a minimal JSON document as input with the following format:

{  "modelId": "string",       // e.g. "anthropic.claude-3-sonnet-20240229-v1:0"  "prompt": {    "messages": [      {        "role": "string",    // "system", "user", or "assistant"        "content": "string"      }    ]  }}

This workflow is the core component that enforces budgetary controls. The key steps are as follows:

1. A Lambda function retrieves the start and end dates for the current month, which are used to query token usage metrics for the appropriate time range.
2. The workflow queries CloudWatch to retrieve the current month's token usage metrics for the specified Amazon Bedrock model.
3. The workflow retrieves the configured token usage limit for the specified Amazon Bedrock model from DynamoDB. If no specific limit is found, it falls back to retrieving the default limit.
4. The workflow compares the current token usage against the configured limit to determine whether the budget has been exceeded.
5. If the token usage is within the budget, this step invokes the Amazon Bedrock model router state machine to perform the actual inference request.
6. Depending on the outcome of the budget check, the workflow returns either the formatted inference result or an error indicating that the budget has been exceeded.

The following diagram illustrates the Step Functions workflow.
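To make the first step concrete, here is a minimal sketch of the date-range Lambda function; the handler shape and the returned field names are illustrative assumptions, not the solution's actual code:

import calendar
from datetime import datetime, timezone

def lambda_handler(event, context):
    # Compute the first and last instants of the current month in UTC,
    # which the workflow passes to the CloudWatch metrics query.
    now = datetime.now(timezone.utc)
    _, last_day = calendar.monthrange(now.year, now.month)
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    end = now.replace(day=last_day, hour=23, minute=59, second=59, microsecond=0)
    return {
        "startTime": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "endTime": end.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }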

Amazon Bedrock model router workflow

The Amazon Bedrock model router workflow is a separate Step Functions state machine responsible for invoking the appropriate Amazon Bedrock model based on the request parameters. It abstracts the complexity of handling different I/O formats required by various Amazon Bedrock models and combines the result into a standardized format.

The key steps in the workflow include:

1. Based on the provided model ID, the workflow determines the specific Amazon Bedrock model to be invoked.
2. The workflow calls the appropriate Amazon Bedrock model with the required input parameters.
3. The workflow normalizes the output from the Amazon Bedrock model to a consistent format for further processing or returning to the client.
4. The workflow returns the transformed inference result to the calling workflow (the budget sentry workflow).

The following diagram illustrates the Step Functions workflow.

You can implement additional steps to handle error conditions and format the output appropriately. In this example, the Anthropic flow includes error processing.
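As an illustration of the routing and normalization logic, the following Python sketch mirrors what the state machine's Choice and Task states do for the Anthropic model family; the function name and the normalized output shape are assumptions for this example:

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def route_and_invoke(model_id: str, messages: list[dict]) -> dict:
    # Route on the model family, as the state machine's Choice state does.
    if model_id.startswith("anthropic."):
        # Anthropic models take system prompts in a dedicated field.
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [m for m in messages if m["role"] != "system"],
        }
        if system:
            body["system"] = system
        response = bedrock.invoke_model(
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(body),
        )
        payload = json.loads(response["body"].read())
        # Normalize to a standardized shape for the calling workflow.
        return {"modelId": model_id, "completion": payload["content"][0]["text"]}
    # Additional model families would be handled with their own branches.
    raise ValueError(f"No route configured for model: {model_id}")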

Token usage tracking with CloudWatch metrics

The Amazon Bedrock cost sentry uses the CloudWatch integration with Amazon Bedrock to retrieve current token usage metrics. These metrics are used to enforce budgetary limits proactively. For example, see the following query:

{    "sparkline": false,    "metrics": [        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"InputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ],        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"OutputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ]    ],    "legend": {        "position": "right"    },    "title": "InputTokenCount, OutputTokenCount",    "region": "us-east-1",    "liveData": true,    "view": "gauge",    "stacked": false,    "period": 2592000,    "table": {        "summaryColumns": [            "SUM"        ]    },    "yAxis": {        "left": {            "min": 0,            "max": 1000000        }    },    "setPeriodToTimeRange": true,    "trend": false,    "startTime": "2024-05-01T00:00:00Z",    "endTime": "2024-05-30T23:59:59Z"}

This CloudWatch metric query retrieves the total input and output token counts for a specified time range, allowing the rate limiter workflow to accurately enforce budgets based on real-time usage data.
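Programmatically, a workflow step can retrieve the same totals through the CloudWatch GetMetricData API. The following is a sketch under stated assumptions: the metric math IDs and the SUM(SEARCH(...)) wrapping are illustrative, not taken from the solution's code:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def month_to_date_token_usage(start: datetime, end: datetime) -> dict:
    # SUM(SEARCH(...)) collapses the per-model series into a single total.
    queries = [
        {
            "Id": q_id,
            "Expression": f"SUM(SEARCH('{{AWS/Bedrock}} MetricName=\"{metric}\"', 'Sum', 2592000))",
            "Period": 2592000,
        }
        for q_id, metric in [
            ("inputTokens", "InputTokenCount"),
            ("outputTokens", "OutputTokenCount"),
        ]
    ]
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end
    )
    return {r["Id"]: sum(r["Values"]) for r in resp["MetricDataResults"]}

usage = month_to_date_token_usage(
    datetime(2024, 5, 1, tzinfo=timezone.utc),
    datetime(2024, 5, 30, 23, 59, 59, tzinfo=timezone.utc),
)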

Budget configuration with DynamoDB

The Amazon Bedrock cost sentry stores token usage limits in a DynamoDB table, providing seamless configuration and updates to individual model budgets or the default limit. For example, see the following code:

{    "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",    "limit": {        "input": 1000000,        "output": 3000000    }}

In this example, the token usage limit for the specified Amazon Bedrock model (anthropic.claude-3-sonnet-20240229-v1:0) is set to 1,000,000 input tokens and 3,000,000 output tokens.

Administrators can quickly update these limits by modifying the corresponding DynamoDB records, providing flexibility in adjusting budgets as needed.
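A minimal sketch of that lookup with its default fallback follows, assuming a hypothetical table named bedrock-token-limits keyed on modelId with a reserved "default" record; both names are illustrative:

import boto3

# Table name and key schema are assumptions for illustration.
table = boto3.resource("dynamodb").Table("bedrock-token-limits")

def get_token_limit(model_id: str) -> dict:
    # Try the model-specific limit first, then fall back to the default record.
    item = table.get_item(Key={"modelId": model_id}).get("Item")
    if item is None:
        item = table.get_item(Key={"modelId": "default"}).get("Item")
    return item["limit"]  # e.g., {"input": Decimal("1000000"), "output": Decimal("3000000")}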

Performance analysis of the rate limiter workflow

To assess the performance impact of introducing the workflow, we used an array of inference requests. Test cases included various prompts designed to generate responses ranging from concise answers to detailed explanations over 500 words, effectively testing the workflow’s performance across different output token sizes. The workflow demonstrated exceptional performance characteristics across 501 successful executions, handling a diverse set of inference requests from brief responses to extensive content generation.

The workflow maintains consistent execution patterns while processing requests ranging from 6.76 seconds to 32.24 seconds in total duration, with the variation primarily reflecting the different output token requirements of each request.

The following diagram illustrates our time distribution findings.

The time distribution analysis reveals highly optimized resource utilization, with nearly all execution time spent on the Amazon Bedrock inference itself rather than on orchestration.

This performance profile aligns with best practices for workflow orchestration, where minimizing overhead and maintaining consistent execution patterns are crucial for reliability. The workflow’s efficiency is evidenced by its remarkably low system overhead of just 0.09%, demonstrating effective use of the built-in controls and state management capabilities of Step Functions regardless of the response size being generated.

The execution consistency is particularly noteworthy, with a predictable event pattern of 47–49 events per execution, regardless of the inference request complexity or output size. This predictability is essential for workload management and resource planning, especially when handling varied request complexities and token outputs.

These metrics indicate a well-architected workflow that effectively uses Step Functions Express workflow capabilities for high-volume event processing while maintaining minimal overhead and consistent performance characteristics across both simple queries and complex, token-intensive inference requests.

Cost analysis

To analyze the cost implications, estimates were generated using the AWS Pricing Calculator for both Standard and Express Step Functions workflows, assuming 100,000 requests per month. The following table summarizes these estimates.

| Region | Description | Service | Upfront | Monthly | First 12 Months Total | Currency | Configuration Summary |
| --- | --- | --- | --- | --- | --- | --- | --- |
| US East (Ohio) | Step Functions Standard | Step Functions – Standard Workflows | $0 | $37.40 | $448.80 | USD | Workflow requests: 100,000 per month; state transitions per workflow: 15 |
| US East (Ohio) | Step Functions Express | Step Functions – Express Workflows | $0 | $3.75 | $45.00 | USD | Duration of each workflow: 35,000 ms; memory consumed by each workflow: 64 MB; workflow requests: 100,000 per month |

The cost analysis revealed that the Step Functions Express workflow offers a more cost-effective solution than the Standard workflow, with potential cost savings of up to 90% for the same workload. Standard workflow costs could be reduced further if the number of steps is optimized; for example, a few formatting pass-through steps could potentially be removed, although these steps prepare the input for downstream states.
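As a back-of-the-envelope check of these estimates, the following sketch assumes the published US East (Ohio) rates at the time of writing (Standard: $0.000025 per state transition after a 4,000-transition monthly free tier; Express: $1.00 per million requests plus $0.00001667 per GB-second of duration); verify current rates before relying on them:

requests = 100_000

# Standard workflows: billed per state transition (15 per execution here),
# after the assumed 4,000-transition monthly free tier.
standard_monthly = (requests * 15 - 4_000) * 0.000025  # ≈ $37.40

# Express workflows: billed per request plus duration at 64 MB for 35 s.
gb_seconds = requests * 35 * (64 / 1024)  # 218,750 GB-seconds
express_monthly = requests * 1.00 / 1_000_000 + gb_seconds * 0.00001667  # ≈ $3.75

print(f"Standard: ${standard_monthly:.2f}/month, Express: ${express_monthly:.2f}/month")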

Consult the AWS Pricing Calculator for more details on pricing and to run your own scenario.

Conclusion

In this solution, we used Step Functions to build a system that serves as a leading indicator because it tracks rate limiting and token usage, warning us immediately when we’re approaching our usage limits. In Part 2, we discuss combining this with trailing indicators to stay aware of usage and costs.


About the author

Jason Salcido is a Startups Senior Solutions Architect with nearly 30 years of experience pioneering innovative solutions for organizations from startups to enterprises. His expertise spans cloud architecture, serverless computing, machine learning, generative AI, and distributed systems. Jason combines deep technical knowledge with a forward-thinking approach to design scalable solutions that drive value, while translating complex concepts into actionable strategies.
