AWS Machine Learning Blog — October 23, 03:03
Amazon Bedrock Cost Control: Proactively Managing Token Usage with a Sentinel Mechanism

This article presents a solution for proactively managing token usage costs for the Amazon Bedrock generative AI service. It stresses the importance of distinguishing leading from trailing indicators and details a cost control mechanism named "Sentinel." Built on AWS Step Functions, the mechanism prevents unnecessary spend by validating that token usage has not exceeded preset limits. The article examines the core components, including the rate limiter workflow, the model routing workflow, CloudWatch metric tracking, and DynamoDB configuration, and analyzes performance and cost-effectiveness, finding in particular that Express Step Functions have a cost advantage over Standard Step Functions. The solution aims to give Amazon Bedrock deployments predictable, cost-effective cost control.

💡 **The importance of proactive cost control**: The article notes that the token-based billing model of Amazon Bedrock can lead to unexpectedly high bills. Traditional cost monitoring methods, such as budget alerts, are reactive. Effective cost management requires combining leading indicators (predictive signals) with trailing indicators (retrospective measurements) to enable proactive intervention and strategic decision-making.

⚙️ **The Sentinel cost sentry mechanism**: At the core of the solution is a cost sentry mechanism named "Sentinel." It uses a serverless architecture in which AWS Step Functions orchestrates a workflow that validates token usage against configured limits before allowing Amazon Bedrock inference requests to proceed, providing a centralized, scalable framework for controlling generative AI spend.

📊 **Core components and workflows**: The solution includes a rate limiter workflow that retrieves token usage metrics from CloudWatch, compares them against preset limits stored in DynamoDB, and decides whether the request may proceed. A separate Amazon Bedrock model routing workflow invokes the actual Bedrock model and abstracts and normalizes the differing I/O parameters of each model.

📈 **Performance and cost analysis**: In testing, the solution showed consistent execution patterns and efficient resource utilization across responses of varying lengths, with minimal system overhead. The cost analysis shows that for high request volumes, Express Step Functions workflows are more economical than Standard Step Functions, saving up to 90% and giving Amazon Bedrock deployments a predictable, cost-effective solution.

As organizations embrace generative AI powered by Amazon Bedrock, they face the challenge of managing costs associated with the token-based pricing model. Amazon Bedrock offers a pay-as-you-go pricing structure that can potentially lead to unexpected and excessive bills if usage is not carefully monitored. Traditional methods of cost monitoring, such as budget alerts and cost anomaly detection, can help spot unexpectedly high usage but are reactive in nature. To address costs proactively, it is vital to use both leading and trailing indicators.

Leading indicators are predictive signals that help you anticipate future trends or potential issues before they fully materialize. These indicators provide proactive insights that allow for timely intervention. In contrast, trailing indicators are retrospective measurements that confirm what has already occurred. By understanding and tracking both types of indicators, organizations can develop more strategic and responsive decision-making processes.

In this two-part series, we introduce a comprehensive solution for proactively managing Amazon Bedrock inference costs. Our approach features a cost sentry mechanism designed to establish and enforce token usage limits, providing organizations with a robust framework for controlling generative AI expenses. In this post, we focus on core architecture, cost sentry design, token usage tracking, and initial budget enforcement strategies. In Part 2, we explore advanced monitoring techniques, custom tagging, reporting, and long-term cost optimization best practices. The goal is to deliver a predictable, cost-effective approach to Amazon Bedrock deployments that aligns with organizational financial constraints.

Solution overview

Amazon Bedrock is billed on token usage, with charges based on the input and output tokens consumed. The rate depends on the model used and the AWS Region where inference is performed. Developers must implement robust token management strategies in their applications to help prevent runaway costs, making sure generative AI applications include circuit breakers and consumption limits that align with budgetary constraints.
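For a sense of scale, the following is a minimal Python sketch of per-request cost arithmetic, assuming illustrative rates of $3 per million input tokens and $15 per million output tokens; consult Amazon Bedrock pricing for the actual rates of your model and Region:

# Illustrative rates only; real rates vary by model and Region.
INPUT_RATE_PER_M = 3.00
OUTPUT_RATE_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Cost scales linearly with both input and output token counts.
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# 1,000 requests averaging 2,000 input and 1,000 output tokens each:
print(f"${request_cost(2_000_000, 1_000_000):.2f}")  # $21.00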

To address this, you can configure Amazon CloudWatch alarms or monitor costs with billing alerts and budgets, but these mechanisms look at incurred costs or usage after the fact. Another option is the Generative AI Gateway Solution in the AWS Solutions Library, which uses LiteLLM to enforce budgetary limits for Amazon Bedrock and other model providers.

This solution was developed to provide a proactive, centralized mechanism that limits generative AI usage to a specific, adjustable budget. It uses serverless workflows and native Amazon Bedrock integration, which reduces operational complexity while providing large-scale performance and scaling.

When building applications with Amazon Bedrock, it is common practice to access the service through a developed API, either synchronously through a REST API or asynchronously through a queuing system. The following diagram compares these architectures.

For synchronous interactions, clients make direct REST API calls to Amazon Bedrock, passing in the necessary parameters. In an asynchronous architecture, clients submit inference requests to a queue or message broker, such as Amazon Simple Queue Service (Amazon SQS). A backend processing system, often implemented as a serverless function or a containerized application, continuously monitors the queue and processes incoming requests. This approach decouples the client from the inference processing, enabling scalability and resilience in handling bursts of requests.

This solution is a centralized mechanism that can be used to interact with Amazon Bedrock to serve as a proactive cost sentry. It is designed using a serverless architecture that uses AWS Step Functions to orchestrate a workflow that validates token usage against configured limits before allowing Amazon Bedrock inference requests to proceed. This solution makes sure that generative AI applications stay within predefined budgetary boundaries, providing cost predictability and control.

The following diagram illustrates the architecture we build in this post.

The core components of this solution include:

- A rate limiter Step Functions workflow that validates token usage against configured limits before allowing an inference request to proceed
- An Amazon Bedrock model router Step Functions workflow that invokes the requested model and normalizes its output
- CloudWatch metrics that track input and output token usage for each model
- A DynamoDB table that stores per-model and default token usage limits

The solution follows a serverless architecture approach, using managed AWS services like Step Functions, AWS Lambda, DynamoDB, and CloudWatch to provide a scalable, extensible, and cost-effective implementation.

The goal is to provide a proactive method of setting generative AI usage limits that operates as a leading indicator to limit usage.

Step Functions workflows

In this section, we explore how the solution uses Step Functions to implement rate limiting and model routing workflows.

Rate limiting workflow

The rate limiting workflow is designed to take a minimal JSON document as input with the following format:

{  "modelId": "string",       // e.g. "anthropic.claude-3-sonnet-20240229-v1:0"  "prompt": {    "messages": [      {        "role": "string",    // "system", "user", or "assistant"        "content": "string"      }    ]  }}

This workflow is the core component that enforces budgetary controls. The key steps are as follows:

1. A Lambda function retrieves the start and end dates for the current month, which are used to query token usage metrics for the appropriate time range.
2. The workflow queries CloudWatch to retrieve the current month's token usage metrics for the specified Amazon Bedrock model.
3. The workflow retrieves the configured token usage limit for the specified Amazon Bedrock model from DynamoDB. If no specific limit is found, it falls back to retrieving the default limit.
4. The workflow compares the current token usage against the configured limit to determine whether the budget has been exceeded.
5. If the token usage is within the budget, this step invokes the Amazon Bedrock model router state machine to perform the actual inference request.
6. Depending on the outcome of the budget check, the workflow returns either the formatted inference result or an error indicating that the budget has been exceeded.

The following diagram illustrates the Step Functions workflow.
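To make the first step concrete, here is a minimal sketch of the date-range Lambda function; the handler shape and the returned field names are illustrative assumptions, not the solution's actual code:

import calendar
from datetime import datetime, timezone

def lambda_handler(event, context):
    # Compute the first and last instants of the current month in UTC,
    # which the workflow passes to the CloudWatch metrics query.
    now = datetime.now(timezone.utc)
    _, last_day = calendar.monthrange(now.year, now.month)
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    end = now.replace(day=last_day, hour=23, minute=59, second=59, microsecond=0)
    return {
        "startTime": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "endTime": end.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }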

Amazon Bedrock model router workflow

The Amazon Bedrock model router workflow is a separate Step Functions state machine responsible for invoking the appropriate Amazon Bedrock model based on the request parameters. It abstracts the complexity of handling different I/O formats required by various Amazon Bedrock models and combines the result into a standardized format.

The key steps in the workflow include:

1. Based on the provided model ID, the workflow determines the specific Amazon Bedrock model to be invoked.
2. The workflow calls the appropriate Amazon Bedrock model with the required input parameters.
3. The workflow normalizes the output from the Amazon Bedrock model to a consistent format for further processing or returning to the client.
4. The workflow returns the transformed inference result to the calling workflow (the budget sentry workflow).

The following diagram illustrates the Step Functions workflow.

You can implement additional steps to handle error conditions and format the output appropriately. In this example, the Anthropic flow includes error processing.
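As an illustration of the routing and normalization logic, the following Python sketch mirrors what the state machine's Choice and Task states do for the Anthropic model family; the function name and the normalized output shape are assumptions for this example:

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def route_and_invoke(model_id: str, messages: list[dict]) -> dict:
    # Route on the model family, as the state machine's Choice state does.
    if model_id.startswith("anthropic."):
        # Anthropic models take system prompts in a dedicated field.
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [m for m in messages if m["role"] != "system"],
        }
        if system:
            body["system"] = system
        response = bedrock.invoke_model(
            modelId=model_id,
            contentType="application/json",
            accept="application/json",
            body=json.dumps(body),
        )
        payload = json.loads(response["body"].read())
        # Normalize to a standardized shape for the calling workflow.
        return {"modelId": model_id, "completion": payload["content"][0]["text"]}
    # Additional model families would be handled with their own branches.
    raise ValueError(f"No route configured for model: {model_id}")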

Token usage tracking with CloudWatch metrics

The Amazon Bedrock cost sentry uses the CloudWatch integration with Amazon Bedrock to retrieve current token usage metrics. These metrics are used to enforce budgetary limits proactively. For example, see the following query:

{    "sparkline": false,    "metrics": [        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"InputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ],        [ { "expression": "SEARCH('{AWS/Bedrock} MetricName=\"OutputTokenCount\"', 'Sum', 60)", "region": "us-east-1" } ]    ],    "legend": {        "position": "right"    },    "title": "InputTokenCount, OutputTokenCount",    "region": "us-east-1",    "liveData": true,    "view": "gauge",    "stacked": false,    "period": 2592000,    "table": {        "summaryColumns": [            "SUM"        ]    },    "yAxis": {        "left": {            "min": 0,            "max": 1000000        }    },    "setPeriodToTimeRange": true,    "trend": false,    "startTime": "2024-05-01T00:00:00Z",    "endTime": "2024-05-30T23:59:59Z"}

This CloudWatch metric query retrieves the total input and output token counts for a specified time range, allowing the rate limiter workflow to accurately enforce budgets based on real-time usage data.
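Programmatically, a workflow step can retrieve the same totals through the CloudWatch GetMetricData API. The following is a sketch under stated assumptions: the metric math IDs and the SUM(SEARCH(...)) wrapping are illustrative, not taken from the solution's code:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def month_to_date_token_usage(start: datetime, end: datetime) -> dict:
    # SUM(SEARCH(...)) collapses the per-model series into a single total.
    queries = [
        {
            "Id": q_id,
            "Expression": f"SUM(SEARCH('{{AWS/Bedrock}} MetricName=\"{metric}\"', 'Sum', 2592000))",
            "Period": 2592000,
        }
        for q_id, metric in [
            ("inputTokens", "InputTokenCount"),
            ("outputTokens", "OutputTokenCount"),
        ]
    ]
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end
    )
    return {r["Id"]: sum(r["Values"]) for r in resp["MetricDataResults"]}

usage = month_to_date_token_usage(
    datetime(2024, 5, 1, tzinfo=timezone.utc),
    datetime(2024, 5, 30, 23, 59, 59, tzinfo=timezone.utc),
)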

Budget configuration with DynamoDB

The Amazon Bedrock cost sentry stores token usage limits in a DynamoDB table, providing seamless configuration and updates to individual model budgets or the default limit. For example, see the following code:

{    "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",    "limit": {        "input": 1000000,        "output": 3000000    }}

In this example, the token usage limit for the specified Amazon Bedrock model (anthropic.claude-3-sonnet-20240229-v1:0) is set to 1,000,000 input tokens and 3,000,000 output tokens.

Administrators can quickly update these limits by modifying the corresponding DynamoDB records, providing flexibility in adjusting budgets as needed.
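A minimal sketch of that lookup with its default fallback follows, assuming a hypothetical table named bedrock-token-limits keyed on modelId with a reserved "default" record; both names are illustrative:

import boto3

# Table name and key schema are assumptions for illustration.
table = boto3.resource("dynamodb").Table("bedrock-token-limits")

def get_token_limit(model_id: str) -> dict:
    # Try the model-specific limit first, then fall back to the default record.
    item = table.get_item(Key={"modelId": model_id}).get("Item")
    if item is None:
        item = table.get_item(Key={"modelId": "default"}).get("Item")
    return item["limit"]  # e.g., {"input": Decimal("1000000"), "output": Decimal("3000000")}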

Performance analysis of the rate limiter workflow

To assess the performance impact of introducing the workflow, we used an array of inference requests. Test cases included various prompts designed to generate responses ranging from concise answers to detailed explanations over 500 words, effectively testing the workflow’s performance across different output token sizes. The workflow demonstrated exceptional performance characteristics across 501 successful executions, handling a diverse set of inference requests from brief responses to extensive content generation.

The workflow maintains consistent execution patterns while processing requests ranging from 6.76 seconds to 32.24 seconds in total duration, with the variation primarily reflecting the different output token requirements of each request.

The following diagram illustrates our time distribution findings.

The time distribution analysis reveals highly optimized resource utilization, with nearly all execution time spent on the Amazon Bedrock inference itself rather than on orchestration.

This performance profile aligns with best practices for workflow orchestration, where minimizing overhead and maintaining consistent execution patterns are crucial for reliability. The workflow’s efficiency is evidenced by its remarkably low system overhead of just 0.09%, demonstrating effective use of the built-in controls and state management capabilities of Step Functions regardless of the response size being generated.

The execution consistency is particularly noteworthy, with a predictable event pattern of 47–49 events per execution, regardless of the inference request complexity or output size. This predictability is essential for workload management and resource planning, especially when handling varied request complexities and token outputs.

These metrics indicate a well-architected workflow that effectively uses Step Functions Express workflow capabilities for high-volume event processing while maintaining minimal overhead and consistent performance characteristics across both simple queries and complex, token-intensive inference requests.

Cost analysis

To analyze the cost implications, estimates were generated using the AWS Pricing Calculator for both Standard and Express Step Functions workflows, assuming 100,000 requests per month. The following table summarizes these estimates.

| Region | Description | Service | Upfront | Monthly | First 12 Months Total | Currency | Configuration Summary |
| --- | --- | --- | --- | --- | --- | --- | --- |
| US East (Ohio) | Step Functions Standard | Step Functions – Standard Workflows | $0 | $37.40 | $448.80 | USD | Workflow requests: 100,000 per month; state transitions per workflow: 15 |
| US East (Ohio) | Step Functions Express | Step Functions – Express Workflows | $0 | $3.75 | $45.00 | USD | Duration of each workflow: 35,000 ms; memory consumed by each workflow: 64 MB; workflow requests: 100,000 per month |

The cost analysis revealed that the Step Functions Express workflow offers a more cost-effective solution than the Standard workflow, with potential cost savings of up to 90% for the same workload. Standard workflow costs could be reduced further if the number of steps is optimized; for example, a few formatting pass-through steps could potentially be removed, although these steps prepare the input for downstream states.
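As a back-of-the-envelope check of these estimates, the following sketch assumes the published US East (Ohio) rates at the time of writing (Standard: $0.000025 per state transition after a 4,000-transition monthly free tier; Express: $1.00 per million requests plus $0.00001667 per GB-second of duration); verify current rates before relying on them:

requests = 100_000

# Standard workflows: billed per state transition (15 per execution here),
# after the assumed 4,000-transition monthly free tier.
standard_monthly = (requests * 15 - 4_000) * 0.000025  # ≈ $37.40

# Express workflows: billed per request plus duration at 64 MB for 35 s.
gb_seconds = requests * 35 * (64 / 1024)  # 218,750 GB-seconds
express_monthly = requests * 1.00 / 1_000_000 + gb_seconds * 0.00001667  # ≈ $3.75

print(f"Standard: ${standard_monthly:.2f}/month, Express: ${express_monthly:.2f}/month")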

Consult the AWS Pricing Calculator for more details on pricing and to run your own scenario.

Conclusion

In this solution, we used Step Functions to build a system that serves as a leading indicator because it tracks rate limiting and token usage, warning us immediately when we’re approaching our usage limits. In Part 2, we discuss combining this with trailing indicators to stay aware of usage and costs.


About the author

Jason Salcido is a Startups Senior Solutions Architect with nearly 30 years of experience pioneering innovative solutions for organizations from startups to enterprises. His expertise spans cloud architecture, serverless computing, machine learning, generative AI, and distributed systems. Jason combines deep technical knowledge with a forward-thinking approach to design scalable solutions that drive value, while translating complex concepts into actionable strategies.
