AWS Machine Learning Blog 09月27日 00:11
利用生成式AI赋能站点可靠性工程师
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了如何构建一个多智能体(multi-agent)的SRE助手,以应对现代分布式系统中站点可靠性工程师(SRE)面临的复杂挑战。通过集成Amazon Bedrock AgentCore、LangGraph和模型上下文协议(MCP),该系统能够让SRE用自然语言查询基础设施问题,并获得结合了日志、指标、Kubernetes事件和操作手册的综合性、可操作性见解。文章详细阐述了该解决方案的架构、关键组件(如AgentCore Gateway、Memory和Runtime)以及如何将现有API转换为MCP工具,从而实现高效的事件响应和基础设施管理。

💡 **生成式AI助力SRE应对复杂系统挑战:** 现代分布式系统对SRE提出了严峻要求,需要在生产事件中快速整合日志、指标、Kubernetes事件和操作手册等信息来定位问题。传统工具缺乏跨系统信息整合能力,导致SRE需手动分析。生成式AI解决方案允许SRE用自然语言提问,并获得结合多源数据的综合性、可操作性见解,将事件响应从手动耗时转变为高效协作。

🚀 **多智能体架构实现深度上下文智能:** 该解决方案构建了一个由Supervisor Agent协调的四个专业AI智能体(Kubernetes、日志、指标、操作手册)的系统。这些智能体协同工作,利用Amazon Bedrock AgentCore等技术,访问实时基础设施数据,进行自然语言查询,并提供自动化的事件响应和基础设施管理支持。这为SRE团队提供了解决复杂问题的深度上下文智能。

🛠️ **Amazon Bedrock AgentCore赋能标准化与持久化智能:** Amazon Bedrock AgentCore在整个解决方案中扮演核心角色。AgentCore Gateway将现有API转换为模型上下文协议(MCP)工具,实现标准化接口;AgentCore Runtime提供无服务器执行环境,支持自动扩展;AgentCore Memory则通过用户偏好、基础设施知识和调查记忆策略,使SRE代理具备个性化和跨会话学习能力,实现持久化的智能分析。

🔍 **增强可观测性与开发到生产的流程:** 通过Amazon Bedrock AgentCore Observability,系统能够自动捕获LLM调用指标、工具执行跟踪、内存操作和端到端请求跟踪,并集成到Amazon CloudWatch,提供全面的监控能力。文章还概述了从本地开发到生产部署的结构化四步流程,确保在不同环境中保持一致性,并简化了API到MCP工具的转换过程。

Site reliability engineers (SREs) face an increasingly complex challenge in modern distributed systems. During production incidents, they must rapidly correlate data from multiple sources—logs, metrics, Kubernetes events, and operational runbooks—to identify root causes and implement solutions. Traditional monitoring tools provide raw data but lack the intelligence to synthesize information across these diverse systems, often leaving SREs to manually piece together the story behind system failures.

With a generative AI solution, SREs can ask their infrastructure questions in natural language. For example, they can ask “Why are the payment-service pods crash looping?” or “What’s causing the API latency spike?” and receive comprehensive, actionable insights that combine infrastructure status, log analysis, performance metrics, and step-by-step remediation procedures. This capability transforms incident response from a manual, time-intensive process into a time-efficient, collaborative investigation.

In this post, we demonstrate how to build a multi-agent SRE assistant using Amazon Bedrock AgentCore, LangGraph, and the Model Context Protocol (MCP). This system deploys specialized AI agents that collaborate to provide the deep, contextual intelligence that modern SRE teams need for effective incident response and infrastructure management. We walk you through the complete implementation, from setting up the demo environment to deploying on Amazon Bedrock AgentCore Runtime for production use.

Solution overview

This solution uses a comprehensive multi-agent architecture that addresses the challenges of modern SRE operations through intelligent automation. The solution consists of four specialized AI agents working together under a supervisor agent to provide comprehensive infrastructure analysis and incident response assistance.

The examples in this post use synthetically generated data from our demo environment. The backend servers simulate realistic Kubernetes clusters, application logs, performance metrics, and operational runbooks. In production deployments, these stub servers would be replaced with connections to your actual infrastructure systems, monitoring services, and documentation repositories.

The architecture demonstrates several key capabilities:

The following diagram illustrates the solution architecture.

The architecture demonstrates how the SRE support agent integrates seamlessly with Amazon Bedrock AgentCore components:

The multi-agent solution uses a supervisor-agent pattern where a central orchestrator coordinates five specialized agents:

Using Amazon Bedrock AgentCore primitives

The solution showcases the power of Amazon Bedrock AgentCore by using multiple core primitives. The solution supports two providers for Anthropic’s LLMs. Amazon Bedrock supports Anthropic’s Claude 3.7 Sonnet for AWS integrated deployments, and Anthropic API supports Anthropic’s Claude 4 Sonnet for direct API access.

The Amazon Bedrock AgentCore Gateway component converts the SRE agent’s backend APIs (Kubernetes, application logs, performance metrics, and operational runbooks) into Model Context Protocol (MCP) tools. This enables agents built with an open-source framework supporting MCP (such as LangGraph in this post) to seamlessly access infrastructure APIs.

Security for the entire solution is provided by Amazon Bedrock AgentCore Identity. It supports ingress authentication for secure access control for agents connecting to the gateway, and egress authentication to manage authentication with backend servers, providing secure API access without hardcoding credentials.

The serverless execution environment for deploying the SRE agent in production is provided by Amazon Bedrock AgentCore Runtime. It automatically scales from zero to handle concurrent incident investigations while maintaining complete session isolation. Amazon Bedrock AgentCore Runtime supports both OAuth and AWS Identity and Access Management (IAM) for agent authentication. Applications that invoke agents must have appropriate IAM permissions and trust policies. For more information, see Identity and access management for Amazon Bedrock AgentCore.

Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent learning assistant that personalizes investigations based on user preferences and historical context. The memory component provides three distinct strategies:

The memory component demonstrates its value through personalized investigations. When both Alice and Carol investigate “API response times have degraded 3x in the last hour,” they receive identical technical findings but completely different presentations.

Alice receives a technical analysis:

memory_client.retrieve_user_preferences(user_id="Alice")# Returns: {"investigation_style": "detailed_systematic_analysis", "reports": "technical_exposition_with_troubleshooting_steps"}

Carol receives an executive summary:

memory_client.retrieve_user_preferences(user_id="Carol") # Returns: {"investigation_style": "business_impact_focused","reports": "executive_summary_without_technical_details"}

Adding observability to the SRE agent

Adding observability to an SRE agent deployed on Amazon Bedrock AgentCore Runtime is straightforward using the Amazon Bedrock AgentCore Observability primitive. This enables comprehensive monitoring through Amazon CloudWatch with metrics, traces, and logs. Setting up observability requires three steps:

    Add the OpenTelemetry packages to your pyproject.toml:
    dependencies = [    # ... other dependencies ...    "opentelemetry-instrumentation-langchain",    "aws-opentelemetry-distro~=0.10.1",  ]
    Configure observability for your agents to enable metrics in CloudWatch. Start your container using the opentelemetry-instrument utility to automatically instrument your application.

The following command is added to the Dockerfile for the SRE agent:

# Run application with OpenTelemetry instrumentation CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

As shown in the following screenshot, with observability enabled, you gain visibility into the following:

The observability primitive automatically captures these metrics without additional code changes, providing production-grade monitoring capabilities out of the box.

Development to production flow

The SRE agent follows a four-step structured deployment process from local development to production, with detailed procedures documented in Development to Production Flow in the accompanying GitHub repo:

The deployment process maintains consistency across environments: the core agent code (sre_agent/) remains unchanged, and the deployment/ folder contains deployment-specific utilities. The same agent works locally and in production through environment configuration, with Amazon Bedrock AgentCore Gateway providing MCP tools access across different stages of development and deployment.

Implementation walkthrough

In the following section, we focus on how Amazon Bedrock AgentCore Gateway, Memory, and Runtime work together to build this multi-agent collaboration solution and deploy it end-to-end with MCP support and persistent intelligence.

We start by setting up the repository and establishing the local runtime environment with API keys, LLM providers, and demo infrastructure. We then bring core AgentCore components online by creating the gateway for standardized API access, configuring authentication, and establishing tool connectivity. We add intelligence through AgentCore Memory, creating strategies for user preferences and investigation history while loading personas for personalized incident response. Finally, we configure individual agents with specialized tools, integrate memory capabilities, orchestrate collaborative workflows, and deploy to AgentCore Runtime with full observability.

Detailed instructions for each step are provided in the repository:

Prerequisites

You can find the port forwarding requirements and other setup instructions in the README file’s Prerequisites section.

Convert APIs to MCP tools with Amazon Bedrock AgentCore Gateway

Amazon Bedrock AgentCore Gateway demonstrates the power of protocol standardization by converting existing backend APIs into MCP tools that agent frameworks can consume. This transformation happens seamlessly, requiring only OpenAPI specifications.

Upload OpenAPI specifications

The gateway process begins by uploading your existing API specifications to Amazon Simple Storage Service (Amazon S3). The create_gateway.sh script automatically handles uploading the four API specifications (Kubernetes, Logs, Metrics, and Runbooks) to your configured S3 bucket with proper metadata and content types. These specifications will be used to create API endpoint targets in the gateway.

Create an identity provider and gateway

Authentication is handled seamlessly through Amazon Bedrock AgentCore Identity. The main.py script creates both the credential provider and gateway:

# Create AgentCore Gateway with JWT authorizationdef create_gateway(    client: Any,    gateway_name: str,    role_arn: str,    discovery_url: str,    allowed_clients: list = None,    description: str = "AgentCore Gateway created via SDK",    search_type: str = "SEMANTIC",    protocol_version: str = "2025-03-26",) -> Dict[str, Any]:        # Build auth config for Cognito    auth_config = {"customJWTAuthorizer": {"discoveryUrl": discovery_url}}    if allowed_clients:        auth_config["customJWTAuthorizer"]["allowedClients"] = allowed_clients        protocol_configuration = {        "mcp": {"searchType": search_type, "supportedVersions": [protocol_version]}    }    response = client.create_gateway(        name=gateway_name,        roleArn=role_arn,        protocolType="MCP",        authorizerType="CUSTOM_JWT",        authorizerConfiguration=auth_config,        protocolConfiguration=protocol_configuration,        description=description,        exceptionLevel='DEBUG'    )    return response

Deploy API endpoint targets with credential providers

Each API becomes an MCP target through the gateway. The solution automatically handles credential management:

def create_api_endpoint_target(    client: Any,    gateway_id: str,    s3_uri: str,    provider_arn: str,    target_name_prefix: str = "open",    description: str = "API Endpoint Target for OpenAPI schema",) -> Dict[str, Any]:        api_target_config = {"mcp": {"openApiSchema": {"s3": {"uri": s3_uri}}}}    # API key credential provider configuration    credential_config = {        "credentialProviderType": "API_KEY",        "credentialProvider": {            "apiKeyCredentialProvider": {                "providerArn": provider_arn,                "credentialLocation": "HEADER",                "credentialParameterName": "X-API-KEY",            }        },    }        response = client.create_gateway_target(        gatewayIdentifier=gateway_id,        name=target_name_prefix,        description=description,        targetConfiguration=api_target_config,        credentialProviderConfigurations=[credential_config],    )    return response

Validate MCP tools are ready for agent framework

Post-deployment, Amazon Bedrock AgentCore Gateway provides a standardized /mcp endpoint secured with JWT tokens. Testing the deployment with mcp_cmds.sh reveals the power of this transformation:

Tool summary:================Total tools found: 21Tool names:• x_amz_bedrock_agentcore_search• k8s-api___get_cluster_events• k8s-api___get_deployment_status• k8s-api___get_node_status• k8s-api___get_pod_status• k8s-api___get_resource_usage• logs-api___analyze_log_patterns• logs-api___count_log_events• logs-api___get_error_logs• logs-api___get_recent_logs• logs-api___search_logs• metrics-api___analyze_trends• metrics-api___get_availability_metrics• metrics-api___get_error_rates• metrics-api___get_performance_metrics• metrics-api___get_resource_metrics• runbooks-api___get_common_resolutions• runbooks-api___get_escalation_procedures• runbooks-api___get_incident_playbook• runbooks-api___get_troubleshooting_guide• runbooks-api___search_runbooks

Universal agent framework compatibility

This MCP-standardized gateway can now be configured as a Streamable-HTTP server for MCP clients, including AWS Strands, Amazon’s agent development framework, LangGraph, the framework used in our SRE agent implementation, and CrewAI, a multi-agent collaboration framework.

The advantage of this approach is that existing APIs require no modification—only OpenAPI specifications. Amazon Bedrock AgentCore Gateway handles the following:

This means you can take existing infrastructure APIs (Kubernetes, monitoring, logging, documentation) and instantly make them available to AI agent frameworks that support MCP—through a single, secure, standardized interface.

Implement persistent intelligence with Amazon Bedrock AgentCore Memory

Whereas Amazon Bedrock AgentCore Gateway provides seamless API access, Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent, learning assistant. The memory implementation demonstrates how a few lines of code can enable sophisticated personalization and cross-session knowledge retention.

Initialize memory strategies

The SRE agent memory component is built on Amazon Bedrock AgentCore Memory’s event-based model with automatic namespace routing. During initialization, the solution creates three memory strategies with specific namespace patterns:

from sre_agent.memory.client import SREMemoryClientfrom sre_agent.memory.strategies import create_memory_strategies# Initialize memory clientmemory_client = SREMemoryClient(    memory_name="sre_agent_memory",    region="us-east-1")# Create three specialized memory strategiesstrategies = create_memory_strategies()for strategy in strategies:    memory_client.create_strategy(strategy)

The three strategies each serve distinct purposes:

Load user personas and preferences

The solution comes preconfigured with user personas that demonstrate personalized investigations. The manage_memories.py script loads these personas:

# Load Alice - Technical SRE Engineeralice_preferences = {    "investigation_style": "detailed_systematic_analysis",    "communication": ["#alice-alerts", "#sre-team"],    "escalation": {"contact": "alice.manager@company.com", "threshold": "15min"},    "reports": "technical_exposition_with_troubleshooting_steps",    "timezone": "UTC"}# Load Carol - Executive/Directorcarol_preferences = {    "investigation_style": "business_impact_focused",    "communication": ["#carol-executive", "#strategic-alerts"],    "escalation": {"contact": "carol.director@company.com", "threshold": "5min"},    "reports": "executive_summary_without_technical_details",    "timezone": "EST"}# Store preferences using memory clientmemory_client.store_user_preference("Alice", alice_preferences)memory_client.store_user_preference("Carol", carol_preferences)

Automatic namespace routing in action

The power of Amazon Bedrock AgentCore Memory lies in its automatic namespace routing. When the SRE agent creates events, it only needs to provide the actor_id—Amazon Bedrock AgentCore Memory automatically determines which namespaces the event belongs to:

# During investigation, the supervisor agent stores contextmemory_client.create_event(    memory_id="sre_agent_memory-abc123",    actor_id="Alice",  # AgentCore Memory routes this automatically    session_id="investigation_2025_01_15",    messages=[("investigation_started", "USER")])# Memory system automatically:# 1. Checks strategy namespaces <!-- "all" is necessary here for technical accuracy --># 2. Matches actor_id "Alice" to /sre/users/Alice/preferences# 3. Stores event in User Preferences Strategy# 4. Makes event available for future retrievals

Validate the personalized investigation experience

The memory component’s impact becomes clear when both Alice and Carol investigate the same issue. Using identical technical findings, the solution produces completely different presentations of the same underlying content.

Alice’s technical report contains detailed systematic analysis for technical teams:

Technical Investigation SummaryRoot Cause: Payment processor memory leak causing OOM killsAnalysis:- Pod restart frequency increased 300% at 14:23 UTC- Memory utilization peaked at 8.2GB (80% of container limit)- JVM garbage collection latency spiked to 2.3sNext Step:1. Implement heap dump analysis (`kubectl exec payment-pod -- jmap`)2. Review recent code deployments for memory management changes3. Consider increasing memory limits and implementing graceful shutdown

Carol’s executive summary contains business impact focused for executive stakeholders:

Business Impact AssessmentStatus: CRITICAL - Customer payment processing degradedImpact: 23% transaction failure rate, $47K revenue at riskTimeline: Issue detected 14:23 UTC, resolution ETA 45 minutesBusiness Actions: - Customer communication initiated via status page - Finance team alerted for revenue impact tracking - Escalating to VP Engineering if not resolved by 15:15 UTC

The memory component enables this personalization while continuously learning from each investigation, building organizational knowledge that improves incident response over time.

Deploy to production with Amazon Bedrock AgentCore Runtime

Amazon Bedrock AgentCore makes it straightforward to deploy existing agents to production. The process involves three key steps: containerizing your agent, deploying to Amazon Bedrock AgentCore Runtime, and invoking the deployed agent.

Containerize your agent

Amazon Bedrock AgentCore Runtime requires ARM64 containers. The following code shows the complete Dockerfile:

# Use uv's ARM64 Python base imageFROM --platform=linux/arm64 ghcr.io/astral-sh/uv:python3.12-bookworm-slimWORKDIR /app# Copy uv filesCOPY pyproject.toml uv.lock ./# Install dependenciesRUN uv sync --frozen --no-dev# Copy SRE agent moduleCOPY sre_agent/ ./sre_agent/# Set environment variables# Note: Set DEBUG=true to enable debug logging and tracesENV PYTHONPATH="/app" \    PYTHONDONTWRITEBYTECODE=1 \    PYTHONUNBUFFERED=1# Expose portEXPOSE 8080# Run application with OpenTelemetry instrumentationCMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

Existing agents just need a FastAPI wrapper (agent_runtime:app) to become compatible with Amazon Bedrock AgentCore, and we add opentelemetry-instrument to enable observability through Amazon Bedrock AgentCore.

Deploy to Amazon Bedrock AgentCore Runtime

Deploying to Amazon Bedrock AgentCore Runtime is straightforward with the deploy_agent_runtime.py script:

import boto3# Create AgentCore clientclient = boto3.client('bedrock-agentcore', region_name=region)# Environment variables for your agentenv_vars = {    'GATEWAY_ACCESS_TOKEN': gateway_access_token,    'LLM_PROVIDER': llm_provider,    'ANTHROPIC_API_KEY': anthropic_api_key  # if using Anthropic}# Deploy container to AgentCore Runtimeresponse = client.create_agent_runtime(    agentRuntimeName=runtime_name,    agentRuntimeArtifact={        'containerConfiguration': {            'containerUri': container_uri  # Your ECR container URI        }    },    networkConfiguration={"networkMode": "PUBLIC"},    roleArn=role_arn,    environmentVariables=env_vars)print(f"Agent Runtime ARN: {response['agentRuntimeArn']}")

Amazon Bedrock AgentCore handles the infrastructure, scaling, and session management automatically.

Invoke your deployed agent

Calling your deployed agent is just as simple with invoke_agent_runtime.py:

# Prepare your query with user_id and session_id for memory personalizationpayload = json.dumps({    "input": {        "prompt": "API response times have degraded 3x in the last hour",        "user_id": "Alice",  # User for personalized investigation        "session_id": "investigation-20250127-123456"  # Session for context    }})# Invoke the deployed agentresponse = agent_core_client.invoke_agent_runtime(    agentRuntimeArn=runtime_arn,    runtimeSessionId=session_id,    payload=payload,    qualifier="DEFAULT")# Get the responseresponse_data = json.loads(response['response'].read())print(response_data)  # Full response includes output with agent's investigation

Key benefits of Amazon Bedrock AgentCore Runtime

Amazon Bedrock AgentCore Runtime offers the following key benefits:

The complete deployment process, including building containers and handling AWS permissions, is documented in the Deployment Guide.

Real-world use cases

Let’s explore how the SRE agent handles common incident response scenarios with a real investigation.

When facing a production issue, you can query the system in natural language. The solution uses Amazon Bedrock AgentCore Memory to personalize the investigation based on your role and preferences:

export USER_ID=Alicesre-agent --prompt "API response times have degraded 3x in the last hour"

The supervisor retrieves Alice’s preferences from memory (detailed systematic analysis style) and creates an investigation plan tailored to her role as a Technical SRE:

Investigation Plan1. Use metrics_agent to analyze API performance metrics including response times, error rates, and resource utilization to identify the extent and pattern of slowdown2. Use logs_agent to examine application logs for errors related to slow API responses, focusing on database connection issues and memory errors3. Use kubernetes_agent to check pod status and resource constraints, especially for web-service and database pods, looking for CrashLoopBackOff states and missing ConfigMapsComplexity: SimpleAuto-execute: YesAgents involved: Metrics Agent, Logs Agent, Kubernetes Agent

The agents investigate sequentially according to the plan, each contributing their specialized analysis. The solution then aggregates these findings into a comprehensive executive summary:

Executive SummaryKey Insights- Root Cause: Database service failure due to missing ConfigMap 'database-config' in production namespace, causing database pod to crash repeatedly- Impact: Severe performance degradation with API response times increased from 150ms to 5000ms (33x slower)- Severity: High - Database unavailability, memory exhaustion (100%), and CPU saturation (95%) causing 75% error rateNext Steps1. Immediate (< 1 hour): Create/update ConfigMap 'database-config' in production namespace and restart database pod2. Short-term (< 24 hours):    - Fix permissions on '/var/lib/postgresql/data' directory   - Increase Java heap space for web-service to address OutOfMemoryErrors   - Optimize UserService.loadAllUsers method causing memory issues3. Long-term (< 1 week):    - Implement resource monitoring with alerts for CPU (>80%), memory (>90%)   - Optimize slow database queries, particularly "SELECT * FROM users WHERE status='active'"   - Scale up resources or implement autoscaling for web-serviceCritical Alerts- Database pod (database-pod-7b9c4d8f2a-x5m1q) in CrashLoopBackOff state- Web-service experiencing OutOfMemoryErrors in UserService.loadAllUsers(UserService.java:45)- Node-3 experiencing memory pressure (>85% usage)- Web-app-deployment showing readiness probe failures with 503 errorsTroubleshooting Steps1. Verify ConfigMap status: `kubectl get configmap database-config -n production`2. Check database pod logs: `kubectl logs database-pod-7b9c4d8f2a-x5m1q -n production`3. Create/update ConfigMap: `kubectl create configmap database-config --from-file=database.conf -n production`4. Fix data directory permissions: `kubectl exec database-pod-7b9c4d8f2a-x5m1q -n production -- chmod -R 700 /var/lib/postgresql/data`5. Restart database pod: `kubectl delete pod database-pod-7b9c4d8f2a-x5m1q -n production`

This investigation demonstrates how Amazon Bedrock AgentCore primitives work together:

The SRE agent demonstrates intelligent agent orchestration, with the supervisor routing work to specialists based on the investigation plan. The solution’s memory capabilities make sure each investigation builds organizational knowledge and provides personalized experiences based on user roles and preferences.

This investigation showcases several key capabilities:

Business impact

Organizations implementing AI-powered SRE assistance report significant improvements in key operational metrics. Initial investigations that previously took 30–45 minutes can now be completed in 5–10 minutes, providing SREs with comprehensive context before diving into detailed analysis. This dramatic reduction in investigation time translates directly to faster incident resolution and reduced downtime.The solution improves how SREs interact with their infrastructure. Instead of navigating multiple dashboards and tools, engineers can ask questions in natural language and receive aggregated insights from relevant data sources. This reduction in context switching enables teams to maintain focus during critical incidents and reduces cognitive load during investigations.Perhaps most importantly, the solution democratizes knowledge across the team. All team members can access the same comprehensive investigation techniques, reducing dependency on tribal knowledge and on-call burden. The consistent methodology provided by the solution makes sure investigation approaches remain uniform across team members and incident types, improving overall reliability and reducing the chance of missed evidence.

The automatically generated investigation reports provide valuable documentation for post-incident reviews and help teams learn from each incident, building organizational knowledge over time. Furthermore, the solution extends existing AWS infrastructure investments, working alongside services like Amazon CloudWatch, AWS Systems Manager, and other AWS operational tools to provide a unified operational intelligence system.

Extending the solution

The modular architecture makes it straightforward to extend the solution for your specific needs.

For example, you can add specialized agents for your domain:

You can also replace the demo APIs with connections to your actual systems:

Clean up

To avoid incurring future charges, use the cleanup script to remove the billable AWS resources created during the demo:

# Complete cleanup - deletes AWS resources and local files./scripts/cleanup.sh

This script automatically performs the following actions:

For detailed cleanup instructions, refer to Cleanup Instructions.

Conclusion

The SRE agent demonstrates how multi-agent systems can transform incident response from a manual, time-intensive process into a time-efficient, collaborative investigation that provides SREs with the insights they need to resolve issues quickly and confidently.

By combining the enterprise-grade infrastructure of Amazon Bedrock AgentCore with standardized tool access in MCP, we’ve created a foundation that can adapt as your infrastructure evolves and new capabilities emerge.

The complete implementation is available in our GitHub repository, including demo environments, configuration guides, and extension examples. We encourage you to explore the solution, customize it for your infrastructure, and share your experiences with the community.

To get started building your own SRE assistant, refer to the following resources:


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.

Dheeraj Oruganty is a Delivery Consultant at Amazon Web Services. He is passionate about building innovative Generative AI and Machine Learning solutions that drive real business impact. His expertise spans Agentic AI Evaluations, Benchmarking and Agent Orchestration, where he actively contributes to research advancing the field. He holds a master’s degree in Data Science from Georgetown University. Outside of work, he enjoys geeking out on cars, motorcycles, and exploring nature.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

站点可靠性工程 生成式AI Amazon Bedrock 多智能体系统 基础设施管理 事件响应 Site Reliability Engineering Generative AI Amazon Bedrock Multi-Agent Systems Infrastructure Management Incident Response
相关文章