AWS Machine Learning Blog, September 25

Amazon Bedrock Data Automation enhances video understanding with open-set object detection

In real-world video and image analysis, businesses often face the challenge of detecting objects that weren't part of a model's original training set. This becomes especially difficult in dynamic environments where new, unknown, or user-defined objects frequently appear. For example, media publishers might want to track emerging brands or products in user-generated content; advertisers need to analyze product appearances in influencer videos despite visual variations; retail providers aim to support flexible, descriptive search; self-driving cars must identify unexpected road debris; and manufacturing systems need to catch novel or subtle defects without prior labeling.

In all these cases, traditional closed-set object detection (CSOD) models, which only recognize a fixed list of predefined categories, fail to deliver. They either misclassify unknown objects or ignore them entirely, limiting their usefulness for real-world applications.

Open-set object detection (OSOD) is an approach that enables models to detect both known and previously unseen objects, including those not encountered during training. It supports flexible input prompts, ranging from specific object names to open-ended descriptions, and can adapt to user-defined targets in real time without requiring retraining. By combining visual recognition with semantic understanding, often through vision-language models, OSOD helps users query the system broadly, even when the target is unfamiliar, ambiguous, or entirely new.

In this post, we explore how Amazon Bedrock Data Automation uses OSOD to enhance video understanding.

Amazon Bedrock Data Automation and video blueprints with OSOD

Amazon Bedrock Data Automation is a cloud-based service that extracts insights from unstructured content such as documents, images, video, and audio. For video content specifically, Amazon Bedrock Data Automation supports functionalities such as chapter segmentation, frame-level text detection, chapter-level classification with Interactive Advertising Bureau (IAB) taxonomies, and frame-level OSOD. For more information about Amazon Bedrock Data Automation, see Automate video insights for contextual advertising using Amazon Bedrock Data Automation.

Amazon Bedrock Data Automation video blueprints support OSOD at the frame level. You can input a video along with a text prompt specifying the desired objects to detect. For each frame, the model outputs a dictionary containing bounding boxes in XYWH format (the x and y coordinates of the top-left corner, followed by the width and height of the box), along with corresponding labels and confidence scores. You can further customize the output based on your needs; for instance, filtering for high-confidence detections when precision is prioritized.
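
As an illustration, a minimal Python sketch of such a precision-first filter might look like the following. The frame_result layout mirrors the sample custom output shown later in this post; filter_detections is an illustrative helper, not part of the service API:

# Minimal sketch: keep only high-confidence OSOD detections from one frame.
# The frame_result layout follows the sample custom output later in this
# post; filter_detections is an illustrative helper, not a service API.

def filter_detections(frame_result, min_confidence=0.8):
    """Return only detections at or above min_confidence."""
    detections = frame_result.get("targeted-object-detection", [])
    return [d for d in detections if d["confidence"] >= min_confidence]

frame_result = {
    "targeted-object-detection": [
        {"label": "man", "bounding_box": {"left": 0.62, "top": 0.11, "width": 0.16, "height": 0.77}, "confidence": 0.92},
        {"label": "cliff", "bounding_box": {"left": 0.47, "top": 0.57, "width": 0.17, "height": 0.20}, "confidence": 0.72},
    ]
}
print(filter_detections(frame_result))  # keeps only the "man" detection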

The input text is highly flexible, so you can define dynamic fields in the Amazon Bedrock Data Automation video blueprints powered by OSOD.

Example use cases

In this section, we explore some examples of different use cases for Amazon Bedrock Data Automation video blueprints using OSOD. The following table summarizes the functionality of this feature.

Functionality | Sub-functionality | Examples
Multi-granular visual comprehension | Object detection from fine-grained object reference | "Detect the apple in the video."
Multi-granular visual comprehension | Object detection from cross-granularity object reference | "Detect all the fruit items in the image."
Multi-granular visual comprehension | Object detection from open questions | "Find and detect the most visually important elements in the image."
Visual hallucination detection | Identify and flag object mentions in the input text that do not correspond to actual content in the given image | "Detect if apples appear in the image."

Ads analysis

Advertisers can use this feature to compare the effectiveness of various ad placement strategies across different locations and conduct A/B testing to identify the optimal advertising approach. For example, the following image is the output in response to the prompt “Detect the locations of Echo devices.”
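
To make such a comparison concrete, the following sketch tallies how long a detected product stays on screen for each placement variant. The per-frame input layout and the screen_time_seconds helper are illustrative assumptions, not a service output format:

# Illustrative sketch: compare ad placement strategies by on-screen time.
# Assumes you have run OSOD over one video per placement variant and
# collected per-frame results; the sample data below is hypothetical.

def screen_time_seconds(frames, label, fps=30.0, min_confidence=0.7):
    """Count seconds in which `label` is detected above the threshold."""
    hits = sum(
        1
        for frame in frames
        if any(
            d["label"] == label and d["confidence"] >= min_confidence
            for d in frame.get("targeted-object-detection", [])
        )
    )
    return hits / fps

sample_frame = {"targeted-object-detection": [
    {"label": "echo device", "confidence": 0.9,
     "bounding_box": {"left": 0.1, "top": 0.1, "width": 0.2, "height": 0.2}}
]}
variants = {"placement_a": [sample_frame] * 90, "placement_b": [sample_frame] * 45}
for name, frames in variants.items():
    print(name, screen_time_seconds(frames, label="echo device"))
# placement_a: 3.0 seconds on screen, placement_b: 1.5 seconds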

Smart resizing

By detecting key elements in the video, you can choose appropriate resizing strategies for devices with different resolutions and aspect ratios, making sure important visual information is preserved. For example, the following image is the output in response to the prompt “Detect the key elements in the video.”
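
As one illustrative resizing strategy, the following sketch computes the union of the detected key elements' normalized XYWH boxes and expands it toward a target aspect ratio to derive a crop window. The helpers are hypothetical, and a production version would work in pixel coordinates rather than normalized ones:

# Illustrative sketch: derive a crop window for smart resizing from OSOD
# boxes in the normalized XYWH format described above. For simplicity this
# operates on normalized coordinates; with real frames, convert to pixels
# first so the aspect ratio is exact.

def union_box(boxes):
    """Smallest box covering all detected key elements (normalized XYWH)."""
    left = min(b["left"] for b in boxes)
    top = min(b["top"] for b in boxes)
    right = max(b["left"] + b["width"] for b in boxes)
    bottom = max(b["top"] + b["height"] for b in boxes)
    return {"left": left, "top": top, "width": right - left, "height": bottom - top}

def expand_to_aspect(box, target_w_over_h):
    """Grow the box symmetrically until it matches the target aspect ratio."""
    w, h = box["width"], box["height"]
    if w / h < target_w_over_h:
        w = h * target_w_over_h   # too narrow: widen
    else:
        h = w / target_w_over_h   # too wide: heighten
    cx = box["left"] + box["width"] / 2
    cy = box["top"] + box["height"] / 2
    # Clamp to the frame; a production version would also handle
    # crop windows larger than the frame itself.
    return {"left": max(0.0, cx - w / 2), "top": max(0.0, cy - h / 2),
            "width": min(1.0, w), "height": min(1.0, h)}

boxes = [{"left": 0.62, "top": 0.11, "width": 0.16, "height": 0.77},
         {"left": 0.47, "top": 0.57, "width": 0.17, "height": 0.20}]
crop = expand_to_aspect(union_box(boxes), target_w_over_h=9 / 16)
print(crop)  # normalized crop window for a 9:16 vertical variant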

Surveillance with intelligent monitoring

In home security systems, producers or users can take advantage of the model’s high-level understanding and localization capabilities to maintain safety, without the need to manually enumerate all possible scenarios. For example, the following image is the output in response to the prompt “Check dangerous elements in the video.”

Custom labels

You can define your own labels and search through videos to retrieve specific, desired results. For example, the following image is the output in response to the prompt “Detect the white car with red wheels in the video.”

Image and video editing

With flexible text-based object detection, you can accurately remove or replace objects in photo editing software, minimizing the need for imprecise, hand-drawn masks that often require multiple attempts to achieve the desired result. For example, the following image is the output in response to the prompt “Detect the people riding motorcycles in the video.”
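
For example, a detection's bounding box can be rasterized into a binary mask for an inpainting or object-removal tool. The following NumPy sketch assumes the normalized XYWH format described earlier; box_to_mask is an illustrative helper:

# Illustrative sketch: turn a detected bounding box into a binary mask that
# an inpainting or object-removal tool could consume.
import numpy as np

def box_to_mask(box, frame_height, frame_width):
    """Rasterize a normalized XYWH box into a uint8 mask (1 inside the box)."""
    mask = np.zeros((frame_height, frame_width), dtype=np.uint8)
    x0 = int(box["left"] * frame_width)
    y0 = int(box["top"] * frame_height)
    x1 = int((box["left"] + box["width"]) * frame_width)
    y1 = int((box["top"] + box["height"]) * frame_height)
    mask[y0:y1, x0:x1] = 1
    return mask

mask = box_to_mask({"left": 0.47, "top": 0.57, "width": 0.17, "height": 0.20},
                   frame_height=1080, frame_width=1920)
print(mask.sum())  # number of masked pixels to remove or replace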

Sample video blueprint input and output

The following example demonstrates how to define an Amazon Bedrock Data Automation video blueprint to detect visually prominent objects at the chapter level, with sample output including objects and their bounding boxes.

The following code is our example blueprint schema:

blueprint = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "This blueprint enhances the searchability and discoverability of video content by providing comprehensive object detection and scene analysis.",
    "class": "media_search_video_analysis",
    "type": "object",
    "properties": {
        # Targeted Object Detection: Identifies visually prominent objects in the video
        # Set granularity to chapter level for more precise object detection
        "targeted-object-detection": {
            "type": "array",
            "instruction": "Please detect all the visually prominent objects in the video",
            "items": {
                "$ref": "bedrock-data-automation#/definitions/Entity"
            },
            "granularity": ["chapter"]  # Chapter-level granularity provides per-scene object detection
        },
    }
}
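
For context, a blueprint like this is typically attached to a Data Automation project and invoked asynchronously. The following is a hedged sketch using the boto3 bedrock-data-automation-runtime client; the ARNs, S3 URIs, and region are placeholders, and parameter names should be verified against your boto3 version:

# Hedged sketch: invoking a Bedrock Data Automation project that includes
# the blueprint above. All ARNs and S3 URIs are placeholders; verify the
# parameter names against your boto3 version before use.
import time
import boto3

runtime = boto3.client("bedrock-data-automation-runtime", region_name="us-west-2")

response = runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://amzn-s3-demo-bucket/input/video.mp4"},
    outputConfiguration={"s3Uri": "s3://amzn-s3-demo-bucket/output/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:us-west-2:111122223333:data-automation-project/my-project",
        "stage": "LIVE",
    },
    dataAutomationProfileArn="arn:aws:bedrock:us-west-2:111122223333:data-automation-profile/us.data-automation-v1",
)

# Poll until the asynchronous job finishes; results land in the output S3 prefix.
while True:
    status = runtime.get_data_automation_status(invocationArn=response["invocationArn"])
    if status["status"] in ("Success", "ServiceError", "ClientError"):
        break
    time.sleep(10)
print(status["status"])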

The following code is our example video custom output:

"chapters": [        .....,        {            "inference_result": {                "emotional-tone": "Tension and suspense"            },            "frames": [                {                    "frame_index": 10289,                    "inference_result": {                        "targeted-object-detection": [                            {                                "label": "man",                                "bounding_box": {                                    "left": 0.6198254823684692,                                    "top": 0.10746771097183228,                                    "width": 0.16384708881378174,                                    "height": 0.7655990719795227                                },                                "confidence": 0.9174646443068981                            },                            {                                "label": "ocean",                                "bounding_box": {                                    "left": 0.0027531087398529053,                                    "top": 0.026655912399291992,                                    "width": 0.9967235922813416,                                    "height": 0.7752640247344971                                },                                "confidence": 0.7712276351034641                            },                            {                                "label": "cliff",                                "bounding_box": {                                    "left": 0.4687306359410286,                                    "top": 0.5707792937755585,                                    "width": 0.168929323554039,                                    "height": 0.20445972681045532                                },                                "confidence": 0.719932173293829                            }                        ],                    },                    "timecode_smpte": "00:05:43;08",                    "timestamp_millis": 343276                }            ],            "chapter_index": 11,            "start_timecode_smpte": "00:05:36;16",            "end_timecode_smpte": "00:09:27;14",            "start_timestamp_millis": 336503,            "end_timestamp_millis": 567400,            "start_frame_index": 10086,            "end_frame_index": 17006,            "duration_smpte": "00:03:50;26",            "duration_millis": 230897,            "duration_frames": 6921        },        ..........]

For the full example, refer to the following GitHub repo.

Conclusion

The OSOD capability within Amazon Bedrock Data Automation significantly enhances the ability to extract actionable insights from video content. By combining flexible text-driven queries with frame-level object localization, OSOD helps users across industries implement intelligent video analysis workflows, ranging from targeted ad evaluation and security monitoring to custom object tracking. Integrated seamlessly into the broader suite of video analysis tools available in Amazon Bedrock Data Automation, OSOD not only streamlines content understanding but also helps reduce the need for manual intervention and rigid predefined schemas, making it a powerful asset for scalable, real-world applications.

To learn more about Amazon Bedrock Data Automation video and audio analysis, see New Amazon Bedrock Data Automation capabilities streamline video and audio analysis.


About the authors

Dongsheng An is an Applied Scientist at AWS AI, specializing in face recognition, open-set object detection, and vision-language models. He received his Ph.D. in Computer Science from Stony Brook University, focusing on optimal transport and generative modeling.

Lana Zhang is a Senior Solutions Architect in the AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. She’s dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases by adding business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, advertising, and marketing.

Raj Jayaraman is a Senior Generative AI Solutions Architect at AWS, bringing over a decade of experience in helping customers extract valuable insights from data. Specializing in AWS AI and generative AI solutions, Raj’s expertise lies in transforming business solutions through the strategic application of AWS’s AI capabilities, ensuring customers can harness the full potential of generative AI in their unique contexts. With a strong background in guiding customers across industries in adopting AWS Analytics and Business Intelligence services, Raj now focuses on assisting organizations in their generative AI journey—from initial demonstrations to proof of concepts and ultimately to production implementations.
