动作识别：结合层次结构和文本信息的新方法

cs.AI updates on arXiv.org 7小时前

动作识别：结合层次结构和文本信息的新方法

本文提出一种结合动作层次结构和文本信息（如位置和先前动作）的动作识别新方法。采用定制化的transformer架构，结合视觉和文本特征，并定义联合损失函数进行粗粒度和细粒度动作识别。实验表明，该方法在多个数据集上均优于现有技术。

arXiv:2410.21275v2 Announce Type: replace-cross Abstract: We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

动作识别层次结构文本信息 transformer架构动作分类

相关文章

MIT Researchers Propose Cross-Layer Attention (CLA): A Modification to the Transformer Architecture that Reduces the Size of the Key-Value KV Cache by Sharing KV Activations Across Layers

Hierarchical Graph Masked AutoEncoders (Hi-GMAE): A Novel Multi-Scale GMAE Framework Designed to Handle the Hierarchical Structures within Graph

A Decade of Transformation: How Deep Learning Redefined Stereo Matching in the Twenties

LLM for Biology: This Paper Discusses How Language Models can be Applied to Biological Research

TWLV-I: A New Video Foundation Model that Constructs Robust Visual Representations for both Motion and Appearance-based Videos

博士论文 | Stanford 2023 | 通过结构和功能层次创建和编辑视觉内容 136页

Nat. Commun. | 利用transformer模型将质谱数据序列翻译成肽段序列

KnowFormer: A Transformer-Based Breakthrough Model for Efficient Knowledge Graph Reasoning, Tackling Incompleteness and Enhancing Predictive Accuracy Across Large-Scale Datasets

Researchers from MIT and Peking University Introduce a Self-Correction Mechanism for Improving the Safety and Reliability of Large Language Models

Researchers from UC Berkeley Present UnSAM in Computer Vision: A New Paradigm for Segmentation with Minimal Data, Achieving State-of-the-Art Results Without Human Annotation