Paper
Optimizing action segmentation with linear self-attention and pyramidal pooling
18 November 2024
Jiamin Fu, Zhihong Chen, Haiwei Zhang, Yuxuan Gao
Proceedings Volume 13403, International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2024); 134032H (2024) https://doi.org/10.1117/12.3051722
Event: International Conference on Algorithms, High Performance Computing, and Artificial Intelligence, 2024, Zhengzhou, China
Abstract
Continuous action segmentation, which aims to temporally segment unedited long videos, is a challenging task in video semantic understanding. Current state-of-the-art methods combine time-domain convolution with self-attention to capture temporal correlations, achieving accurate frame-level classification and reducing over-segmentation during prediction. However, these models rely on multiple decoding modules and complex self-attention mechanisms, which improve frame accuracy but incur high computational costs, particularly on long video sequences. To address this issue, we propose a novel hybrid model that augments a temporal convolutional model with a simplified, extended linear self-attention layer. This design enables the model to focus more effectively on action changes at key moments in a video sequence while significantly reducing computational complexity. Furthermore, we introduce an attention enhancement module with a pyramidal pooling structure to improve the model's ability to capture actions at multiple temporal scales. Together, these components achieve a better balance between accuracy and computational efficiency. Experimental results show that our model reaches 86.0% accuracy on the 50Salads dataset and 84.6% on the GTEA dataset, outperforming existing methods such as MS-TCN++ and ASFormer in both accuracy and computational efficiency. Notably, our model requires only 0.98M parameters, a significant reduction compared to ASFormer. These advantages make the model well-suited for practical applications involving long video sequences, where computational efficiency and accuracy are both crucial.
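The abstract names two building blocks, a linear self-attention layer and a pyramidal pooling attention-enhancement module, without giving implementation details. The PyTorch sketch below is only an illustration of how such components are commonly realized, assuming an elu(x)+1 feature map for kernelized linear attention and adaptive average pooling over a few temporal bins for the pyramid module; the class names, layer sizes, and bin settings are assumptions, not the authors' code.

# Minimal sketch (assumed, not the authors' implementation) of a linear
# self-attention layer and a temporal pyramid pooling module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSelfAttention(nn.Module):
    """Kernelized self-attention with cost linear in the sequence length T."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, T, C)
        q = F.elu(self.to_q(x)) + 1                # positive feature map phi(.)
        k = F.elu(self.to_k(x)) + 1
        v = self.to_v(x)
        kv = torch.einsum("btc,btd->bcd", k, v)    # sum_t phi(k_t) v_t^T, shape (B, C, C)
        z = 1.0 / (torch.einsum("btc,bc->bt", q, k.sum(dim=1)) + 1e-6)  # normalizer
        return torch.einsum("btc,bcd,bt->btd", q, kv, z)


class TemporalPyramidPooling(nn.Module):
    """Pool the sequence at several temporal scales and fuse the results."""

    def __init__(self, dim, bins=(1, 2, 4, 8)):    # bin sizes are illustrative
        super().__init__()
        self.bins = bins
        self.fuse = nn.Conv1d(dim * (len(bins) + 1), dim, kernel_size=1)

    def forward(self, x):                          # x: (B, C, T)
        t = x.size(-1)
        feats = [x]
        for b in self.bins:
            pooled = F.adaptive_avg_pool1d(x, b)   # coarse summary at scale b
            feats.append(F.interpolate(pooled, size=t, mode="linear",
                                       align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))


if __name__ == "__main__":
    frames = torch.randn(2, 500, 64)               # (batch, frames, features)
    attn = LinearSelfAttention(64)
    tpp = TemporalPyramidPooling(64)
    y = attn(frames)                               # (2, 500, 64)
    y = tpp(y.transpose(1, 2)).transpose(1, 2)     # back to (2, 500, 64)
    print(y.shape)

The linear formulation avoids materializing a T-by-T attention matrix and instead accumulates two summaries over the sequence, which is where the savings on long videos claimed in the abstract would come from; the pyramid module adds context at several temporal resolutions before fusion.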
© 2024 Published by SPIE. Downloading of the abstract is permitted for personal use only.
Jiamin Fu, Zhihong Chen, Haiwei Zhang, and Yuxuan Gao, "Optimizing action segmentation with linear self-attention and pyramidal pooling", Proc. SPIE 13403, International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2024), 134032H (18 November 2024); https://doi.org/10.1117/12.3051722
KEYWORDS: Video, Convolution, Video surveillance, Performance modeling, Data modeling, Mathematical optimization, Image segmentation