Video action recognition methods based on deep learning can be divided into two types: two-dimensional convolutional networks (2D-ConvNets) relied and three-dimensional convolutional networks (3D-ConvNets) relied. 2D-ConvNets are more efficient to learn spatial features, but cannot capture temporal relationships directly. 3D-ConvNets can jointly learn spatial–temporal features, but their learning is time-consuming because of a large number of networks’ parameters. We therefore propose an effective spatial–temporal interaction (STI) module. The 2D spatial convolution and the one-dimensional temporal convolution are combined through attention mechanism in STI to learn the spatial–temporal information effectively and efficiently. The computation cost of the proposed method is far less than 3D convolution. The proposed STI module can be combined with 2D-ConvNets to obtain the effect of 3D-ConvNets with far fewer parameters, and it can also be inserted into 3D-ConvNets to improve their ability to learn spatial–temporal features, so as to improve the recognition accuracy. Experimental results show that the proposed method outperforms the existing counterparts on benchmark datasets.
Access to the requested content is limited to institutions that have purchased or subscribe to SPIE eBooks.
You are receiving this notice because your organization may not have SPIE eBooks access.*
*Shibboleth/Open Athens users─please
sign in
to access your institution's subscriptions.
To obtain this item, you may purchase the complete book in print or electronic format on
SPIE.org.
INSTITUTIONAL Select your institution to access the SPIE Digital Library.
PERSONAL Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.