2D CNNs for video-based action modeling ignore temporal information, treating the stacked frames analogously to channels. In view of this, a mixed convolution structure, implemented on a ResNet-18 residual network, is designed for video feature extraction, in which 3D convolutions and (2+1)D convolutions are interleaved in sequence throughout the network. First, a 2D convolution is applied to each input video frame in the spatial domain. Then, a 1D temporal convolution is applied to the output of the 2D convolution. Finally, a 3D convolution performs spatiotemporal modeling simultaneously. Results show that the mixed convolution structure enhances the transmission of temporal information and markedly improves both video feature extraction and action recognition accuracy.
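The (2+1)D factorization described above (a 2D spatial convolution followed by a 1D temporal convolution) can be sketched as a PyTorch module. This is a minimal illustrative sketch, not the paper's exact configuration: the class name `Conv2Plus1D`, the kernel sizes, and the intermediate channel count `mid_channels` are assumptions chosen to mirror common (2+1)D designs.

```python
# Hedged sketch of a (2+1)D convolution block: a 1 x k x k spatial
# convolution over each frame, followed by a k x 1 x 1 temporal
# convolution. Hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class Conv2Plus1D(nn.Module):
    """Factorize a 3D convolution into spatial (2D) + temporal (1D) parts."""

    def __init__(self, in_channels, out_channels, mid_channels):
        super().__init__()
        # 2D convolution applied frame by frame (kernel spans H and W only)
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1D convolution over the temporal axis (kernel spans T only)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x):
        # x: (N, C, T, H, W) — batch, channels, frames, height, width
        return self.temporal(self.relu(self.bn(self.spatial(x))))


# A batch of two clips: 8 RGB frames at 56 x 56 resolution
clip = torch.randn(2, 3, 8, 56, 56)
block = Conv2Plus1D(in_channels=3, out_channels=64, mid_channels=45)
out = block(clip)
print(out.shape)  # torch.Size([2, 64, 8, 56, 56])
```

With the padding shown, the block preserves the temporal and spatial extents, so (2+1)D and full 3D blocks can be interleaved in one backbone without shape mismatches; the nonlinearity inserted between the spatial and temporal convolutions is one reason the factorized form can outperform a single 3D convolution of the same receptive field.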