Few-shot action recognition aims to recognize novel action classes from only a few labeled samples and has received widespread attention for practical systems. The skeleton is a sparse representation of human actions, yet existing spatiotemporal models that train a strong encoder network tend to make the skeleton graph very dense with edges, which may lead to the over-smoothing problem. To address this issue, we propose the Spatio-Temporal Aggregation Transformer Network (STAT-Net) as a general backbone for skeleton-based few-shot action recognition. In the spatio-temporal aggregation transformer modules, spatial multi-head self-attention models the connections among different joints within the same frame, while temporal multi-head self-attention models the skeleton sequence across adjacent frames. The features extracted by the three parts are aggregated by an Adaptive Fusion technique to obtain a high-dimensional embedding. Extensive experiments on two benchmarks demonstrate that our proposed model achieves better recognition results compared with existing methods.
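To make the module structure concrete, the sketch below illustrates the two attention branches described above (spatial attention over joints within a frame, temporal attention over frames per joint) fused by learned branch weights. The class name `STATBlock`, the shape conventions, and the scalar-weight fusion scheme are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class STATBlock(nn.Module):
    """Minimal sketch of one spatio-temporal aggregation transformer block.

    Assumption: "Adaptive Fusion" is modeled here as softmax-normalized
    learnable branch weights; the paper's exact scheme may differ.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Spatial MHSA: attends over joints within each frame.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal MHSA: attends over frames for each joint.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable per-branch fusion weights (an assumption, see docstring).
        self.fusion_weights = nn.Parameter(torch.ones(2))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, joints V, channels C)
        B, T, V, C = x.shape
        # Spatial branch: fold frames into the batch, attend across joints.
        s = x.reshape(B * T, V, C)
        s, _ = self.spatial_attn(s, s, s)
        s = s.reshape(B, T, V, C)
        # Temporal branch: fold joints into the batch, attend across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * V, T, C)
        t, _ = self.temporal_attn(t, t, t)
        t = t.reshape(B, V, T, C).permute(0, 2, 1, 3)
        # Adaptive fusion of the two branches, plus a residual connection.
        w = torch.softmax(self.fusion_weights, dim=0)
        return self.norm(x + w[0] * s + w[1] * t)

# Usage: 4 skeleton sequences, 30 frames, 25 joints, 64-d joint features.
x = torch.randn(4, 30, 25, 64)
print(STATBlock(64)(x).shape)  # torch.Size([4, 30, 25, 64])
```

In a few-shot pipeline, stacked blocks of this kind would serve as the backbone encoder whose output embedding is compared between support and query samples.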