KEYWORDS: Video coding, Video, Pose estimation, 3D modeling, Education and training, 3D video compression, Transformers, Motion models, Data modeling, Windows
This study introduces a novel framework for video-based 3D human pose and shape estimation, termed Selective sampling and Temporal Positional Encoding (STPE). Our method combines selective sampling with rotary positional encoding to address the temporal complexity of video data and the high cost and scarcity of annotated datasets. Inspired by the Masked Autoencoder (MAE), our approach adopts a selective sampling strategy that captures the essential dynamics of human motion from partial views, significantly reducing reliance on continuous frames. The framework incorporates Rotary Position Embedding (RoPE), which encodes position through rotation angles and thereby simplifies positional encoding. This decreases model complexity and boosts learning effectiveness. We also randomize index positions during training, which adds variability and improves generalization across datasets and motion patterns. Our model, validated on the standard datasets 3DPW, MPI-INF-3DHP, and Human3.6M, captures 3D pose and shape more accurately and robustly than existing methods. Our results demonstrate that strategic frame sampling and rotary positional encoding can significantly improve the accuracy and robustness of video-based pose estimation systems.
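The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. `rope` applies the standard rotary scheme (consecutive feature pairs rotated by position-dependent angles, with the commonly used base of 10000), and `sample_frame_indices` is a hypothetical stand-in for selective sampling that keeps a sorted random subset of frame indices. Neither function is taken from the paper.

```python
import math
import random


def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles that grow with `position`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(d // 2):
        # Each pair gets its own frequency; the angle scales linearly with position.
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def sample_frame_indices(num_frames, keep, rng):
    """Hypothetical selective sampling: a sorted random subset of frame indices."""
    return sorted(rng.sample(range(num_frames), keep))


# Assumed usage: the kept frame indices serve as the RoPE positions.
rng = random.Random(0)
positions = sample_frame_indices(num_frames=16, keep=6, rng=rng)
query = [0.3, -1.2, 0.7, 2.0]
rotated = [rope(query, p) for p in positions]
```

Because each pair is rotated, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is why non-contiguous frame indices can still carry usable temporal information in this sketch.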
KEYWORDS: Performance modeling, Databases, Systems modeling, Communication engineering, Data modeling, Telecommunications, Computing systems, Roads, Computer engineering, Information technology
Attention-based bidirectional long short-term memory networks have attracted growing interest and are widely used in natural language processing tasks. Motivated by the performance of the attention mechanism, various attentive models have been proposed to improve the effectiveness of question answering. However, few studies have focused on the impact of positional information on question answering, although it has proved effective in information retrieval. In this paper, we assume that if a word appears in both the question sentence and the answer sentence, the words close to it deserve more attention, since they are more likely to contain valuable information for the question. Moreover, few studies have incorporated part-of-speech information into question answering. We argue that words other than nouns, verbs, and pronouns tend to carry less useful information, so their positional impact can be neglected. Based on these two assumptions, we propose a part-of-speech and position attention mechanism based on bidirectional long short-term memory networks for question answering systems, abbreviated as DPOS-ATT-BLSTM, which cooperates with the traditional attention mechanism to obtain attentive answer representations. We experiment on a Chinese medical dataset collected from http://www.xywy.com/ and http://www.haodf.com/, and compare against methods based on the traditional attention mechanism. The experimental results demonstrate the good performance and efficiency of our proposed model.
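The two assumptions above can be illustrated with a small sketch. The abstract does not say how positional influence is computed, so the Gaussian distance kernel, the `sigma` parameter, the tag names, and the function `position_weights` are all assumptions made for illustration; this is not the DPOS-ATT-BLSTM model itself, and the English tokens are a toy example.

```python
import math

# Assumed tag set: only these parts of speech may act as anchors.
CONTENT_TAGS = {"NOUN", "VERB", "PRON"}


def position_weights(question, answer, answer_tags, sigma=1.5):
    """Weight each answer token by its closeness to words shared with the question.

    Only shared words tagged as nouns, verbs, or pronouns act as anchors.
    The Gaussian kernel is an illustrative choice, not the paper's formula.
    """
    if not answer:
        return []
    question_words = set(question)
    anchors = [
        i
        for i, (word, tag) in enumerate(zip(answer, answer_tags))
        if word in question_words and tag in CONTENT_TAGS
    ]
    raw = [
        sum(math.exp(-((j - i) ** 2) / (2 * sigma ** 2)) for i in anchors)
        for j in range(len(answer))
    ]
    total = sum(raw)
    if total == 0:
        # No anchors found: fall back to uniform weights.
        return [1.0 / len(answer)] * len(answer)
    return [w / total for w in raw]


question = ["what", "treats", "headache"]
answer = ["aspirin", "often", "treats", "mild", "headache"]
tags = ["NOUN", "ADV", "VERB", "ADJ", "NOUN"]
weights = position_weights(question, answer, tags)
```

In a full model, weights like these would be combined with the traditional attention scores over the BLSTM hidden states; how the paper combines them is not stated in the abstract.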