A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video
Refereed conference paper presented and published in conference proceedings
CUHK Authors
Other information
Abstract: Spatio-temporal information is key to resolving occlusion and depth ambiguity in 3D human pose estimation. Previous methods have focused on either temporal contexts or local-to-global architectures that embed fixed-length spatio-temporal information. To date, there has been no effective proposal that simultaneously and flexibly captures varying spatio-temporal sequences and achieves real-time 3D human pose estimation. In this work, we improve the learning of kinematic constraints in the human skeleton (posture, local kinematic connections, and symmetry) by modeling local and global spatial information via attention mechanisms. To adapt to single- and multi-frame estimation, a dilated temporal model is employed to process varying skeleton sequences. Importantly, we also carefully design the interleaving of spatial semantics with temporal dependencies to achieve a synergistic effect. To this end, we propose a simple yet effective graph attention spatio-temporal convolutional network (GAST-Net) that comprises interleaved temporal convolutional and graph attention blocks. Experiments on two challenging benchmark datasets (Human3.6M and HumanEva-I) and YouTube videos demonstrate that our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to upper half-body estimation, and achieves competitive performance on 2D-to-3D video pose estimation. Code, video, and supplementary information are available at: http://www.juanrojas.net/gast/
Acceptance Date: 28/02/2021
All Author(s) List: Junfa Liu, Juan Rojas, Yihui Li, Zhijun Liang, Yisheng Guan, Ning Xi, Haifei Zhu
Name of Conference: IEEE International Conference on Robotics and Automation (ICRA)
Start Date of Conference: 30/05/2021
End Date of Conference: 05/06/2021
Place of Conference: Xian, China
Country/Region of Conference: China
Year: 2021
Month: 6
Publisher: IEEE
Place of Publication: Xian
Languages: English-United States
Keywords: action classification, graph attention, robotic interaction