Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation CVPR 2019


goal:

learn a geometry-aware 3D representation $\mathcal{G}$ for the human pose

discover the geometry relation between paired images $(I^i_t, I^j_t)$ which are acquired from synchronized and calibrated cameras

main components

Image-skeleton mapping

input: raw image pair $(I^i_t, I^j_t)$ with size $W \times H$

$C^i_t, C^j_t$: $K$ keypoint heatmaps from a pre-trained 2D human pose estimator

We follow previous works [45, 20, 18] to train the 2D estimator on the MPII dataset.

8-pixel-wide 2D skeleton maps are constructed from the heatmaps

binary skeleton map pair $(S^i_t, S^j_t)$, with $S^{(\cdot)}_t \in \{0,1\}^{(K-1) \times W \times H}$
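A minimal sketch of this image-skeleton mapping step, assuming joint locations are taken as the argmax of each heatmap and limb connectivity is given by a hypothetical `LIMBS` list (both are assumptions, not spelled out above):

```python
import cv2
import numpy as np

# Hypothetical limb connectivity: K-1 pairs of keypoint indices (placeholder values).
LIMBS = [(0, 1), (1, 2), (2, 3)]

def heatmaps_to_skeleton_maps(heatmaps, width=8):
    """Rasterize K keypoint heatmaps into K-1 binary skeleton maps.

    heatmaps: (K, H, W) array from the pre-trained 2D pose estimator.
    Returns:  (K-1, H, W) binary maps, one 8-pixel-wide limb per channel.
    """
    K, H, W = heatmaps.shape
    # Take each joint location as the argmax of its heatmap, as (x, y).
    joints = [np.unravel_index(np.argmax(h), (H, W))[::-1] for h in heatmaps]
    maps = np.zeros((len(LIMBS), H, W), dtype=np.uint8)
    for c, (a, b) in enumerate(LIMBS):
        pa = tuple(int(v) for v in joints[a])
        pb = tuple(int(v) for v in joints[b])
        cv2.line(maps[c], pa, pb, color=1, thickness=width)
    return maps
```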

Geometry representation via view synthesis

training set $\mathcal{T} = \{(S^i_t, S^j_t, R_{i \to j})\}$
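Since the cameras are calibrated, the relative rotation $R_{i \to j}$ in each training triple can be derived from the two views' extrinsic rotations. A small sketch (the world-to-camera convention here is an assumption):

```python
import numpy as np

def relative_rotation(R_i, R_j):
    """Relative rotation R_{i->j} between two calibrated views.

    R_i, R_j: (3, 3) world-to-camera rotation matrices from calibration.
    A point in camera-i coordinates is mapped to camera-j coordinates
    by R_j @ R_i.T (assuming the world-to-camera convention).
    """
    return R_j @ R_i.T
```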

A straightforward way to learn the representation in an unsupervised/weakly-supervised manner is to use an auto-encoding mechanism that reconstructs the input image.

Instead, a novel ‘skeleton-based view synthesis’ generates the image under a new viewpoint, given an image under a known viewpoint as input.

source domain (input image): $\mathcal{S}^i = \{S^i_t\}_{i=1}^V$

target domain (generated image): $\mathcal{S}^j = \{S^j_t\}_{j=1}^V$

encoder $\phi: \mathcal{S}^i \to \mathcal{G}$

decoder $\psi: R_{i \to j} \times \mathcal{G} \to \mathcal{S}^j$

$L_{\ell_2}(\phi \cdot \psi, \theta) = \frac{1}{N^T} \sum \lVert \psi(R_{i \to j} \times \phi(S^i_t)) - S^j_t \rVert$
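A minimal PyTorch sketch of this encoder-rotation-decoder pipeline. The latent $\mathcal{G}$ is shaped here as $N$ 3D points so that $R_{i \to j}$ can act on it by matrix multiplication; the layer sizes, the 32×32 skeleton-map resolution, and the class names are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ViewSynthesisNet(nn.Module):
    """Sketch of skeleton-based view synthesis: encoder phi, rotation of the
    latent geometry representation G, decoder psi (assumes 32x32 inputs)."""

    def __init__(self, in_ch, n_points=64):
        super().__init__()
        self.n_points = n_points
        # phi: source skeleton maps -> latent geometry G (N x 3 "points").
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_points * 3),
        )
        # psi: rotated G -> skeleton maps in the target view.
        self.decoder = nn.Sequential(
            nn.Linear(n_points * 3, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, in_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, S):
        return self.encoder(S).view(-1, self.n_points, 3)  # (B, N, 3)

    def forward(self, S_i, R_ij):
        G = self.encode(S_i)
        G_rot = G @ R_ij.transpose(1, 2)   # apply R_{i->j} to every latent point
        return self.decoder(G_rot.flatten(1))

def synthesis_loss(model, S_i, S_j, R_ij):
    """L2 view-synthesis loss against the target-view skeleton maps."""
    return torch.mean((model(S_i, R_ij) - S_j) ** 2)
```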

Representation consistency constraint

image-skeleton mapping + view synthesis (the previous two steps) lead to unrealistic generation of the target pose when there are large self-occlusions in the source view

since there is no explicit constraint on the latent space to encourage $\mathcal{G}$ to be semantic

We assume there exists an inverse (one-to-one) mapping between the source and target domains, conditioned on the known relative rotation matrix. We can then find:

an encoder $\mu: \mathcal{S}^j \to \mathcal{G}$

a decoder $\nu: R_{j \to i} \times \mathcal{G} \to \mathcal{S}^i$

$l_{rc} = \sum_{m=1}^{M} \lVert f \times G_i - \tilde{G}_i \rVert^2$

total loss of the bidirectional model

where $\theta$ and $\zeta$ denote the parameters of the two encoder-decoder networks, respectively.
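One plausible reading of the consistency term and the bidirectional total loss, reusing the `ViewSynthesisNet` sketch above for both the $(\phi, \psi)$ and $(\mu, \nu)$ branches. Here $f$ is taken to be the relative rotation and the weight `lam` is an assumed hyper-parameter:

```python
import torch

def consistency_loss(G_i, G_j, R_ij):
    """Representation consistency: the source latent G_i, transformed by the
    relative rotation (the role of `f` in l_rc above, under our assumption),
    should match the latent G_j extracted from the target view.

    G_i, G_j: (B, N, 3) latent geometry representations.
    R_ij:     (B, 3, 3) relative rotation matrices.
    """
    return torch.mean(torch.sum((G_i @ R_ij.transpose(1, 2) - G_j) ** 2, dim=(1, 2)))

def total_loss(net_fwd, net_bwd, S_i, S_j, R_ij, R_ji, lam=1.0):
    """Total loss of the bidirectional model (net_fwd holds phi/psi with
    parameters theta, net_bwd holds mu/nu with parameters zeta): two
    view-synthesis L2 terms plus the weighted consistency term."""
    l_syn = (torch.mean((net_fwd(S_i, R_ij) - S_j) ** 2) +
             torch.mean((net_bwd(S_j, R_ji) - S_i) ** 2))
    l_rc = consistency_loss(net_fwd.encode(S_i), net_bwd.encode(S_j), R_ij)
    return l_syn + lam * l_rc
```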

3D human pose estimation by learnt representation

given: monocular image $I$
goal: $\mathbf{b} = \{(x^p, y^p, z^p)\}_{p=1}^P$, $P$ body joints, $\mathbf{b} \in \mathcal{B}$

function $\mathcal{F}: \mathcal{I} \to \mathcal{B}$ to learn the pose regression

2 fully connected layers
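A minimal sketch of the regression head $\mathcal{F}$: two fully connected layers mapping the flattened learnt representation of a monocular image to $P$ 3D joints. The hidden width and joint count are assumptions:

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """F: learnt geometry representation -> P 3D joints, via 2 FC layers."""

    def __init__(self, g_dim, n_joints=16, hidden=1024):
        super().__init__()
        self.n_joints = n_joints
        self.fc = nn.Sequential(
            nn.Linear(g_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, G):
        # G: (B, g_dim) flattened representation obtained from the encoder phi
        # applied to the skeleton map of the monocular input image I.
        return self.fc(G).view(-1, self.n_joints, 3)  # (B, P, 3) joints b
```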

Source: https://blog.csdn.net/qq_38682032/article/details/88957121