
[Paper Reading, ICCV 2019 Oral, Object Tracking] Learning Discriminative Model Prediction for Tracking

2021-04-02 20:32:08 Author: Internet

Introduction

Paper: Learning Discriminative Model Prediction for Tracking

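DiMP is another classic from Martin, who keeps delivering. The paper's motivation is threefold. First, current Siamese trackers focus only on the target's features and ignore background information. Second, current Siamese trackers are trained offline, while most objects encountered during tracking do not appear in the training set, which makes online tracking unreliable. Third, most current SOTA trackers use very simple template update strategies, which makes them less robust.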

To address these problems, the paper builds on ATOM and proposes DiMP, a more powerful tracking model!

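As the figure below shows, the Siamese-style trackers that have been popular in recent years typically use only the target feature (obtained by cropping) and typically do no online training.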

[Figure]

Main Content

[Figure]

DiMP mainly improves the target classification part of ATOM. The figure above shows DiMP's target classification branch.

Discriminative Learning Loss

The key component of target classification is the model predictor D, which is updated through online training to make the model more reliable.

[Figure]

For the target classification part, the paper therefore proposes the following loss function:

$$L(f)=\frac{1}{\left|S_{\text{train}}\right|} \sum_{(x, c) \in S_{\text{train}}}\|r(x * f, c)\|^{2}+\|\lambda f\|^{2}$$

where $f = D(S_{\text{train}})$, $*$ denotes convolution and $\lambda$ is a regularization factor. The function $r(s, c)$ computes the residual at every spatial location based on the target confidence scores $s = x * f$ and the ground-truth target center coordinate $c$.

For the choice of the residual function $r(x*f, c)$, the paper argues that the simple form $r(x*f, c) = x*f - y_c$ (where $y_c$ are the desired target scores at each location, popularly set to a Gaussian function centered at $c$) makes the model focus on the negative samples, because only a small part of the Gaussian label has large values, so the learned model is not the best one.
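To make the "only a small part of the Gaussian label has large values" point concrete, here is a minimal sketch that builds a Gaussian label and counts how many locations carry a large value. The grid size (19x19), the center and `sigma = 1.5` are assumed toy values, not settings taken from the paper.

```python
import math

def gaussian_label(size, center, sigma):
    """Gaussian-shaped label y_c on a size x size grid, peaked at center = (row, col)."""
    return [
        [math.exp(-((r - center[0]) ** 2 + (c - center[1]) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]

# Assumed toy settings: a 19x19 score map with the target at the center.
y_c = gaussian_label(size=19, center=(9, 9), sigma=1.5)
values = [v for row in y_c for v in row]

# Fraction of locations whose label is "large" (above 0.5).
frac_high = sum(v > 0.5 for v in values) / len(values)
print(f"fraction of locations with label > 0.5: {frac_high:.3f}")
```

With these toy settings only a few percent of the locations have a label above 0.5, so a plain least-squares residual is dominated by the many near-zero (negative) locations.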

To address this, the paper takes inspiration from SVMs and uses a hinge-like loss in the residual function, defined as:

$$r(s, c)=v_{c} \cdot\left(m_{c} s+\left(1-m_{c}\right) \max (0, s)-y_{c}\right)$$

Here, the target mask $m_c$, the spatial weight $v_c$, the regularization factor $\lambda$, and the regression target $y_c$ can all be learned rather than set by hand. Section 3.4 of the paper describes this in detail, and I will briefly come back to it later.
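The effect of the mask is easiest to see on a few numbers. The sketch below evaluates the hinge-like residual next to the plain residual $s - y_c$ on a toy 1-D score map. All values of `scores`, `targets`, `masks` and `weights` are made up for illustration. In DiMP these quantities are learned, so this only illustrates the formula and is not the paper's implementation.

```python
def hinge_like_residual(s, y_c, m_c, v_c):
    """r(s, c) = v_c * (m_c * s + (1 - m_c) * max(0, s) - y_c) at one location."""
    return v_c * (m_c * s + (1.0 - m_c) * max(0.0, s) - y_c)

# Toy 1-D score map: index 2 is the target, the rest is background.
scores  = [-0.8, 0.3, 0.9, -0.2]   # s = x * f
targets = [0.0, 0.0, 1.0, 0.0]     # y_c
masks   = [0.0, 0.0, 1.0, 0.0]     # m_c: 1 at the target, 0 in background (assumed values)
weights = [1.0, 1.0, 1.0, 1.0]     # v_c: uniform spatial weight (assumed)

r_hinge = [hinge_like_residual(s, y, m, v)
           for s, y, m, v in zip(scores, targets, masks, weights)]
r_plain = [s - y for s, y in zip(scores, targets)]

print("hinge-like:", r_hinge)
print("plain     :", r_plain)
```

Where $m_c = 1$ the residual reduces to plain regression $s - y_c$. Where $m_c = 0$, negative background scores are clipped by $\max(0, s)$ and contribute nothing, so the model is not forced to regress every negative sample to exactly zero.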

Optimization-Based Architecture

The previous section introduced the loss function for target classification. Minimizing this loss gives the optimal filter f.

The most direct way to minimize the loss is gradient descent, which can be written as:

$$f^{(i+1)}=f^{(i)}-\alpha \nabla L\left(f^{(i)}\right)$$

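Martin argues that plain gradient descent converges slowly, because it uses a fixed step length instead of adapting the step to the current data or model estimate. Instead, the filter f is refined iteratively with steepest descent, which computes a step length at every iteration, to obtain a good filter f (see Section 3.2 of the paper).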
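The difference between a fixed and an adaptive step length can be sketched on a toy regularized least-squares problem $L(f) = \|Af - b\|^2 + \|\lambda f\|^2$. The matrix `A`, vector `b`, `lam` and the fixed step `alpha = 0.01` are made-up values. The step $\alpha = g^\top g / (g^\top H g)$ used here is the textbook exact line search for a quadratic loss. It stands in for the step length derived in Section 3.2 of the paper and is not that paper's implementation.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical toy problem (not from the paper).
A = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 0.5]
lam = 0.1

def loss(f):
    res = [p - q for p, q in zip(matvec(A, f), b)]
    return dot(res, res) + lam ** 2 * dot(f, f)

def grad(f):
    """Gradient 2 * (A^T (A f - b) + lam^2 f)."""
    res = [p - q for p, q in zip(matvec(A, f), b)]
    at_res = matvec(transpose(A), res)
    return [2.0 * (g + lam ** 2 * x) for g, x in zip(at_res, f)]

def hess_vec(v):
    """H v with H = 2 * (A^T A + lam^2 I), the exact Hessian of this quadratic loss."""
    ata_v = matvec(transpose(A), matvec(A, v))
    return [2.0 * (h + lam ** 2 * x) for h, x in zip(ata_v, v)]

def fixed_step_gd(f, alpha, n_iter):
    for _ in range(n_iter):
        g = grad(f)
        f = [x - alpha * gi for x, gi in zip(f, g)]
    return f

def steepest_descent(f, n_iter):
    for _ in range(n_iter):
        g = grad(f)
        g_h_g = dot(g, hess_vec(g))
        if g_h_g == 0.0:                 # zero gradient: already at the optimum
            break
        alpha = dot(g, g) / g_h_g        # step length recomputed at every iteration
        f = [x - alpha * gi for x, gi in zip(f, g)]
    return f

f0 = [0.0, 0.0]
f_gd = fixed_step_gd(f0, alpha=0.01, n_iter=5)
f_sd = steepest_descent(f0, n_iter=5)
print("loss at start          :", loss(f0))
print("fixed step, 5 iters    :", loss(f_gd))
print("adaptive step, 5 iters :", loss(f_sd))
```

The fixed step is deliberately conservative here. A larger fixed step would do better on this particular problem, but it has to be tuned by hand, while the adaptive step needs no tuning.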

Initial Filter Prediction

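The model predictor D also contains a model initializer module, which consists of a convolutional layer followed by precise ROI pooling. This module is only responsible for providing a reasonable initial estimate, not for predicting the final model. The final model is produced by the model optimizer.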

However, rather than predicting the final model, our initializer network is tasked with only providing a reasonable initial estimate, which is then processed by the optimizer module to provide the final model.

[Figure]

Learning the Discriminative Learning Loss

As mentioned above, $m_c$, $v_c$, $\lambda$ and $y_c$ in the residual function $r(s, c)$ can all be learned, whereas in earlier trackers they were usually designed by hand.

The paper uses the regression target $y_c$ as an example. Normally $y_c$ is set by hand to a Gaussian-shaped label. The paper instead defines it as:

$$y_{c}(t)=\sum_{k=0}^{N-1} \phi_{k}^{y} \rho_{k}(\|t-c\|)$$

$$\rho_{k}(d)=\begin{cases} \max \left(0,\, 1-\dfrac{|d-k \Delta|}{\Delta}\right), & k<N-1 \\ \max \left(0,\, \min \left(1,\, 1+\dfrac{d-k \Delta}{\Delta}\right)\right), & k=N-1 \end{cases}$$

Here $\rho_{k}(d)$ and $\|t-c\|$ are computed directly, so the only parameters that actually need to be learned are the coefficients $\phi_{k}^{y}$. In short, training learns a good set of $\phi_{k}^{y}$.
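The two formulas above can be sketched in a few lines. The coefficients `phi` and the knot spacing `delta` below are made-up values for illustration. In DiMP the $\phi_k^y$ are learned, so this shows only how the basis functions combine.

```python
def rho(k, d, n_basis, delta):
    """Triangular basis rho_k(d); the last one (k = N - 1) saturates at 1 for large d."""
    if k < n_basis - 1:
        return max(0.0, 1.0 - abs(d - k * delta) / delta)
    return max(0.0, min(1.0, 1.0 + (d - k * delta) / delta))

def label_value(d, phi, delta):
    """y_c(t) for a location at distance d = ||t - c|| from the target center."""
    n_basis = len(phi)
    return sum(phi[k] * rho(k, d, n_basis, delta) for k in range(n_basis))

# Hypothetical coefficients phi_k^y: learned in DiMP, made up here.
phi = [1.0, 0.6, 0.2, 0.0]
delta = 1.0

for d in [0.0, 0.5, 1.0, 2.5, 10.0]:
    print(f"d = {d:4.1f}  ->  y_c = {label_value(d, phi, delta):.3f}")
```

For $d \ge 0$ the basis functions sum to one, so $y_c$ is a piecewise-linear interpolation of the coefficients with knots at $d = k\Delta$, held constant at $\phi_{N-1}^{y}$ beyond the last knot. Learning $\phi_k^y$ therefore amounts to learning the radial profile of the label.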

Bounding Box Estimation

[Figure]

The target estimation part uses the IoU-Net architecture from ATOM, as shown in the figure above (see ATOM for details).

Offline Training

During offline training, the feature extractor, the target classification part and the target estimation part are trained together as a whole.

The classification loss for the target classification part is defined as follows (see the original paper for details):

$$L_{\mathrm{cls}}=\frac{1}{N_{\mathrm{iter}}} \sum_{i=0}^{N_{\mathrm{iter}}} \sum_{(x, c) \in S_{\mathrm{test}}}\left\|\ell\left(x * f^{(i)}, z_{c}\right)\right\|^{2}$$

$$\ell(s, z)=\begin{cases} s-z, & z>T \\ \max (0, s), & z \leq T \end{cases}$$
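A small numeric sketch of $\ell(s, z)$ and $L_{\mathrm{cls}}$ may help. It uses a single test sample, a flattened 1-D score map and two filter iterates ($i = 0, 1$, so $N_{\mathrm{iter}} = 1$). The scores, labels and the threshold value `0.05` are assumed toy values, and the helper names are hypothetical.

```python
def cls_residual(s, z, threshold):
    """l(s, z): regression where the label z exceeds the threshold T, hinge elsewhere."""
    return s - z if z > threshold else max(0.0, s)

def cls_loss(score_maps, labels, threshold):
    """Squared l(s, z) summed over all iterates i = 0..N_iter, divided by N_iter.

    score_maps[i] holds the scores x * f^(i) of one test sample (toy setting);
    at least two iterates are expected so that N_iter >= 1.
    """
    n_iter = len(score_maps) - 1
    total = 0.0
    for scores in score_maps:
        total += sum(cls_residual(s, z, threshold) ** 2 for s, z in zip(scores, labels))
    return total / n_iter

labels = [0.0, 0.02, 1.0, 0.02]          # z_c: only index 2 is above the threshold
score_maps = [
    [0.4, -0.3, 0.5, 0.1],               # scores from the initial filter f^(0)
    [0.1, -0.5, 0.9, 0.0],               # scores after one optimizer iteration, f^(1)
]
threshold = 0.05                         # T (assumed value for this toy example)

print("L_cls =", cls_loss(score_maps, labels, threshold))
```

Background locations with negative scores add nothing to the loss, and including every iterate $f^{(i)}$ in the sum penalizes poor intermediate filters as well as the final one.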

The target estimation part uses an IoU loss $L_{\mathrm{bb}}$ between the predicted bbox and the ground truth.

The final total loss is defined as:

$$L_{\mathrm{tot}}=\beta L_{\mathrm{cls}}+L_{\mathrm{bb}}$$

Online Tracking

During online tracking, target classification first localizes the target center. Target estimation then predicts the target's bbox.

Experimental Results

[Figure]

Summary

A classic masterpiece from Martin that deserves careful study. You have to admire his mathematical skill. Too good!

Source: https://blog.csdn.net/qq_39621037/article/details/115406985