
ICCV2019_Slimmable: (US-Nets) Universally Slimmable Networks and Improved Training Techniques


Introduction

  The original slimmable networks switch the network width among a predefined width set.

  => Motivation: can a single neural network run at arbitrary width? The authors argue that a wider network should perform no worse than its slim sub-networks, and that the residual error lies within the range given by the following bounded inequality:

$$0 = \delta_n \le \delta_k \le \delta_{k_0}, \quad \forall\, k \in [k_0, n]$$

  where $y_k$ denotes the aggregation of the first $k$ channels, $\delta_k = |y_n - y_k|$ is the residual error, and $k_0$ is a fixed hyperparameter (the smallest width).


   Challenges:

  First, how to deal with neural networks with batch normalization? (BN design)

  Second, how to train US-Nets efficiently? (training)

  Third, compared with training individual networks, what else can we explore in US-Nets to improve overall performance?

   The problems arise in the BN layers:

   First, accumulating independent BN statistics of all sub-networks in a US-Net during training is computationally intensive and inefficient.

  Second, if in each iteration we only update some sampled sub-networks, then these BN statistics are insufficiently accumulated thus inaccurate, leading to much worse accuracy in our experiments.

   To address these problems, the authors' contributions are as follows:

  (1) Trained a single network executable at arbitrary width.

  (2) Proposed two training techniques (the sandwich rule and inplace distillation).

  (3) Ran experiments and ablation studies on image classification, super-resolution, and reinforcement learning.

  (4) Further studied several parameters of the network: the width lower bound k0, the width divisor d, the number of sampled widths per iteration, and the size of the subset used for BN post-statistics.

  (5) Further proposed that each layer can use its own individual width ratio.

  (6) Laid groundwork for follow-up work (one-shot architecture search).

Related Work

  Slimmable Networks.

  Knowledge Distillation: transfer the learned knowledge from a pretrained network to a new one by training it with predicted features, soft targets, or both.

Method

  Feature aggregation in a single output neuron:

$$y = \sum_{i=1}^{n} w_i x_i \tag{1}$$

  where $n$ is the number of channels. With only the first $k$ channels, the partially aggregated feature is $y_k = \sum_{i=1}^{k} w_i x_i$, and the residual error $\delta_k$ between the fully aggregated feature $y_n$ and the partially aggregated feature $y_k$ satisfies:

$$\delta_k = |y_n - y_k|, \qquad 0 = \delta_n \le \delta_k \le \delta_{k_0}, \quad \forall\, k \in [k_0, n] \tag{3}$$

   Equation (3) suggests that a slimmable network can run at any width in the interval $[k_0, n]$ (US-Nets), and that conceptually the bounded inequality applies to any neural network, regardless of the kind of BN layer used.
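  As a toy illustration, a quick NumPy check of how $\delta_k$ behaves as more channels are aggregated (random weights and inputs stand in for a real layer, so the shrinking is a tendency, not a per-draw guarantee):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                      # number of input channels
w = rng.normal(size=n)      # weights of one output neuron
x = rng.normal(size=n)      # its input features
y_n = np.dot(w, x)          # fully aggregated feature (Eq. 1)

# residual error of the partial aggregation over the first k channels
for k in (16, 32, 48, 64):
    y_k = np.dot(w[:k], x[:k])
    print(f"k={k:2d}  delta_k={abs(y_n - y_k):.4f}")  # delta_64 is exactly 0
```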

  During training, the BN layer normalizes features as

$$\hat{y} = \gamma \frac{y - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{4}$$

  where $\epsilon$ is a small value that prevents division by zero, and $\gamma$ and $\beta$ are the learned scale and bias. The global feature mean and variance are updated by a moving average:

$$\mu_t = m\,\mu_{t-1} + (1-m)\,\mu_{B_t}, \qquad \sigma_t^2 = m\,\sigma_{t-1}^2 + (1-m)\,\sigma_{B_t}^2 \tag{5}$$

  where $m$ is the momentum and $t$ is the training iteration. At inference time, the global statistics from the final training iteration $T$ are used instead:

$$\hat{y} = \gamma^* \frac{y - \mu_T}{\sqrt{\sigma_T^2 + \epsilon}} + \beta^* \tag{6}$$

  where $\gamma^*$ and $\beta^*$ are the optimized scale and bias. Equation (6) can further be expressed as a simple linear transform:

$$\hat{y} = a\,y + b, \qquad a = \frac{\gamma^*}{\sqrt{\sigma_T^2 + \epsilon}}, \quad b = \beta^* - a\,\mu_T \tag{7}$$

  Besides the moving average of Equation (5), the statistics can also be computed as an exact average over all $T$ batches:

$$\mu = \frac{1}{T}\sum_{t=1}^{T}\mu_{B_t}, \qquad \sigma^2 = \frac{1}{T}\sum_{t=1}^{T}\sigma_{B_t}^2$$

  In practice, the authors simply compute the BN statistics for each width after training is finished (all these formulas just for that? treat it as a refresher on BN theory), because a randomly sampled subset of the training set already yields an accurate estimate of the statistics.
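  A minimal sketch of this post-training BN calibration in PyTorch. The `set_width()` helper and loader names are illustrative assumptions, not the authors' released API:

```python
import torch

def calibrate_bn(model, loader, num_batches=50, device="cuda"):
    """Recompute BN statistics for the currently selected width.

    Resets each BN layer's running mean/var, then forwards a few
    randomly sampled training batches so the statistics are
    re-accumulated for this width (no gradient updates needed).
    """
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative (exact) averaging in PyTorch
    model.train()              # BN updates running stats only in train mode
    with torch.no_grad():
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            model(images.to(device))

# usage sketch: calibrate once per deployed width
# for w in (0.25, 0.5, 0.75, 1.0):
#     model.set_width(w)   # hypothetical width-switch helper
#     calibrate_bn(model, train_loader)
```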

 

   For training, the authors propose the sandwich rule and inplace distillation.

  The sandwich rule: in each training iteration, randomly sample n−2 sub-networks from the width multiplier range [0.25, 1.0]×, then add the smallest and largest widths to obtain n sub-networks in total (see the sampling sketch after the list below). The performance of the randomly sampled sub-networks then also lies between that of 0.25× and 1.0×.

  The sandwich rule shows better convergence and better overall performance. Its advantages:

  1. Training the largest and smallest sub-networks and observing their validation errors directly yields the upper and lower bounds of the model's performance.

  2. Training the largest sub-network is necessary for inplace distillation.
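  A minimal sketch of the width sampling under the sandwich rule (plain Python; the range [0.25, 1.0] and the count n follow the paper's description, the function name is ours):

```python
import random

def sample_widths(n_samples=4, min_width=0.25, max_width=1.0):
    """Sandwich rule: always include the smallest and largest widths,
    plus (n_samples - 2) width multipliers drawn uniformly in between."""
    middle = [random.uniform(min_width, max_width) for _ in range(n_samples - 2)]
    return sorted([min_width, max_width] + middle)
```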

  Inplace distillation: the predictions made by the largest sub-network during training are used as training labels for the other sub-networks, while the largest sub-network itself is trained with the ground truth.

  Image classification: predicted soft probabilities by the largest width, with cross entropy as the objective function.

  Image super-resolution: predicted high-resolution patches are used as labels, with either $\ell_1$ or $\ell_2$ as the training objective.

  Reinforcement learning: the policy predicted by the model at the largest width is used as the roll-outs.

  The authors also tried combining the predicted labels with the ground-truth labels as training labels for the sub-networks, but it performed worse.

  Training procedure:
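  A minimal sketch of one training step combining the sandwich rule with inplace distillation, in PyTorch. It reuses `sample_widths()` from above; `model.set_width()` is again a hypothetical width switch, not the authors' released API:

```python
import torch.nn.functional as F

def train_step(model, images, targets, optimizer):
    optimizer.zero_grad()
    widths = sample_widths(n_samples=4)       # sandwich rule: 0.25x, ..., 1.0x

    # largest width: train with ground truth, keep its predictions as labels
    model.set_width(max(widths))
    logits_full = model(images)
    F.cross_entropy(logits_full, targets).backward()
    soft_labels = logits_full.detach().softmax(dim=1)  # stop gradient here

    # other widths: inplace distillation from the largest sub-network
    for w in sorted(widths)[:-1]:
        model.set_width(w)
        logits = model(images)
        # cross entropy against soft targets: -sum(p * log q)
        loss = -(soft_labels * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss.backward()                        # gradients accumulate over widths

    optimizer.step()
```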

Experiments

  ImageNet Classification:

 

 

   Image Super-Resolution.

 

 

   The authors attribute the weaker performance to using hyperparameters that are optimal for the individually trained models rather than optimal for US-Nets.

  Deep Reinforcement Learning

Ablation Study

  The Sandwich Rule:

 

 

   The sandwich rule has better performance on average, with good accuracy at both the smallest width and the largest width.

  Training the small sub-networks matters more than training the large ones.

  Inplace Distillation:

 

 

   Post-Statistics of Batch Normalization:

   Width Lower Bound

 

 

  Width divisor d: as in MobileNets, the channel number is floored approximately as

$$c' = \left\lfloor \frac{c \cdot w}{d} \right\rfloor \times d$$

  where $c$ is the base channel number, $w$ the width multiplier, and $d$ the divisor.
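  In common MobileNet implementations this corresponds to the `make_divisible` helper; a sketch (it rounds to the nearest multiple rather than strictly flooring, and exact clamping details vary between codebases):

```python
def make_divisible(channels, divisor=8, min_value=None):
    """Round a scaled channel count to a multiple of `divisor`,
    never dropping more than 10% below the requested value."""
    if min_value is None:
        min_value = divisor
    new_c = max(min_value, int(channels + divisor / 2) // divisor * divisor)
    if new_c < 0.9 * channels:
        new_c += divisor
    return new_c

# e.g. a 0.35x multiplier on a 64-channel layer:
# make_divisible(64 * 0.35)  ->  24
```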

   Number of Sampled Widths Per Iteration n

 

 

Review: this paper addresses the fixed-width limitation of the ICLR 2019 slimmable networks, mainly showing that a network can execute at arbitrary width. The sandwich rule and inplace distillation provide a training recipe for follow-up work.

 
