
Neural Network Compression Framework for fast model inference

2021-07-08 16:01:15 Author: Internet

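Paper background

Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev
Intel

Judging by the names, the authors all appear to be Russian.

NNCF is built on the PyTorch framework and provides compression techniques including quantization, sparsity, filter pruning and binarization. It can be used standalone or integrated into existing training code.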

Support of quantization, binarization, sparsity and filter pruning algorithms with fine-tuning.
Automatic model graph transformation in PyTorch – the model is wrapped and additional layers are inserted in the model graph.
Ability to stack compression methods and apply several of them at the same time.
Training samples for image classification, object detection and semantic segmentation tasks as well as configuration files to compress a range of models.
Ability to integrate compression-aware training into third-party repositories with minimal modifications of the existing training pipelines, which allows integrating NNCF into large-scale model/pipeline aggregation repositories such as MMDetection or Transformers.
Hardware-accelerated layers for fast model fine-tuning and multi-GPU training support.
Compatibility with the OpenVINO™ Toolkit for model inference.

NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding.
The sparsity algorithms implemented in NNCF constitute non-structured network sparsification approaches. Another approach is the so-called structured sparsity, which aims to prune away whole neurons or convolutional filters.
Each compression method acts on this wrapper by defining the following basic components:
- Compression Algorithm Builder
- Compression Algorithm Controller
- Compression Loss
- Compression Scheduler
Another important novelty of NNCF is the support of algorithm stacking, where users can build custom compression pipelines by combining several compression methods (for example, a model that is both sparse and quantized can be produced in a single training run).
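As a rough illustration of stacking, the sketch below writes a stacked configuration as a Python dict. It assumes the JSON config layout used in the NNCF repository (an `input_info` entry plus a `compression` list); the input shape and the choice of algorithms are illustrative, and `stacked_algorithms` is a hypothetical helper, not part of NNCF.

```python
# Hypothetical stacked-compression config, written as a Python dict that
# mirrors the JSON layout assumed for NNCF configs. Values are illustrative.
nncf_config = {
    "input_info": {"sample_size": [1, 3, 224, 224]},  # assumed input shape
    "compression": [
        {"algorithm": "magnitude_sparsity"},  # sparsify ...
        {"algorithm": "quantization"},        # ... and quantize in one run
    ],
}


def stacked_algorithms(config):
    """Return the names of the compression algorithms listed in the config."""
    section = config.get("compression", [])
    # A single algorithm may be given as one dict instead of a list.
    if isinstance(section, dict):
        section = [section]
    return [entry["algorithm"] for entry in section]


print(stacked_algorithms(nncf_config))
```

Listing two entries under `compression` is what expresses the stacking; a single-algorithm setup would list only one.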
Usage steps
- the model is wrapped by the transparent NNCFNetwork wrapper
- one or more particular compression algorithm builders are instantiated and applied to the wrapped model.
- The wrapped model can then be fine-tuned on the target dataset using either an original training pipeline, or a slightly modified pipeline.
- After the compressed model is trained, it can be exported to ONNX format for further use in the OpenVINO™ inference toolkit.

Methods it draws on:

The scale is learned during training and represents the actual range of the values being quantized.
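A minimal sketch of what a scale-based quantizer does in the forward pass, assuming a symmetric scheme of the common clamp-and-round form. The function name and the level ranges are assumptions for illustration; in real training the scale would be a learnable parameter updated by the optimizer, whereas here it is a plain number.

```python
def symmetric_fake_quantize(x, scale, bits=8, signed=True):
    """Quantize x to an integer grid covering [-scale, scale] and map it back.

    `scale` stands in for the trained range parameter (assumption: it is
    passed in as a plain float rather than learned here).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    if signed:
        level_low, level_high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        level_low, level_high = 0, 2 ** bits - 1
    s = level_high / scale                   # float value -> integer level
    q = round(x * s)                         # nearest integer level
    q = max(level_low, min(level_high, q))   # clamp to the representable range
    return q / s                             # back to floating point


print(symmetric_fake_quantize(0.5, scale=1.0))
print(symmetric_fake_quantize(10.0, scale=1.0))  # saturates at the scale
```

Values beyond the scale saturate, which is why the scale has to track the real range of the tensor.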

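Training optimizes the floating-point range, with level 0 placed at the range minimum.
The floating-point zero must map to an integer within the quantization range; this constraint keeps layers that use padding efficient to compute.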
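The zero-point constraint can be sketched as follows. This is a simplified affine scheme, not the exact range-adjustment formula from the paper: the range is widened to include zero, the zero point is rounded to an integer level, and quantization is done relative to that level so that the value 0.0 (for example, zero padding) round-trips exactly. All function names are hypothetical.

```python
def affine_params(low, high, bits=8):
    """Return (step, zero_point) for a range that is forced to contain 0."""
    levels = 2 ** bits
    low, high = min(low, 0.0), max(high, 0.0)
    if high == low:
        raise ValueError("empty range")
    step = (high - low) / (levels - 1)   # float width of one integer level
    zero_point = round(-low / step)      # integer level that represents 0.0
    return step, zero_point


def asymmetric_fake_quantize(x, low, high, bits=8):
    """Quantize x on the grid defined by affine_params and map it back."""
    step, zero_point = affine_params(low, high, bits)
    q = round(x / step) + zero_point
    q = max(0, min(2 ** bits - 1, q))    # clamp to [0, levels - 1]
    return (q - zero_point) * step


step, zp = affine_params(-0.3, 1.0)
print(zp, asymmetric_fake_quantize(0.0, -0.3, 1.0))
```

Because the zero point is an integer, padded zeros need no special handling in integer arithmetic.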

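Unlike QAT and TQT, the method in the paper does not fold batch normalization. To keep the statistics consistent between training and inference, however, it needs a large batch size (>256).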

The HAWQ-v2 method is used to select bit widths.
The sensitivity is computed as follows:
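$$\Omega_i = \overline{\mathrm{Tr}}(H_i)\,\lVert Q(W_i) - W_i \rVert_2^2$$
where $\overline{\mathrm{Tr}}(H_i)$ is the average Hessian trace of layer $i$ and $Q(W_i)$ is the quantized version of its weights $W_i$ (the form used in HAWQ-v2).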
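A minimal sketch of such a sensitivity score, assuming the HAWQ-v2 form: the average Hessian trace of a layer multiplied by the squared L2 norm of its quantization perturbation. Estimating the Hessian trace is the expensive part and is not shown; here it is simply passed in as a number, and `layer_sensitivity` is a hypothetical name.

```python
def layer_sensitivity(avg_hessian_trace, weights, quantized_weights):
    """Sensitivity = average Hessian trace * ||Q(W) - W||_2^2 (assumed form).

    `weights` and `quantized_weights` are flat lists of equal length.
    """
    if len(weights) != len(quantized_weights):
        raise ValueError("weight lists must have the same length")
    perturbation = sum((q - w) ** 2 for w, q in zip(weights, quantized_weights))
    return avg_hessian_trace * perturbation


# One weight moved by 0.5 under quantization, the other unchanged.
print(layer_sensitivity(2.0, [1.0, 2.0], [1.5, 2.0]))
```

A layer with a flat loss surface (small trace) or a small quantization error gets a low score and is a candidate for a lower bit width.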

Compression ratio = complexity of the INT8 model / complexity of the mixed-precision model
Complexity = FLOPs * bit-width
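The two definitions above translate directly into code. In this sketch the complexity is summed over layers, which is an assumption about how the per-layer products are aggregated; the function name and the FLOP counts are made up for illustration.

```python
def compression_ratio(layer_flops, bit_widths):
    """INT8 complexity divided by mixed-precision complexity.

    Complexity of a layer is FLOPs * bit-width; layers are summed (assumption).
    """
    if len(layer_flops) != len(bit_widths):
        raise ValueError("need one bit width per layer")
    int8_complexity = sum(flops * 8 for flops in layer_flops)
    mixed_complexity = sum(f * b for f, b in zip(layer_flops, bit_widths))
    return int8_complexity / mixed_complexity


# Two layers: the small one at 4 bits, the large one kept at 8 bits.
print(compression_ratio([100, 300], [4, 8]))
```

An all-INT8 configuration gives a ratio of 1.0, and an all-INT4 configuration gives 2.0.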

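Mixed-precision selection then means finding the precision configuration with the lowest total sensitivity among those that satisfy the compression ratio threshold.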
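The selection can be illustrated with an exhaustive search. This is only a sketch for a handful of layers, since the number of configurations grows exponentially; it is not the search procedure NNCF itself uses. The sensitivity table, the candidate bit widths and the function name are hypothetical.

```python
from itertools import product


def select_precision(layer_flops, sensitivity, ratio_threshold, choices=(4, 8)):
    """Pick per-layer bit widths with minimal total sensitivity.

    sensitivity[i][b] is the sensitivity of layer i at bit width b.
    Returns a tuple of bit widths, or None if no configuration reaches
    the required compression ratio.
    """
    int8_complexity = sum(f * 8 for f in layer_flops)
    best = None
    for config in product(choices, repeat=len(layer_flops)):
        mixed_complexity = sum(f * b for f, b in zip(layer_flops, config))
        if int8_complexity / mixed_complexity < ratio_threshold:
            continue  # not compressed enough
        total = sum(sensitivity[i][b] for i, b in enumerate(config))
        if best is None or total < best[0]:
            best = (total, config)
    return None if best is None else best[1]


flops = [100, 300]
sens = [{4: 5.0, 8: 0.0}, {4: 1.0, 8: 0.0}]
print(select_precision(flops, sens, ratio_threshold=1.5))
```

In this toy case the large, insensitive layer is pushed to 4 bits while the small, sensitive one stays at 8 bits.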

Weights are binarized using the XNOR and DoReFa methods.
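A rough sketch of the two weight binarization styles, assuming the usual formulations: XNOR-style binarization scales the sign of each weight by the mean absolute value of its own filter, while DoReFa-style binarization uses one mean absolute value for the whole layer. Filters are represented as flat lists, zero is treated as positive, and the function names are hypothetical.

```python
def binarize_xnor(filters):
    """Per-filter scaling: each filter becomes +/- mean(|w|) of that filter."""
    result = []
    for filt in filters:
        alpha = sum(abs(w) for w in filt) / len(filt)
        result.append([alpha if w >= 0 else -alpha for w in filt])
    return result


def binarize_dorefa(filters):
    """Per-layer scaling: every weight becomes +/- mean(|w|) of the layer."""
    flat = [w for filt in filters for w in filt]
    alpha = sum(abs(w) for w in flat) / len(flat)
    return [[alpha if w >= 0 else -alpha for w in filt] for filt in filters]


layer = [[1.0, -3.0], [0.5, 0.5]]
print(binarize_xnor(layer))
print(binarize_dorefa(layer))
```

The per-filter variant preserves differences in magnitude between filters, at the cost of storing one scale per filter.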

Stage 1: the network is trained without any binarization.
Stage 2: the training continues with binarization enabled for activations only.
Stage 3: binarization is enabled both for activations and weights.
Stage 4: the optimizer learning rate, which had been kept constant at previous stages, is decreased according to a polynomial law, while the weight decay parameter of the optimizer is set to 0.
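The four stages can be sketched as a function of the epoch. The stage boundaries, base learning rate, weight decay and polynomial power below are made-up values for illustration, and `stage_settings` is a hypothetical helper; the sketch assumes the boundaries are given in increasing order.

```python
def stage_settings(epoch, base_lr=1e-3, base_weight_decay=1e-4,
                   act_start=20, weight_start=30, decay_start=40,
                   total_epochs=60, power=2.0):
    """Return the training settings for a given epoch (illustrative values)."""
    binarize_activations = epoch >= act_start    # stages 2, 3 and 4
    binarize_weights = epoch >= weight_start     # stages 3 and 4
    if epoch < decay_start:
        lr, weight_decay = base_lr, base_weight_decay  # constant LR
    else:
        # Stage 4: polynomial learning-rate decay, weight decay switched off.
        progress = min(1.0, (epoch - decay_start) / (total_epochs - decay_start))
        lr = base_lr * (1.0 - progress) ** power
        weight_decay = 0.0
    stage = (1 + int(binarize_activations) + int(binarize_weights)
             + int(epoch >= decay_start))
    return {
        "stage": stage,
        "binarize_activations": binarize_activations,
        "binarize_weights": binarize_weights,
        "lr": lr,
        "weight_decay": weight_decay,
    }


for epoch in (0, 25, 35, 50):
    print(epoch, stage_settings(epoch))
```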

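NNCF supports two sparsity methods:
1 Training with magnitude-based sparsity (weights are pruned according to their magnitude)
2 Training based on L0 regularization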
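The first method can be sketched as a one-shot magnitude mask: the given fraction of weights with the smallest absolute value is zeroed. This is a simplified illustration with hypothetical function names; in actual training the sparsity level would typically be raised gradually by a scheduler, and the L0-regularization method is not shown.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest |w|."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Zero out the masked weights."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.1, -2.0, 0.05, 1.5]
mask = magnitude_mask(weights, sparsity=0.5)
print(mask, apply_mask(weights, mask))
```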

NNCF implements three different criteria for filter importance:

Source: https://blog.csdn.net/xieyi4650/article/details/118393883