Neural Network Compression Framework for fast model inference
Paper background
- Authors: Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev (Intel)
- The names all look Russian
- Journal/Conference: CVPR 2020
Abstract
Built on the PyTorch framework, NNCF provides compression techniques such as quantization, sparsity, filter pruning, and binarization. It can be used standalone or integrated into existing training code.
Features
- Support of quantization, binarization, sparsity and filter pruning algorithms with fine-tuning.
- Automatic model graph transformation in PyTorch – the model is wrapped and additional layers are inserted in the model graph.
- Ability to stack compression methods and apply several of them at the same time.
- Training samples for image classification, object detection and semantic segmentation tasks as well as configuration files to compress a range of models.
- Ability to integrate compression-aware training into third-party repositories with minimal modifications of the existing training pipelines, which allows integrating NNCF into large-scale model/pipeline aggregation repositories such as MMDetection or Transformers.
- Hardware-accelerated layers for fast model fine-tuning and multi-GPU training support.
- Compatibility with the OpenVINO™ Toolkit for model inference.
A few caveats and Framework Architecture
- NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding.
- The sparsity algorithms implemented in NNCF constitute non-structured network sparsification approaches. Another approach is the so-called structured sparsity, which aims to prune away whole neurons or convolutional filters.
- Each compression method acts on this wrapper by defining the following basic components:
- Compression Algorithm Builder
- Compression Algorithm Controller
- Compression Loss
- Compression Scheduler
- Another important novelty of NNCF is the support of algorithm stacking, where users can build custom compression pipelines by combining several compression methods (e.g., a model can be made both sparse and quantized in a single training run; the config in the sketch after the usage steps below stacks two algorithms).
- Usage steps:
- the model is wrapped by the transparent NNCFNetwork wrapper
- one or more particular compression algorithm builders are instantiated and applied to the wrapped model.
- The wrapped model can then be fine-tuned on the target dataset using either an original training pipeline, or a slightly modified pipeline.
- After the compressed model is trained, we can export it to ONNX format for further usage in the OpenVINO™ inference toolkit.
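The four steps above map onto a small amount of user code. Below is a minimal sketch using NNCF's PyTorch API (`NNCFConfig`, `create_compressed_model`, the compression controller); exact import paths and config fields vary between NNCF releases, and the dummy data and loss are placeholders, so treat this as illustrative rather than the framework's canonical example. The two-entry `compression` list also illustrates algorithm stacking (sparsity + quantization).

```python
# Minimal sketch of the wrap -> fine-tune -> export flow (illustrative; NNCF import
# paths and config fields differ slightly between releases).
import torch
import torchvision
from nncf import NNCFConfig
from nncf.torch import create_compressed_model  # older releases: from nncf import create_compressed_model

model = torchvision.models.resnet18(weights=None)

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    # Stacking: both algorithms are applied to the same model in one training run.
    "compression": [
        {"algorithm": "magnitude_sparsity"},
        {"algorithm": "quantization"},
    ],
})

# Steps 1-2: the model is wrapped into an NNCFNetwork and the algorithm builders are applied.
# (In practice an init data loader is usually registered via register_default_init_args
# so that quantization ranges can be initialized from real data.)
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# Step 3: fine-tune with an (almost) original pipeline; only the compression loss and
# scheduler calls are added. Random tensors stand in for the real training loader here.
optimizer = torch.optim.SGD(compressed_model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(1):
    compression_ctrl.scheduler.epoch_step()
    for _ in range(10):
        images = torch.randn(8, 3, 224, 224)
        targets = torch.randint(0, 1000, (8,))
        optimizer.zero_grad()
        loss = criterion(compressed_model(images), targets) + compression_ctrl.loss()
        loss.backward()
        optimizer.step()
        compression_ctrl.scheduler.step()

# Step 4: export to ONNX for the OpenVINO toolkit.
compression_ctrl.export_model("compressed_model.onnx")
```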
Compression Methods Overview
Quantization
It draws on the following methods:
- QAT
- PACT
- TQT
| | $q_{min}$ | $q_{max}$ |
| --- | --- | --- |
| Weights | $-2^{bits-1}+1$ | $2^{bits-1}-1$ |
| Signed activations | $-2^{bits-1}$ | $2^{bits-1}-1$ |
| Unsigned activations | $0$ | $2^{bits}-1$ |
Symmetric quantization
The scale is learned during training and represents the actual floating-point range.
Asymmetric quantization
Training optimizes the floating-point range directly (its minimum and maximum).
After mapping, the floating-point zero must correspond to an integer within the quantized range; this restriction keeps layers with padding computationally efficient.
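A schematic fake-quantization pair, not the paper's exact formulas, to make the two modes concrete: in the symmetric mode only a scale is learned and the zero point stays at 0, while in the asymmetric mode a float range [low, high] is learned and float zero must land exactly on an integer level. The function names and grid handling below are illustrative assumptions.

```python
import torch

def fake_quant_symmetric(x, scale, bits=8, signed=True):
    # Symmetric mode: only `scale` (the learned range bound) is trained; zero point is fixed at 0.
    if signed:
        q_min, q_max = -2 ** (bits - 1) + 1, 2 ** (bits - 1) - 1   # weights row of the table
    else:
        q_min, q_max = 0, 2 ** bits - 1                            # unsigned activations
    step = scale / q_max
    q = torch.clamp(torch.round(x / step), q_min, q_max)
    return q * step                                                # de-quantized ("fake" quantization)

def fake_quant_asymmetric(x, low, high, bits=8):
    # Asymmetric mode: the float range [low, high] is trained; float 0.0 must map onto an
    # integer level (enforced here by rounding the zero point), which keeps padded layers cheap.
    levels = 2 ** bits - 1
    step = (high - low) / levels
    zero_point = round(-low / step)
    q = torch.clamp(torch.round(x / step) + zero_point, 0, levels)
    return (q - zero_point) * step
```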
Training and inference
Unlike QAT and TQT, the method in the paper does not perform BN folding; to keep the batch-norm statistics consistent between training and inference, a large batch size (>256) is required.
Mixed-precision quantization
Bit widths are selected using the HAWQ-v2 method.
Sensitivity is computed as follows:
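A plausible form, following the HAWQ-v2 definition (an assumption here, not a quote from the NNCF paper), weighs each layer's quantization perturbation by its average Hessian trace:

$$
\Omega = \sum_{i} \overline{\mathrm{Tr}}(H_i)\,\bigl\lVert Q(W_i) - W_i \bigr\rVert_2^2
$$

where $\overline{\mathrm{Tr}}(H_i)$ is the average Hessian trace of layer $i$ and $Q(W_i) - W_i$ is the perturbation introduced by quantizing its weights at the candidate bit width.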
Compression rate = INT8 complexity / mixed-precision complexity
Complexity = FLOPs × bit width
Mixed-precision quantization then searches for the bit-width configuration with the lowest total sensitivity that still satisfies the compression-rate threshold, as sketched below.
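A brute-force sketch of that selection criterion (hypothetical helper names and made-up numbers; the real HAWQ/NNCF search is smarter than enumerating every configuration, but the objective and the constraint are the ones stated above):

```python
import itertools

def complexity(flops_per_layer, bit_widths):
    # Complexity = FLOPs * bit-width, summed over layers (definition above).
    return sum(f * b for f, b in zip(flops_per_layer, bit_widths))

def select_bit_widths(flops_per_layer, sensitivity, candidates=(4, 8), min_ratio=1.5):
    """Return the per-layer bit widths with the lowest total sensitivity whose
    compression rate (INT8 complexity / mixed-precision complexity) >= min_ratio.
    `sensitivity[i][b]` is the sensitivity of layer i quantized to b bits."""
    int8 = complexity(flops_per_layer, [8] * len(flops_per_layer))
    best, best_score = None, float("inf")
    for cfg in itertools.product(candidates, repeat=len(flops_per_layer)):
        if int8 / complexity(flops_per_layer, cfg) < min_ratio:
            continue  # does not meet the compression-rate threshold
        score = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        if score < best_score:
            best, best_score = cfg, score
    return best

# Toy example: 3 layers with made-up FLOPs and per-bit-width sensitivities.
flops = [200e6, 100e6, 50e6]
sens = [{4: 0.9, 8: 0.1}, {4: 0.2, 8: 0.05}, {4: 0.05, 8: 0.01}]
print(select_bit_widths(flops, sens))   # -> (4, 8, 4) with these numbers
```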
Binarization
Weights are binarized using the XNOR and DoReFa schemes (see the sketch after the stage list below). Training proceeds in stages:
- Stage 1: the network is trained without any binarization,
- Stage 2: the training continues with binarization enabled for activations only,
- Stage 3: binarization is enabled both for activations and weights,
- Stage 4: the optimizer learning rate, which had been kept constant at previous stages, is decreased according to a polynomial law, while weight decay parameter of the optimizer is set to 0.
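A sketch of the two weight-binarization schemes named above (function names and the per-filter shapes are assumptions; during training the gradients flow through a straight-through estimator, which is omitted here):

```python
import torch

def xnor_binarize(w):
    # XNOR-Net style: one positive scale per output filter (mean of |w| over that filter).
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)   # conv weight shaped [out, in, kh, kw]
    return alpha * torch.sign(w)

def dorefa_binarize(w):
    # DoReFa style: a single scalar scale shared by the whole weight tensor.
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

w = torch.randn(16, 3, 3, 3)
print(xnor_binarize(w).unique().numel(), dorefa_binarize(w).unique().numel())  # ~2 values per filter / 2 overall
```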
Sparsity
NNCF supports two sparsity methods (a magnitude-based sketch follows the list):
1. Magnitude-based: weights are pruned according to their magnitudes during training.
2. Regularization-based: training with an L0-regularization term.
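A minimal sketch of criterion 1 (magnitude-based, unstructured: individual weights are zeroed rather than whole filters, matching the caveat earlier); the helper name and thresholding details are illustrative:

```python
import torch

def magnitude_mask(w, sparsity_level):
    # Zero out the `sparsity_level` fraction of weights with the smallest magnitude.
    k = int(sparsity_level * w.numel())
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values
    return (w.abs() > threshold).to(w.dtype)

w = torch.randn(64, 128)
mask = magnitude_mask(w, 0.5)           # applied multiplicatively during fine-tuning
print((mask == 0).float().mean())       # ~0.5
```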
Filter pruning
NNCF implements three different criteria for filter importance (sketched below):
- L1-norm,
- L2-norm
- geometric median.
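A sketch of the three importance criteria for a convolutional weight (hypothetical helper; for the geometric-median case, following the FPGM idea, filters whose summed distance to the other filters is small are treated as redundant):

```python
import torch

def filter_importance(weight, criterion="l2"):
    """Per-filter importance for a conv weight of shape [out_ch, in_ch, kh, kw];
    the filters with the lowest importance are the pruning candidates."""
    flat = weight.flatten(1)                       # one row per output filter
    if criterion == "l1":
        return flat.abs().sum(dim=1)
    if criterion == "l2":
        return flat.norm(p=2, dim=1)
    if criterion == "geometric_median":
        # Filters close to all the others (i.e. near the geometric median) carry
        # redundant information, so importance = total distance to the remaining filters.
        return torch.cdist(flat, flat, p=2).sum(dim=1)
    raise ValueError(f"unknown criterion: {criterion}")

w = torch.randn(32, 16, 3, 3)
print(filter_importance(w, "geometric_median").argsort()[:4])  # 4 least important filters
```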