
[Paper Reading] Post-reading summary, 2014: "An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS"


DaDianNao: A Machine-Learning Supercomputer [2014] — read on 2021-04-29

-1 REFLECTIONS

My takeaway: I have accepted that I am a garbage-producing machine… T _ T…

0 ABSTRACT

Considering that applications of AI algorithms are appearing at an increasing rate, a number of neural network accelerators have been proposed to raise the computational-capacity/area ratio, but they remain limited by memory accesses. This paper proposes a custom multi-chip architecture for machine learning which, for a 64-chip system running some of the largest known neural network layers, achieves an average speedup of 450.65x over a GPU while reducing energy by 150.31x.

1 RELATED WORK

Temam [2] proposed a neural network accelerator for multi-layer perceptrons, but not for DNNs. Esmaeilzadeh et al. [3] proposed an NPU that approximates program functions via a hardware neural network, which is not dedicated to machine learning. Chen et al. [4] proposed an accelerator for DNNs. However, these accelerators are all limited by the size of the neural network and by the storage needed for the networks' computed values. Moreover, Chen et al. [4] confirmed that memory access is the bottleneck of neural network accelerators.

2 STATE-OF-THE-ART MACHINE-LEARNING TECHNIQUES

2.1 Main Layer Types

Both CNNs and DNNs are composed of four main layer types: convolutional layers (CONV), pooling layers (POOL), local response normalization layers (LRN), and classifier layers (CLASS), which together produce an effective classification at the output.
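As a rough illustration of how these four layer types compose, here is a minimal NumPy sketch (not the paper's implementation; all layer sizes, kernel shapes, and LRN constants are invented for the example):

```python
import numpy as np

def conv(x, w):
    # CONV: slide each output kernel over all input feature maps
    # (valid padding, stride 1), then apply a ReLU nonlinearity.
    co, ci, k, _ = w.shape
    h = x.shape[1] - k + 1
    out = np.zeros((co, h, h))
    for o in range(co):
        for i in range(h):
            for j in range(h):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return np.maximum(out, 0)

def pool(x, s=2):
    # POOL: max over non-overlapping s x s windows in each feature map.
    c, h, w = x.shape
    return x[:, :h - h % s, :w - w % s].reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

def lrn(x, k=2.0, alpha=1e-4, beta=0.75, n=5):
    # LRN: normalize each activation by a sum of squares over nearby maps.
    sq = x ** 2
    denom = np.empty_like(x)
    c = x.shape[0]
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom[i] = k + alpha * sq[lo:hi].sum(axis=0)
    return x / denom ** beta

def classifier(x, w):
    # CLASS: fully connected layer over the flattened features.
    return w @ x.ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))               # 3 input maps, 8x8
y = conv(x, rng.standard_normal((4, 3, 3, 3)))   # -> (4, 6, 6)
y = pool(lrn(y))                                 # -> (4, 3, 3)
scores = classifier(y, rng.standard_normal((10, 4 * 3 * 3)))
print(scores.shape)  # (10,)
```

The point of the sketch is only the data flow: CONV and CLASS are multiply-accumulate dominated (and synapse-heavy), while POOL and LRN have no weights at all, which is why the paper benchmarks each type separately.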

2.2 Benchmarks

This article uses the 10 largest known layers of each type, plus a full CNN from the ImageNet 2012 competition, as benchmarks. The detailed configuration of each layer is recorded in the paper.

3 THE GPU OPTION

The paper evaluates the performance of the layer types above in CUDA on a GPU (NVIDIA K20M, 5 GB GDDR5, 208 GB/s memory bandwidth, 3.52 TFlops peak, 28 nm technology) and on a 256-bit SIMD CPU (Intel Xeon E5-4620 Sandy Bridge-EP, 2.2 GHz, 1 TB memory). The analysis shows that GPUs are highly efficient on LRN layers thanks to their SIMD nature. However, the drawbacks of GPUs are obvious: high cost, poor suitability for industrial applications, and only moderate energy efficiency.

4 THE ACCELERATOR OPTION

Chen et al. [4] proposed the DianNao accelerator for faster, more energy-efficient computation of large CNNs and DNNs; it consists of buffers for input/output neurons and synapses plus a neural functional unit (NFU). Reproducing Chen's results, this article finds that the main limitation of that architecture is the memory-bandwidth bottleneck in the convolutional and classifier layers, which therefore becomes the optimization target of this article.
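To make the NFU concrete, here is a minimal sketch of its three pipeline stages for a classifier layer (multipliers, adder tree, nonlinearity). The chunk width `tn=16` and the sigmoid stage are assumptions in the style of DianNao-class designs, not details confirmed by this summary:

```python
import numpy as np

def nfu_classifier(neurons, synapses, tn=16):
    """Sketch of an NFU pass over a CLASS layer: each step, tn input
    neurons are multiplied by their synapses (NFU-1), reduced through
    an adder tree (NFU-2), and a sigmoid is applied once all partial
    sums have accumulated (NFU-3)."""
    partial = np.zeros(synapses.shape[0])
    for start in range(0, neurons.size, tn):
        chunk = neurons[start:start + tn]                  # one buffer-load of inputs
        products = synapses[:, start:start + tn] * chunk   # NFU-1: parallel multipliers
        partial += products.sum(axis=1)                    # NFU-2: adder tree
    return 1.0 / (1.0 + np.exp(-partial))                  # NFU-3: nonlinearity

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
w = rng.standard_normal((8, 64))
out = nfu_classifier(x, w)
# The chunked pipeline is just a matrix-vector product plus sigmoid:
ref = 1.0 / (1.0 + np.exp(-(w @ x)))
print(np.allclose(out, ref))  # True
```

The sketch also makes the bandwidth problem visible: every chunk of synapses is consumed once and never reused in a CLASS layer, so the synapse buffer must be refilled from memory continuously.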

5 A MACHINE-LEARNING SUPERCOMPUTER

In this part, the paper proposes an architecture for high machine-learning performance built from multiple chips that are cheaper than a typical GPU, and whose combined on-chip storage is sufficient for the memory footprint of DNNs and CNNs.
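A back-of-the-envelope calculation shows why distributed on-chip storage changes the picture. All figures below are illustrative assumptions: the layer dimensions and SRAM budget are invented, and the 36 MB per-node eDRAM is my recollection of DaDianNao's reported capacity, not a number stated in this summary:

```python
def synapse_bytes(ni, no, bytes_per_weight=2):
    # A fully connected (CLASS) layer with ni inputs and no outputs
    # must store ni * no synaptic weights.
    return ni * no * bytes_per_weight

MiB = 1024 * 1024
layer = synapse_bytes(2048, 4096)   # 16 MiB of weights for one large layer
single_chip_sram = 2 * MiB          # assumed on-chip SRAM budget of one accelerator
node_edram = 36 * MiB               # per-node eDRAM (as recalled from the paper)
system = 64 * node_edram            # 64-node system

print(layer > single_chip_sram)     # True: a single small accelerator spills off-chip
print(layer <= node_edram)          # True: one eDRAM-equipped node can hold it
print(system // MiB)                # 2304 (MiB of distributed on-chip storage)
```

Once the synapses of a layer fit entirely on-chip, the memory-bandwidth bottleneck identified in Section 4 disappears, at the cost of moving intermediate neuron values between nodes instead.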

5.1 Overview

To meet the memory storage and bandwidth requirements, the paper makes a set of design decisions, detailed in the following subsections.

5.2 Node

5.3 Interconnect

5.4 Programming, Code Generation and Multi-Node Mapping

6 METHODOLOGY

6.1 Measurements

6.2 Baseline

The paper uses tuned open-source CUDA implementations as the baseline in order to maximize baseline quality.

7 EXPERIMENTAL RESULTS

8 REFERENCES

[1] Y. Chen et al. DaDianNao: A Machine-Learning Supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622. doi: 10.1109/MICRO.2014.58.

[2] O. Temam. A Defect-Tolerant Accelerator for Emerging High-Performance Applications. In International Symposium on Computer Architecture, Portland, Oregon, 2012.

[3] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural Acceleration for General-Purpose Approximate Programs. In International Symposium on Microarchitecture, number 3, pages 1-6, 2012.

[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.

Source: https://blog.csdn.net/Hide_in_Code/article/details/116309763