
Understanding Tensorcore Computing Performance of Ampere


I came across the following two passages in the NVIDIA Ampere architecture whitepaper: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

"Using FP16/FP32 mixed-precision Tensor Core operations as an example, at the hardware level, each Tensor Core in the Volta architecture can execute 64 FP16 fused multiply-add operations (FMAs) with FP32 accumulation per clock, allowing it to compute a mixed-precision 4x4x4 matrix multiplication per clock. Since each Volta SM includes eight Tensor Cores, a single SM delivers 512 FP16 FMA operations per clock or 1024 individual FP16 floating point operations per clock. Each of the A100 Tensor Cores can execute 256 FP16 FMA operations per clock, allowing it to compute the results for an 8x4x8 mixed-precision matrix multiplication per clock. Each SM in the A100 GPU includes four of the new redesigned Tensor Cores and therefore each SM in A100 delivers 1024 FP16 FMA operations per clock (or 2048 individual FP16 floating point operations per clock)."
"Comparing total GPU performance, not just SM-level performance, the NVIDIA A100 Tensor Core GPU with its 108 SMs includes a total of 432 Tensor Cores that deliver up to 312 TFLOPS of dense mixed-precision FP16/FP32 performance."
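The 4x4x4 and 8x4x8 shapes above are per-clock hardware shapes and are not exposed directly to software. What CUDA exposes is the warp-level wmma API, which operates on larger tiles (for example 16x16x16 with FP16 inputs and an FP32 accumulator) and lets the hardware break each tile down into the per-clock operations described above. Below is a minimal sketch of one such tile multiply; the kernel name and the row-major layouts are illustrative choices, not anything from the whitepaper.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16x16 tile on Tensor Cores:
// A and B are FP16, the accumulator is FP32 (the mixed-precision mode above).
__global__ void wmma_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start from C = 0
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the Tensor Core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (32 threads), this performs one tile product; a real GEMM kernel tiles this over the full matrices.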


Combined with the explanation in this article: http://hpcchina.net/nvidia-volta-v100-deep-learning-gpu/
I now have a better understanding of the compute speedup Tensor Cores provide:
Take mixed-precision operations on Tensor Cores as an example.
For Volta:
The operation shape is [4x4]FP16 x [4x4]FP16 + [4x4]FP32. At the hardware level, one Volta Tensor Core can execute 4^3 = 64 FP16 multiplies with FP32 accumulation per clock, so its per-clock mixed-precision throughput is 4^3 FMAs. With 8 Tensor Cores per SM, one SM delivers 8 x 64 = 512 FP16 FMAs per clock, i.e. 1024 FP16 FLOPS per clock.
For Ampere:
The operation shape is [8x4]FP16 x [4x8]FP16 + [8x8]FP32. Each A100 Tensor Core can execute 8 x 4 x 8 = 256 FP16 multiplies with FP32 accumulation per clock, so its per-clock mixed-precision throughput is 256 FMAs. Each A100 SM has 4 Tensor Cores (not 8 as in Volta), so one SM delivers 4 x 256 = 1024 FP16 FMAs per clock, i.e. 2048 FP16 FLOPS per clock. With 108 SMs running at 1410 MHz, the total is 108 SMs x 1410 MHz x 2048 FP16 FLOPS = 312 FP16 TFLOPS.
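A quick way to check these peak numbers is SMs x clock x FMAs-per-clock-per-SM x 2 (each FMA counts as two floating-point operations). Here is a minimal host-side sketch; the 1.41 GHz A100 boost clock comes from the figure above, while V100's 80 SMs and 1.53 GHz boost clock are assumptions taken from its published specs:

```cuda
#include <cstdio>

// Peak dense throughput: SMs x clock (GHz) x FMAs/clock/SM x 2 FLOPs per FMA.
static double peak_tflops(int sms, double clock_ghz, int fmas_per_clock_per_sm) {
    return sms * clock_ghz * fmas_per_clock_per_sm * 2.0 / 1e3;  // GFLOPS -> TFLOPS
}

int main() {
    // A100: 108 SMs x 1.41 GHz, 4 Tensor Cores x 256 FMA = 1024 FMA/clock/SM.
    printf("A100 FP16 Tensor Core: %.0f TFLOPS\n", peak_tflops(108, 1.41, 1024));
    // V100: 80 SMs x 1.53 GHz, 8 Tensor Cores x 64 FMA = 512 FMA/clock/SM.
    printf("V100 FP16 Tensor Core: %.0f TFLOPS\n", peak_tflops(80, 1.53, 512));
    return 0;
}
```

This prints roughly 312 TFLOPS for A100 and 125 TFLOPS for V100, matching the dense FP16/FP32 figures NVIDIA quotes for the two GPUs.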
An A100 SM has 8x as many FP64 cores as Tensor Cores (32 FP64 units vs. 4 Tensor Cores), but one Tensor Core performs 256 FP16 FMAs per clock while one FP64 core performs 1 FP64 FMA per clock, so FP16 compute on Tensor Cores is 256/8 = 32x the FP64 compute; FP64 peaks at 9.7 TFLOPS, and 312/9.7 is indeed about 32. Without Tensor Cores, the standard FP16 peak of 78 TFLOPS is 8x the FP64 rate.
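The same per-SM counting reproduces the 32x ratio directly. A tiny sketch, assuming the per-SM unit counts quoted above (4 Tensor Cores and 32 FP64 units per A100 SM):

```cuda
#include <cstdio>

int main() {
    // Per A100 SM, per clock: 4 Tensor Cores x 256 FP16 FMAs each,
    // versus 32 FP64 units x 1 FP64 FMA each.
    int fp16_tc_fmas = 4 * 256;  // 1024 FP16 FMAs/clock/SM
    int fp64_fmas    = 32 * 1;   //   32 FP64 FMAs/clock/SM
    printf("FP16 Tensor Core vs FP64: %dx\n", fp16_tc_fmas / fp64_fmas);  // 32x
    return 0;
}
```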


Tags: Core, Tensor, Computing, clock, SM, FP16, Performance, Ampere, FMA
Source: https://www.cnblogs.com/kongchung/p/14800385.html