Analyzing the CPU instruction data of a cryptocurrency miner with Intel VTune Profiler
Author: Internet
Monero mining command:
Collection and Platform Info
Application Command Line: D:\share\xmrig-6.18.0-msvc-win64\xmrig-6.18.0\xmrig.exe -o fr.minexmr.com:443 -u 4971qQbWrJRUGDvEUUvqsw29MNz68Cus7d6DAsmTmGoZd4o9AL9FAJiFSvo5uZK1ezguR46n689Rk3zApMZTcB3gQfDMULX -p x --tls
Operating System: Microsoft Windows 10
Computer Name: DESKTOP-ALRVTLS
Result Size: 1.7 GB (size of the full collected data set)
Collection start time: 15:29:48 02/08/2022 UTC
Collection stop time: 15:32:55 02/08/2022 UTC
Collector Type: Event-based sampling driver
Finalization mode: Fast. If the number of collected samples exceeds the threshold, this mode limits the number of processed samples to speed up post-processing.
CPU
Name: Intel(R) microarchitecture code named Rocketlake
Frequency: 2.6 GHz
Logical CPU Count: 12
Cache Allocation Technology
Level 2 capability: not detected
Level 3 capability: not detected
Analysis type:
Run screenshot:
After letting it run for nearly 2 minutes, let's look at the results:
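As a quick sanity check on the run length, a small Python sketch can compute the collection window from the start and stop times above. It can also estimate how many logical CPUs were busy on average, using CPU_CLK_UNHALTED.REF_TSC from the event table further down. The sketch makes three assumptions. The reference clock is taken to tick at the 2.6 GHz frequency VTune reports. The count is taken to be summed over all 12 logical CPUs. And because event-based sampling estimates counts rather than measuring them exactly, the result is only approximate.

```python
from datetime import datetime

# Collection window from the "Collection and Platform Info" block
# (start and stop fall on the same date, so only the times matter).
start = datetime.strptime("15:29:48", "%H:%M:%S")
stop = datetime.strptime("15:32:55", "%H:%M:%S")
elapsed_s = (stop - start).total_seconds()

# CPU_CLK_UNHALTED.REF_TSC from the exported event table.
# Assumption: summed over all logical CPUs.
REF_TSC = 2_522_358_800_000
# Assumption: the reference clock ticks at the reported 2.6 GHz.
TSC_HZ = 2.6e9
LOGICAL_CPUS = 12

busy_cpu_seconds = REF_TSC / TSC_HZ        # unhalted logical-CPU time
avg_busy_cpus = busy_cpu_seconds / elapsed_s

print(f"collection window: {elapsed_s:.0f} s")
print(f"unhalted logical-CPU time: {busy_cpu_seconds:.0f} s")
print(f"average busy logical CPUs: {avg_busy_cpus:.2f} of {LOGICAL_CPUS}")
```

Under these assumptions, the collection window is 187 s, which is a bit over 3 minutes. About 5.2 of the 12 logical CPUs were busy on average.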
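The full data collection is 1.7 GB! That's pretty scary...
Let's look at the overall result:
Purely from a performance standpoint, the bottleneck is in the back end.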
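The back-end bottleneck can be cross-checked by computing the Top-down level-1 fractions directly from the slot counters in the hardware-event table exported later in this post. This is a simplified sketch. Front-end bound is approximated as IDQ_UOPS_NOT_DELIVERED.CORE / TOPDOWN.SLOTS, and whatever remains is labelled Bad Speculation. That is not necessarily the exact formula VTune uses.

```python
# Counts copied from the exported hardware-event table.
SLOTS = 13_658_454_097_535                 # TOPDOWN.SLOTS
BACKEND_BOUND_SLOTS = 9_234_752_770_425    # TOPDOWN.BACKEND_BOUND_SLOTS
RETIRED_SLOTS = 3_905_945_858_910          # UOPS_RETIRED.SLOTS
NOT_DELIVERED_SLOTS = 351_316_053_945      # IDQ_UOPS_NOT_DELIVERED.CORE

backend_bound = BACKEND_BOUND_SLOTS / SLOTS
retiring = RETIRED_SLOTS / SLOTS
frontend_bound = NOT_DELIVERED_SLOTS / SLOTS
# Simplification: treat whatever is left over as Bad Speculation.
bad_speculation = 1.0 - backend_bound - retiring - frontend_bound

for label, value in [
    ("Backend Bound", backend_bound),
    ("Retiring", retiring),
    ("Front-End Bound", frontend_bound),
    ("Bad Speculation (remainder)", bad_speculation),
]:
    print(f"{label:<28} {value:6.1%}")
```

With the counts from this run, it prints roughly 67.6% Backend Bound, 28.6% Retiring, 2.6% Front-End Bound and 1.2% left over. This is consistent with the back-end bottleneck shown in the summary.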
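Next, the Retiring breakdown, to see what the main CPU instructions are actually doing:
There is a fair amount of floating-point (FP) arithmetic, 13%.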
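A related figure can be pulled from the raw counters: the share of retired instructions counted by FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE. VTune's 13% is a slot/uop-based metric, so this instruction-based ratio should not be expected to match it exactly. Also, per Intel's event description, some instructions (for example FMA) increment this counter twice. Treat the result as an approximation.

```python
INST_RETIRED_ANY = 3_769_987_000_000        # INST_RETIRED.ANY
FP_128B_PACKED_DOUBLE = 563_478_403_845     # FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE

# Share of retired instructions counted as 128-bit packed double-precision FP.
fp_instr_share = FP_128B_PACKED_DOUBLE / INST_RETIRED_ANY
# A 128-bit packed-double operation works on two 64-bit elements.
dp_element_ops = 2 * FP_128B_PACKED_DOUBLE

print(f"128-bit packed-double share of instructions: {fp_instr_share:.1%}")
print(f"double-precision element operations: {dp_element_ops:.3e}")
```

For this run, the share is about 14.9%, in the same ballpark as VTune's 13%.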
On the front-end side, things like cache misses and branch mispredictions account for very little:
On the back-end side:
Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).
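Going by the description, the L2 cache seems to be what's holding things back: L1 is at 100% while L2 is too low. At least, that seems to be what it means.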
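The L1/L2 observation can be put into numbers. The sketch below derives load hit rates from the MEM_LOAD_RETIRED.* counts, treating L1_HIT, FB_HIT and L1_MISS as disjoint (an assumption). It also derives a simplified "L2 bound" figure from the CYCLE_ACTIVITY stall counters in the exported table. That figure is cycles stalled on an L1D miss, minus cycles stalled on an L2 miss, divided by thread cycles. It is a simplification for illustration, not VTune's exact metric.

```python
# MEM_LOAD_RETIRED.* counts from the exported table.
L1_HIT = 336_031_008_090    # MEM_LOAD_RETIRED.L1_HIT
FB_HIT = 136_277_038_725    # MEM_LOAD_RETIRED.FB_HIT: missed L1, hit an in-flight line
L1_MISS = 60_759_911_385    # MEM_LOAD_RETIRED.L1_MISS
L2_HIT = 54_858_822_870     # MEM_LOAD_RETIRED.L2_HIT

# Cycle counters from the same table.
CLK = 3_122_854_800_000              # CPU_CLK_UNHALTED.THREAD
STALLS_L1D_MISS = 1_527_559_582_665  # CYCLE_ACTIVITY.STALLS_L1D_MISS
STALLS_L2_MISS = 226_650_679_950     # CYCLE_ACTIVITY.STALLS_L2_MISS

# Assumption: the three load categories are disjoint.
retired_loads = L1_HIT + FB_HIT + L1_MISS
l1_hit_rate = L1_HIT / retired_loads
l1_or_fb_hit_rate = (L1_HIT + FB_HIT) / retired_loads
l2_hit_share_of_l1_misses = L2_HIT / L1_MISS

# Simplified "L2 bound": stalled on an L1D miss that did not also miss L2.
l2_bound_approx = (STALLS_L1D_MISS - STALLS_L2_MISS) / CLK
# Stalled on loads that also missed L2 (served by L3 or DRAM).
beyond_l2_stalls = STALLS_L2_MISS / CLK

print(f"L1 hit rate:                 {l1_hit_rate:.1%}")
print(f"L1 or fill-buffer hit rate:  {l1_or_fb_hit_rate:.1%}")
print(f"L1 misses served by L2:      {l2_hit_share_of_l1_misses:.1%}")
print(f"cycles stalled, L2 hit:      {l2_bound_approx:.1%}")
print(f"cycles stalled, beyond L2:   {beyond_l2_stalls:.1%}")
```

For this run, the results are as follows:
- The L1 hit rate is about 63%, or about 89% counting fill-buffer hits.
- About 90% of L1 misses are served by L2.
- Roughly 42% of cycles are stalled on loads that L2 did serve, versus about 7% on loads that went beyond L2.

If this reading is right, the pressure comes mostly from L2 latency rather than from DRAM.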
Looking at the call stack, nearly all of the time is spent in a single module.
Let's look at the event counts:
Export the hardware event types:
Hardware Events
Hardware Event Type    Hardware Event Count
ARITH.DIVIDER_ACTIVE 571,366,714,095 ==> Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations. ****** Division and square-root operations: this fits the nature of mining!!!
BACLEARS.ANY 24,000,720 ==> The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end. The BACLEARS.ANY event counts the number of baclears for any type of branch.
==> So this looks like where the branch prediction misses show up!
BR_INST_RETIRED.ALL_BRANCHES 179,656,042,170 ==> ALL_BRANCHES counts the number of retired branch instructions of any type. Branch prediction predicts the branch target and lets the processor start executing instructions long before the true execution path of the branch is known. All branches are predicted by the branch prediction unit (BPU). This unit predicts the target address based not only on the EIP of the branch but also on the execution path through which execution reached that EIP. The BPU can efficiently predict the following branch types: conditional branches, direct calls and jumps, indirect calls and jumps, and returns.
BR_MISP_RETIRED.ALL_BRANCHES 695,542,005
CPU_CLK_UNHALTED.DISTRIBUTED 2,762,526,000,000 ==> This event distributes cycle counts between active hyperthreads, i.e. those in C0. A hyperthread becomes inactive when it executes the HLT or MWAIT instruction. If all other hyperthreads are inactive (or disabled, or not present), all counts are attributed to this hyperthread. To get the full count while the core is active, sum the counts of each hyperthread.
CPU_CLK_UNHALTED.REF_TSC 2,522,358,800,000
CPU_CLK_UNHALTED.THREAD 3,122,854,800,000
CPU_CLK_UNHALTED.THREAD_P 3,103,054,654,575
CYCLE_ACTIVITY.CYCLES_L1D_MISS 2,207,076,621,210 ==> Cycles while L1 cache miss demand load is outstanding.
CYCLE_ACTIVITY.CYCLES_MEM_ANY 2,970,053,910,135
CYCLE_ACTIVITY.STALLS_L1D_MISS 1,527,559,582,665
CYCLE_ACTIVITY.STALLS_L2_MISS 226,650,679,950
CYCLE_ACTIVITY.STALLS_L3_MISS 162,225,486,675
CYCLE_ACTIVITY.STALLS_MEM_ANY 1,551,274,653,810
CYCLE_ACTIVITY.STALLS_TOTAL 1,592,284,776,840
DSB2MITE_SWITCHES.PENALTY_CYCLES 1,669,550,085
DTLB_LOAD_MISSES.STLB_HIT:cmask=1 5,694,170,820
DTLB_LOAD_MISSES.WALK_ACTIVE 84,254,527,560
DTLB_STORE_MISSES.STLB_HIT:cmask=1 292,508,775
DTLB_STORE_MISSES.WALK_ACTIVE 370,511,115
EXE_ACTIVITY.1_PORTS_UTIL 273,300,409,950
EXE_ACTIVITY.2_PORTS_UTIL 390,990,586,485
EXE_ACTIVITY.BOUND_ON_STORES 195,000,585
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE 563,478,403,845
FRONTEND_RETIRED.ANY_DSB_MISS 24,163,691,340
FRONTEND_RETIRED.DSB_MISS 660,046,200
FRONTEND_RETIRED.L2_MISS 24,001,680
FRONTEND_RETIRED.LATENCY_GE_16 45,003,150
FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1 25,053,253,605
FRONTEND_RETIRED.LATENCY_GE_4 232,516,275
ICACHE_16B.IFDATA_STALL 2,205,039,690
ICACHE_64B.IFTAG_STALL 1,176,017,640
IDQ.DSB_CYCLES_ANY 710,761,066,140
IDQ.DSB_CYCLES_OK 619,500,929,250
IDQ.DSB_UOPS 3,580,955,371,425
IDQ.MITE_CYCLES_ANY 92,280,138,420
IDQ.MITE_CYCLES_OK 67,200,100,800
IDQ.MITE_UOPS 335,040,502,560
IDQ.MS_SWITCHES 657,019,710
IDQ.MS_UOPS 4,468,634,055
IDQ_UOPS_NOT_DELIVERED.CORE 351,316,053,945
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE 38,835,116,505
ILD_STALL.LCP 7,500,135
INST_RETIRED.ANY 3,769,987,000,000
INST_RETIRED.NOP 90,000,135
INT_MISC.CLEAR_RESTEER_CYCLES 7,215,129,870
INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes 975,017,550
INT_MISC.UOP_DROPPING 16,350,049,050
L1D_PEND_MISS.FB_FULL 3,135,009,405
L1D_PEND_MISS.FB_FULL_PERIODS 180,000,540
L1D_PEND_MISS.L2_STALL 2,910,008,730
L1D_PEND_MISS.PENDING 2,753,288,259,840
L2_RQSTS.ALL_RFO 37,389,560,835
L2_RQSTS.RFO_HIT 24,540,368,100
LD_BLOCKS.STORE_FORWARD 3,000,090
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS 7,704,231,120
MACHINE_CLEARS.COUNT 85,502,565
MEM_INST_RETIRED.ALL_STORES 200,160,600,480
MEM_INST_RETIRED.ANY 732,047,196,135
MEM_INST_RETIRED.LOCK_LOADS 15,001,050
MEM_INST_RETIRED.SPLIT_LOADS 9,000,270
MEM_INST_RETIRED.SPLIT_STORES 12,000,360
MEM_INST_RETIRED.STLB_MISS_LOADS 1,413,042,390
MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT 600,330
MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM 2,401,320
MEM_LOAD_RETIRED.FB_HIT 136,277,038,725
MEM_LOAD_RETIRED.L1_HIT 336,031,008,090
MEM_LOAD_RETIRED.L1_MISS 60,759,911,385
MEM_LOAD_RETIRED.L2_HIT 54,858,822,870
MEM_LOAD_RETIRED.L3_HIT 4,997,549,265
MEM_LOAD_RETIRED.L3_MISS 456,191,520
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4 9,735,029,205
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD 2,673,818,021,430
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO 1,002,168,006,495
RESOURCE_STALLS.SCOREBOARD 5,067,152,010
TOPDOWN.BACKEND_BOUND_SLOTS 9,234,752,770,425
TOPDOWN.SLOTS 13,658,454,097,535
UOPS_DECODED.DEC0 33,000,099,000
UOPS_DECODED.DEC0:cmask=1 17,385,052,155
UOPS_DISPATCHED.PORT_0 910,771,366,155
UOPS_DISPATCHED.PORT_1 994,651,491,975
UOPS_DISPATCHED.PORT_2_3 534,780,802,170
UOPS_DISPATCHED.PORT_4_9 223,530,335,295
UOPS_DISPATCHED.PORT_5 850,201,275,300
UOPS_DISPATCHED.PORT_6 899,491,349,235
UOPS_DISPATCHED.PORT_7_8 207,810,311,715
UOPS_EXECUTED.CYCLES_GE_3 855,031,282,545
UOPS_EXECUTED.THREAD 4,300,326,450,480
UOPS_ISSUED.ANY 4,063,476,095,205
UOPS_RETIRED.SLOTS 3,905,945,858,910
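A few standard ratios help quantify the notes in the table: IPC, how often the divider is busy, and how rare branch mispredictions actually are. This is a minimal sketch that uses only counts copied from the table. The ratios themselves are common ones, but the interpretation of what they mean for mining is a judgment call, not something VTune reports.

```python
INST_RETIRED_ANY = 3_769_987_000_000      # INST_RETIRED.ANY
CLK = 3_122_854_800_000                   # CPU_CLK_UNHALTED.THREAD
DIVIDER_ACTIVE = 571_366_714_095          # ARITH.DIVIDER_ACTIVE
BRANCHES = 179_656_042_170                # BR_INST_RETIRED.ALL_BRANCHES
MISPREDICTS = 695_542_005                 # BR_MISP_RETIRED.ALL_BRANCHES
BACLEARS = 24_000_720                     # BACLEARS.ANY

ipc = INST_RETIRED_ANY / CLK                   # instructions per unhalted cycle
divider_busy = DIVIDER_ACTIVE / CLK            # fraction of cycles the divider is busy
mispredict_rate = MISPREDICTS / BRANCHES       # mispredicted share of retired branches
mispredicts_per_kilo_instr = MISPREDICTS / INST_RETIRED_ANY * 1000
baclears_per_kilo_instr = BACLEARS / INST_RETIRED_ANY * 1000

print(f"IPC:                          {ipc:.2f}")
print(f"divider busy:                 {divider_busy:.1%}")
print(f"branch mispredict rate:       {mispredict_rate:.2%}")
print(f"mispredicts per 1k instr:     {mispredicts_per_kilo_instr:.3f}")
print(f"BACLEARS per 1k instr:        {baclears_per_kilo_instr:.4f}")
```

For this run, the ratios are:
- IPC is about 1.21.
- The divider is busy for about 18.3% of unhalted cycles.
- The branch misprediction rate is about 0.39%, or about 0.18 per 1,000 instructions.
- BACLEARS.ANY is well under 0.01 per 1,000 instructions.

On these numbers, the divide/square-root observation holds up, while branch resteers are present but small.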
Wow, that's way too many. Let me write a small program to sort them before analyzing further.
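The sorting step can be done with a few lines of Python. This sketch assumes the export is the flattened "NAME COUNT" text shown above; the RAW string here is only a short excerpt of it. The code pulls out every event name with its count, including names with ":cmask=..." modifiers and counts that wrapped onto the next line. It then prints the events from largest to smallest count. The helper names parse_events and sorted_events are hypothetical and not part of VTune.

```python
import re

# Short excerpt of the exported text; in practice, read the whole export
# (for example from a file) into this string.
RAW = """Hardware Event Type Hardware Event Count
ARITH.DIVIDER_ACTIVE 571,366,714,095 BACLEARS.ANY 24,000,720
INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes 975,017,550
TOPDOWN.SLOTS 13,658,454,097,535 UOPS_ISSUED.ANY
4,063,476,095,205"""

# Event name (upper case, digits, '_' and '.', plus optional ':key=value'
# modifiers), then whitespace (possibly a newline), then a count that may
# or may not use comma grouping.
EVENT_RE = re.compile(r"([A-Z][A-Z0-9_.]*(?::\w+=\w+)*)\s+(\d+(?:,\d{3})*)")


def parse_events(text):
    """Return {event_name: count} for every 'NAME 1,234' pair in text."""
    return {name: int(count.replace(",", ""))
            for name, count in EVENT_RE.findall(text)}


def sorted_events(events):
    """Return (name, count) pairs ordered from largest to smallest count."""
    return sorted(events.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, count in sorted_events(parse_events(RAW)):
        print(f"{name:<45} {count:>20,}")
```

Pointing RAW at the full exported text gives a descending listing like the one that follows.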
TOPDOWN.SLOTS 13658454097535 ==> skip, probably only used by the analysis itself
TOPDOWN.BACKEND_BOUND_SLOTS 9234752770425 ==> same as above
UOPS_EXECUTED.THREAD 4300326450480 ==> Number of uops to be executed per-thread each cycle. Probably not useful for mining detection.
UOPS_ISSUED.ANY 4063476095205 ==> Uops that Resource Allocation Table (RAT) issues to Reservation Station (RS). Probably not useful for mining detection.
UOPS_RETIRED.SLOTS 3905945858910
INST_RETIRED.ANY 3769987000000
IDQ.DSB_UOPS 3580955371425
CPU_CLK_UNHALTED.THREAD 3122854800000
CPU_CLK_UNHALTED.THREAD_P 3103054654575
CYCLE_ACTIVITY.CYCLES_MEM_ANY 2970053910135
CPU_CLK_UNHALTED.DISTRIBUTED 2762526000000
L1D_PEND_MISS.PENDING 2753288259840
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD 2673818021430
CPU_CLK_UNHALTED.REF_TSC 2522358800000
CYCLE_ACTIVITY.CYCLES_L1D_MISS 2207076621210
CYCLE_ACTIVITY.STALLS_TOTAL 1592284776840
CYCLE_ACTIVITY.STALLS_MEM_ANY 1551274653810
CYCLE_ACTIVITY.STALLS_L1D_MISS 1527559582665
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO 1002168006495
UOPS_DISPATCHED.PORT_1 994651491975
UOPS_DISPATCHED.PORT_0 910771366155
UOPS_DISPATCHED.PORT_6 899491349235
UOPS_EXECUTED.CYCLES_GE_3 855031282545
UOPS_DISPATCHED.PORT_5 850201275300
MEM_INST_RETIRED.ANY 732047196135
IDQ.DSB_CYCLES_ANY 710761066140
IDQ.DSB_CYCLES_OK 619500929250
ARITH.DIVIDER_ACTIVE 571366714095
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE 563478403845
UOPS_DISPATCHED.PORT_2_3 534780802170
EXE_ACTIVITY.2_PORTS_UTIL 390990586485
IDQ_UOPS_NOT_DELIVERED.CORE 351316053945
MEM_LOAD_RETIRED.L1_HIT 336031008090
IDQ.MITE_UOPS 335040502560
EXE_ACTIVITY.1_PORTS_UTIL 273300409950
CYCLE_ACTIVITY.STALLS_L2_MISS 226650679950
UOPS_DISPATCHED.PORT_4_9 223530335295
UOPS_DISPATCHED.PORT_7_8 207810311715
MEM_INST_RETIRED.ALL_STORES 200160600480
BR_INST_RETIRED.ALL_BRANCHES 179656042170
CYCLE_ACTIVITY.STALLS_L3_MISS 162225486675
MEM_LOAD_RETIRED.FB_HIT 136277038725
IDQ.MITE_CYCLES_ANY 92280138420
DTLB_LOAD_MISSES.WALK_ACTIVE 84254527560
IDQ.MITE_CYCLES_OK 67200100800
MEM_LOAD_RETIRED.L1_MISS 60759911385
MEM_LOAD_RETIRED.L2_HIT 54858822870
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE 38835116505
L2_RQSTS.ALL_RFO 37389560835
UOPS_DECODED.DEC0 33000099000
FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1 25053253605
L2_RQSTS.RFO_HIT 24540368100
FRONTEND_RETIRED.ANY_DSB_MISS 24163691340
UOPS_DECODED.DEC0:cmask=1 17385052155
INT_MISC.UOP_DROPPING 16350049050
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD:cmask=4 9735029205
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS 7704231120
INT_MISC.CLEAR_RESTEER_CYCLES 7215129870
DTLB_LOAD_MISSES.STLB_HIT:cmask=1 5694170820
RESOURCE_STALLS.SCOREBOARD 5067152010
MEM_LOAD_RETIRED.L3_HIT 4997549265
IDQ.MS_UOPS 4468634055
L1D_PEND_MISS.FB_FULL 3135009405
L1D_PEND_MISS.L2_STALL 2910008730
ICACHE_16B.IFDATA_STALL 2205039690
DSB2MITE_SWITCHES.PENALTY_CYCLES 1669550085
MEM_INST_RETIRED.STLB_MISS_LOADS 1413042390
ICACHE_64B.IFTAG_STALL 1176017640
INT_MISC.RECOVERY_CYCLES:cmask=1:e=yes 975017550
BR_MISP_RETIRED.ALL_BRANCHES 695542005
FRONTEND_RETIRED.DSB_MISS 660046200
IDQ.MS_SWITCHES 657019710
MEM_LOAD_RETIRED.L3_MISS 456191520
DTLB_STORE_MISSES.WALK_ACTIVE 370511115
DTLB_STORE_MISSES.STLB_HIT:cmask=1 292508775
FRONTEND_RETIRED.LATENCY_GE_4 232516275
EXE_ACTIVITY.BOUND_ON_STORES 195000585
L1D_PEND_MISS.FB_FULL_PERIODS 180000540
INST_RETIRED.NOP 90000135
MACHINE_CLEARS.COUNT 85502565
FRONTEND_RETIRED.LATENCY_GE_16 45003150
FRONTEND_RETIRED.L2_MISS 24001680
BACLEARS.ANY 24000720
MEM_INST_RETIRED.LOCK_LOADS 15001050
MEM_INST_RETIRED.SPLIT_STORES 12000360
MEM_INST_RETIRED.SPLIT_LOADS 9000270
ILD_STALL.LCP 7500135
LD_BLOCKS.STORE_FORWARD 3000090
MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM 2401320
MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT 600330
I'll continue the analysis tomorrow; I can barely keep my eyes open...
Tags: LOAD, VTune, RETIRED, MEM, Profiler, UOPS, CYCLES, MISS, mining. Source: https://www.cnblogs.com/bonelee/p/16545623.html