Logarithm in C++ and assembly

Author: Internet

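Apparently the MSVC++ 2017 toolset v141 (x64 Release configuration) does not use the FYL2X x86_64 assembly instruction via a C/C++ intrinsic. Instead, calls to C++ log() or log2() result in a real call to a long function that seems to implement an approximation of the logarithm (without using FYL2X). The performance I measured is also strange: log() (natural logarithm) is 1.7667 times faster than log2() (base-2 logarithm), even though the base-2 logarithm should be easier for the processor, because it stores the exponent (and the mantissa too) in binary format. That seems to be why the CPU instruction FYL2X calculates a base-2 logarithm (multiplied by a parameter).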
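
The reasoning that a base-2 logarithm maps naturally onto the binary floating-point format can be illustrated with a short sketch. It assumes IEEE 754 doubles and uses std::frexp to split x into a mantissa m in [0.5, 1) and an integer exponent e, so that log2(x) = e + log2(m). The helper name Log2ViaFrexp is hypothetical, and this is not a claim about how the MSVC runtime implements log2().

```cpp
#include <cmath>

// Hypothetical helper, for illustration only (not from the original post).
// Splits x into m * 2^e with 0.5 <= m < 1, then uses
// log2(x) = e + log2(m). The exponent part comes for free from the
// binary representation; only log2(m) needs real work.
double Log2ViaFrexp(double x) {
  int e = 0;
  const double m = std::frexp(x, &e);
  return e + std::log2(m);
}
```

For example, Log2ViaFrexp(8.0) decomposes 8 as 0.5 * 2^4 and returns 4 + log2(0.5).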

Here is the code used for the measurements:

#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for(int64_t i=1; i<=cnLogs; i++) {
    sum += std::log2(double(i));
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

void BenchmarkLn() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += std::log(double(i));
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("Ln: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

int main() {
    BenchmarkLog2();
    BenchmarkLn();
    return 0;
}

The output on a Ryzen 1800X is:

Log2: 95152910.728 Ops/sec calculated 2513272986.435
Ln: 168109607.464 Ops/sec calculated 1742068084.525
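
BenchmarkLog2() and BenchmarkLn() differ only in the function being timed, so the loop can be factored into one helper. The following is a minimal sketch, not the code that produced the numbers above. The name TimeLogLoop is hypothetical, and it uses std::chrono::steady_clock instead of high_resolution_clock, which is a substitution made here for monotonic timing.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not in the original post): calls f(1), f(2), ..., f(n),
// prints the throughput, and returns the accumulated sum so the compiler
// cannot discard the loop as dead code.
template <typename F>
double TimeLogLoop(const char* label, int64_t n, F f) {
  double sum = 0;
  const auto start = std::chrono::steady_clock::now();
  for (int64_t i = 1; i <= n; i++) {
    sum += f(double(i));
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  std::printf("%s: %.3f Ops/sec calculated %.3f\n",
              label, n / elapsed.count(), sum);
  return sum;
}
```

A call such as TimeLogLoop("Log2", cnLogs, [](double x) { return std::log2(x); }) would then stand in for BenchmarkLog2(), and the same helper can time any other candidate function with the same signature.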

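So, to shed light on these phenomena (FYL2X not being used, and the strange performance difference), I would also like to test the performance of FYL2X and, if it is faster, use it instead of the <cmath> functions. MSVC++ does not allow inline assembly on x64, so a function in a separate assembly file that uses FYL2X is needed.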

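Could you answer with the assembly code for such a function, one that uses FYL2X or, if newer x86_64 processors have one, a better instruction for computing a logarithm (no specific base is required)?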

Solution:

Here is the assembly code using FYL2X:

_DATA SEGMENT

_DATA ENDS

_TEXT SEGMENT

PUBLIC SRLog2MulD

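; Returns toMul * log2(toLog)
; XMM0L=toLog
; XMM1L=toMul
SRLog2MulD PROC
  ; Spill both arguments to the caller-allocated shadow space so x87 can load them
  movq qword ptr [rsp+16], xmm1
  movq qword ptr [rsp+8], xmm0
  fld qword ptr [rsp+16]        ; ST(0) = toMul
  fld qword ptr [rsp+8]         ; ST(0) = toLog, ST(1) = toMul
  fyl2x                         ; ST(0) = toMul * log2(toLog), one entry popped
  fstp qword ptr [rsp+8]        ; store the result, leaving the x87 stack empty
  movq xmm0, qword ptr [rsp+8]  ; the double return value goes in XMM0
  ret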

SRLog2MulD ENDP

_TEXT ENDS

END

The calling convention follows https://docs.microsoft.com/en-us/cpp/build/overview-of-x64-calling-conventions, for example:

The x87 register stack is unused. It may be used by the callee, but
must be considered volatile across function calls.

The prototype in C++ is:

extern "C" double __fastcall SRLog2MulD(const double toLog, const double toMul);
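
Because the routine returns toMul * log2(toLog), the second argument can select the base: log_b(x) = log_b(2) * log2(x). The sketch below shows this with a portable stand-in, since the assembly routine itself needs MASM to build. Log2MulStandIn and the two wrappers are hypothetical names, and the stand-in uses std::log2 rather than FYL2X, so it illustrates the arithmetic only, not the performance.

```cpp
#include <cmath>

// Portable stand-in for the assembly routine (an assumption for
// illustration): returns toMul * log2(toLog), the same quantity
// that FYL2X computes.
double Log2MulStandIn(double toLog, double toMul) {
  return toMul * std::log2(toLog);
}

// Natural logarithm: ln(x) = ln(2) * log2(x).
double LnViaLog2Mul(double x) {
  const double kLn2 = 0.6931471805599453;  // ln(2)
  return Log2MulStandIn(x, kLn2);
}

// Decimal logarithm: log10(x) = log10(2) * log2(x).
double Log10ViaLog2Mul(double x) {
  const double kLog10Of2 = 0.3010299956639812;  // log10(2)
  return Log2MulStandIn(x, kLog10Of2);
}
```

With the real routine, the same idea would be a call like SRLog2MulD(x, 0.6931471805599453) for a natural logarithm, at no extra cost over the base-2 case.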

The performance is about 2 times slower than std::log2() and more than 3 times slower than std::log():

Log2: 94803174.389 Ops/sec calculated 2513272986.435
FPU Log2: 52008300.525 Ops/sec calculated 2513272986.435
Ln: 169392473.892 Ops/sec calculated 1742068084.525

The benchmarking code is as follows:

void BenchmarkFpuLog2() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += SRPlat::SRLog2MulD(double(i), 1);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("FPU Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

Tags: logarithm, c++, performance, assembly, x86-64
Source: https://codeday.me/bug/20191003/1847234.html