Performance optimisations of x86-64 assembly - Alignment and branch prediction

Question

I'm currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE-2 instructions.

So far I've managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise more.

For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there's absolutely no reason for it in terms of the code itself.

So my guess is that there are some issues with code alignment, and/or with branches that get mispredicted.

I know that, even with the same architecture (x86-64), different CPUs have different algorithms for branch prediction.

But is there some general advice about code alignment and branch prediction when developing for high performance on x86-64?

In particular, about alignment, should I ensure all labels used by jump instructions are aligned on a DWORD?

_func:
    ; ... Some code ...
    test rax, rax
    jz   .label
    ; ... Some code ...
    ret
    .label:
        ; ... Some code ...
        ret

In the previous code, should I use an align directive before .label:, like:

align 4
.label:

If so, is it enough to align on a DWORD when using SSE-2?

And about branch prediction, is there a "preferred" way to organise the labels used by jump instructions, in order to help the CPU, or are today's CPUs smart enough to determine that at runtime by counting the number of times a branch is taken?

Edit

OK, here's a concrete example - here's the start of strlen() with SSE-2:

_strlen64_sse2:
    mov         rsi,    rdi
    and         rdi,    -16
    pxor        xmm0,   xmm0
    pcmpeqb     xmm0,   [ rdi ]
    pmovmskb    rdx,    xmm0
    ; ...

Running it 10'000'000 times with a 1000-character string takes about 0.48 seconds, which is fine.
But it does not check for a NULL string input. So obviously, I'll add a simple check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    ; ...

Same test: it now runs in 0.59 seconds. But if I align the code after this check:

_strlen64_sse2:
    test       rdi,    rdi
    jz          .null
    align      8
    ; ...

The original performance is back. I used 8 for alignment, as 4 doesn't change anything.
Can anyone explain this, and give some advice about when to align, or not to align, code sections?
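
Timings like the ones quoted above (many calls on a 1000-character string) could be reproduced with a small C harness along the following lines. This is only a sketch: the name bench_strlen is hypothetical, it times the C library's strlen() rather than the hand-written routine, and clock() is a coarse timer, so differences as small as 0.48 s versus 0.59 s should be confirmed over several runs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical harness: calls strlen() `iters` times on a string of `len`
 * characters and returns the sum of the reported lengths. If `seconds` is
 * not NULL, it receives the elapsed CPU time measured with clock(). */
unsigned long long bench_strlen(size_t len, unsigned long iters, double *seconds)
{
    char *s = malloc(len + 1);
    if (s == NULL)
        return 0;
    memset(s, 'a', len);
    s[len] = '\0';

    /* Calling through a volatile pointer keeps the compiler from hoisting
     * or folding the call out of the timing loop. */
    size_t (*volatile fn)(const char *) = strlen;

    unsigned long long total = 0;
    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++)
        total += fn(s);
    clock_t stop = clock();

    if (seconds != NULL)
        *seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    free(s);
    return total;
}
```

To time an assembly routine instead, the pointer fn would be initialised with that routine's symbol (declared extern with the same signature). The returned total doubles as a correctness check, since it must equal len * iters.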

Edit 2

Of course, it's not as simple as aligning every branch target. If I do that, performance usually gets worse, except in some specific cases like the one above.

Answer

Alignment optimisations

1. Use `.p2align <abs-expr>, <abs-expr>, <abs-expr>` instead of `align`.

This GNU assembler directive gives fine-grained control through its 3 params:

param1 - The boundary to align to, given as a power-of-two exponent (e.g. 4 aligns to 16 bytes).
param2 - What to fill the padding with (zeroes or NOPs).
param3 - Do NOT align if the padding would exceed this number of bytes.
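
The interaction between param1 and param3 can be sketched in C as a small calculation of how many padding bytes the directive would emit at a given address. The function name p2align_padding and the convention that a max_skip of 0 means "no limit" are assumptions of this sketch, not part of the assembler.

```c
#include <stdint.h>

/* Returns how many padding bytes a `.p2align log2_align, , max_skip`
 * directive would emit when the location counter is at `addr`.
 * Convention of this sketch only: max_skip == 0 means "no limit".
 * Assumes log2_align < 64. */
uint64_t p2align_padding(uint64_t addr, unsigned log2_align, uint64_t max_skip)
{
    uint64_t boundary = (uint64_t)1 << log2_align;

    /* Distance to the next multiple of `boundary`; 0 if already aligned. */
    uint64_t pad = (boundary - (addr & (boundary - 1))) & (boundary - 1);

    /* param3: if too much padding would be needed, do not align at all. */
    if (max_skip != 0 && pad > max_skip)
        return 0;
    return pad;
}
```

For example, at address 0x1003 a 16-byte alignment (exponent 4) needs 13 bytes of padding, so with a limit of 10 bytes the directive would leave the code unaligned rather than spend that much space.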

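Aligning code this way increases the chance that an entire code block lies within a single cache line. Once loaded into the L1 cache, it can then run entirely without accessing RAM for instruction fetch. This is highly beneficial for loops with a large number of iterations.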

  /* nop */
  static const char nop_1[] = { 0x90 };

  /* xchg %ax,%ax */
  static const char nop_2[] = { 0x66, 0x90 };

  /* nopl (%[re]ax) */
  static const char nop_3[] = { 0x0f, 0x1f, 0x00 };

  /* nopl 0(%[re]ax) */
  static const char nop_4[] = { 0x0f, 0x1f, 0x40, 0x00 };

  /* nopl 0(%[re]ax,%[re]ax,1) */
  static const char nop_5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopw 0(%[re]ax,%[re]ax,1) */
  static const char nop_6[] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  /* nopl 0L(%[re]ax) */
  static const char nop_7[] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 };

  /* nopl 0L(%[re]ax,%[re]ax,1) */
  static const char nop_8[] =
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

  /* nopw 0L(%[re]ax,%[re]ax,1) */
  static const char nop_9[] =
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

  /* nopw %cs:0L(%[re]ax,%[re]ax,1) */
  static const char nop_10[] =
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 };

(Up to 10-byte NOPs for x86. Source: binutils-2.2.3.)
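
As an illustration of how such a table can be used, the following C sketch fills a padding region with as few NOP instructions as possible by always taking the longest encoding that still fits. The byte sequences are the ones listed above. The function name fill_nops and the greedy strategy are assumptions of this sketch, and a real assembler may choose its padding differently.

```c
#include <stddef.h>
#include <string.h>

/* Row k-1 holds the k-byte NOP encoding from the table above. */
static const unsigned char nops[10][10] = {
    { 0x90 },
    { 0x66, 0x90 },
    { 0x0f, 0x1f, 0x00 },
    { 0x0f, 0x1f, 0x40, 0x00 },
    { 0x0f, 0x1f, 0x44, 0x00, 0x00 },
    { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 },
    { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 },
    { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

/* Fills buf[0..n-1] with NOPs, longest encodings first, and returns the
 * number of NOP instructions written. */
size_t fill_nops(unsigned char *buf, size_t n)
{
    size_t count = 0;
    while (n > 0) {
        size_t k = n > 10 ? 10 : n;   /* longest encoding that still fits */
        memcpy(buf, nops[k - 1], k);
        buf += k;
        n -= k;
        count++;
    }
    return count;
}
```

Fewer, longer NOPs matter because each NOP still has to be decoded when execution falls through the padding, so thirteen bytes of padding cost two instructions here instead of thirteen single-byte ones.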

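There is a lot of variation between x86_64 micro-architectures/generations. However, a common set of guidelines applicable to all of them can be summarised as follows. Reference: Section 3 of Agner Fog's x86 micro-architecture manual.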

Loop detection logic is guaranteed to work ONLY for loops with < 64 iterations. This is because a branch instruction is recognised as having loop behaviour if it goes one way n-1 times and then goes the other way once, for any n up to 64.

This doesn't really apply to the predictors in Haswell and later, which use a TAGE predictor and don't have dedicated loop-detection logic for specific branches. Iteration counts of ~23 can be the worst case for an inner loop inside a tight outer loop with no other branching, on Skylake: the exit from the inner loop mispredicts most times, and the trip count is so low that this happens often. Unrolling can help by shortening the pattern, but for very high loop trip counts the single mispredict at the end is amortised over a lot of trips, and it would take an unreasonable amount of unrolling to do anything about it.
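
The amortisation argument can be made concrete with a toy simulation. The C sketch below models a textbook 2-bit saturating counter, which is explicitly NOT the TAGE predictor of Haswell or Skylake and is used here only as an assumed stand-in, and counts how often the loop-closing branch of an inner loop is mispredicted. The function name count_mispredicts is hypothetical.

```c
/* Counts mispredictions of a textbook 2-bit saturating counter on the
 * back-edge branch of an inner loop that runs `trip_count` times and is
 * itself executed `outer_iters` times. This is a teaching model only and
 * does not describe any real CPU's predictor. */
unsigned long count_mispredicts(unsigned trip_count, unsigned long outer_iters)
{
    unsigned state = 3;  /* 0..1 predict not-taken, 2..3 predict taken */
    unsigned long misses = 0;

    for (unsigned long o = 0; o < outer_iters; o++) {
        for (unsigned i = 0; i < trip_count; i++) {
            int taken = (i + 1 < trip_count);  /* the last pass exits the loop */
            int predicted = (state >= 2);

            if (predicted != taken)
                misses++;

            /* Saturating update towards the actual outcome. */
            if (taken) {
                if (state < 3)
                    state++;
            } else {
                if (state > 0)
                    state--;
            }
        }
    }
    return misses;
}
```

In this model every exit from the inner loop is mispredicted whatever the trip count, so the number of misses depends only on how many times the loop is entered. With a trip count of 23 that is one miss per 23 branches, while with a trip count of 1000 the same single miss is spread over 1000 branches.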

Far jumps are not predicted, i.e. the pipeline always stalls on a far jump to a new code segment (CS:RIP). There's basically never a reason to use a far jump anyway, so this is mostly not relevant.

Indirect jumps with an arbitrary 64-bit absolute address are predicted normally on most CPUs.

But Silvermont (Intel's low-power CPUs) has some limitations in predicting indirect jumps when the target is more than 4GB away, so avoiding that by loading/mapping executables and shared libraries in the low 32 bits of virtual address space can be a win there, e.g. on GNU/Linux by setting the environment variable LD_PREFER_MAP_32BIT_EXEC. See Intel's optimization manual for more.
