assembly compiler-optimization cpu-registers x86-64

Why is the %rax register used in the assembly for this procedure with 8 args?

Posted on 2020-12-09 03:42:31

I have the following C function:

void proc(long  a1, long  *a1p,
          int   a2, int   *a2p,
          short a3, short *a3p,
          char  a4, char  *a4p)
{
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}

Using Godbolt, I've converted it to x86_64 assembly (for simplicity, I used the -Og flag to minimize optimizations). It produces the following assembly:

proc:
        movq    16(%rsp), %rax
        addq    %rdi, (%rsi)
        addl    %edx, (%rcx)
        addw    %r8w, (%r9)
        movl    8(%rsp), %edx
        addb    %dl, (%rax)
        ret

I'm confused by the first line of assembly: movq 16(%rsp), %rax. I know that the %rax register is used to store a return value. But the proc procedure doesn't have a return value. So I'm curious why that register is being used here, as opposed to %r9 or some other register which isn't used for the purpose of returning values.

I'm also confused about the location of this instruction, relative to the others. It appears first, well before its destination register %rax is needed for anything (indeed, this register isn't needed until the last step). It also appears before addq %rdi, (%rsi), which is the translation of the first line of code in the procedure (*a1p += a1;).

What am I missing?

Questioner: Richie Thomas

Peter Cordes 2020-12-09 21:52:51

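It's just using RAX as a scratch reg to hold a stack arg: movq 16(%rsp), %rax loads the pointer a4p, the eighth arg, which doesn't fit in the six integer arg-passing registers. RAX is the go-to choice for a scratch reg. This function has no return value, so RAX isn't special here.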
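As a rough sketch of where each integer arg lives on entry to a function like proc, assuming the x86-64 System V calling convention that this compiler output follows: the first six integer/pointer args arrive in registers, and later ones sit in 8-byte stack slots just above the return address. The helper names below are hypothetical, made up for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* The six integer/pointer arg-passing registers of the x86-64 System V ABI,
 * in order. */
static const char *const INT_ARG_REGS[6] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Hypothetical helper: register holding the n-th (1-based) integer arg,
 * or NULL if that arg is passed on the stack. */
const char *int_arg_register(int n) {
    assert(n >= 1);
    return (n <= 6) ? INT_ARG_REGS[n - 1] : NULL;
}

/* Hypothetical helper: offset from %rsp at function entry of the n-th
 * (1-based) integer arg, or -1 if it is passed in a register.
 * 0(%rsp) holds the return address, so stack args start at 8(%rsp). */
long int_arg_stack_offset(int n) {
    assert(n >= 1);
    if (n <= 6)
        return -1;
    return 8L * (n - 6);
}
```

For proc, a4 is arg 7 and a4p is arg 8, so int_arg_stack_offset(7) and int_arg_stack_offset(8) give 8 and 16, matching the 8(%rsp) and 16(%rsp) operands in the compiler output.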

Scheduling a load early is generally a good idea: it helps hide load-use latency, so out-of-order exec doesn't have to work as hard to hide it. Remember, this is optimized code, so the instructions for each C statement don't form separate contiguous blocks. For something this simple, that's a good thing: un-optimized code would store every arg to the stack and then reload it.

R9 would be a worse choice because it's already occupied on function entry (it holds a3p), which would limit instruction scheduling. More importantly, addb %dl, (%r9) would need a REX prefix while addb %dl, (%rax) doesn't, so it would waste a byte of code size.
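To make the code-size point concrete, here is a minimal sketch of a toy encoder for this one instruction form. It is not a real assembler: the function name and the GPR_* constants are hypothetical, and it only handles a source of AL/CL/DL/BL and a base register that needs neither a SIB byte nor a displacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as used in x86-64 instruction encoding (illustrative names). */
enum {
    GPR_RAX = 0, GPR_RCX = 1, GPR_RDX = 2, GPR_RBX = 3,
    GPR_R9 = 9, GPR_R10 = 10, GPR_R11 = 11
};

/* Hypothetical helper: encode `addb %src8, (%base)` (opcode 00 /r) into out
 * and return the number of bytes written. */
size_t encode_addb_reg_to_mem(unsigned src, unsigned base, uint8_t out[3]) {
    assert(src <= 3);                                  /* AL, CL, DL, BL only */
    assert(base <= 15 && (base & 7) != 4 && (base & 7) != 5);
    size_t n = 0;
    if (base >= 8)
        out[n++] = 0x41;                               /* REX.B extends the base register */
    out[n++] = 0x00;                                   /* ADD r/m8, r8 */
    out[n++] = (uint8_t)((src << 3) | (base & 7));     /* ModRM: mod=00, reg=src, rm=base */
    return n;
}
```

With GPR_RDX as the source, a base of GPR_RAX encodes as the 2 bytes 00 10, while a base of GPR_R9 encodes as the 3 bytes 41 00 11. The extra byte is the REX prefix.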

The already-in-use downside doesn't apply to R10 or R11 (like RAX they're pure call-clobbered but not used for arg passing), but the code-size downside still does.

R9B wouldn't even make sense; the stack arg is a pointer. The only byte register being used is DL (char a4), after loading into EDX.

(A dword load avoids writing a partial register, and movzx / movzbl isn't needed because callers typically write the whole qword, or at least dword, even for narrow args).
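A small C model of why the dword load is safe, as a sketch only: it mimics movl 8(%rsp), %edx followed by addb %dl, (%rax), and shows that whatever sits in the upper bytes of the loaded value never reaches the result. The function name is hypothetical, and it uses unsigned char rather than char to keep the wraparound well-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of:
 *     movl 8(%rsp), %edx     ; load 4 bytes of the stack slot holding a4
 *     addb %dl, (%rax)       ; add only the low byte to *a4p
 * `slot` stands in for the dword read from the stack slot. */
void add_low_byte(unsigned char *dst, uint32_t slot) {
    assert(dst != NULL);
    unsigned char dl = (unsigned char)(slot & 0xFFu);  /* DL = low 8 bits of EDX */
    *dst = (unsigned char)(*dst + dl);                 /* byte-sized RMW, wraps mod 256 */
}
```

Calling it with 0x00000005 and with 0xDEADBE05 changes the destination byte by the same amount, since only the low byte 0x05 is ever used.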

The compiler could have moved the movl 8(%rsp), %edx load earlier as well, but chose not to. But add %dl, (%rax) is an RMW on (%rax), so the DL data isn't needed until the load from (%rax) has its data ready. Having the RAX address ready early is more valuable than having the DL data ready early, because the address feeds another load, not just ALU -> store.
