Note: this article is reproduced from serverfault.com.

Why is the %rax register used in the assembly for this procedure with 8 args?

Posted on 2020-12-09 03:42:31

I have the following C function:

void proc(long  a1, long  *a1p,
          int   a2, int   *a2p,
          short a3, short *a3p,
          char  a4, char  *a4p)
{
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}

Using Godbolt, I've converted it to x86_64 assembly (for simplicity, I used the -Og flag to minimize optimizations). It produces the following assembly:

proc:
        movq    16(%rsp), %rax
        addq    %rdi, (%rsi)
        addl    %edx, (%rcx)
        addw    %r8w, (%r9)
        movl    8(%rsp), %edx
        addb    %dl, (%rax)
        ret

I'm confused by the first line of assembly: movq 16(%rsp), %rax. I know that the %rax register is used to store a return value. But the proc procedure doesn't have a return value. So I'm curious why that register is being used here, as opposed to %r9 or some other register which isn't used for the purpose of returning values.

I'm also confused about the location of this instruction, relative to the others. It appears first, well before its destination register %rax is needed for anything (indeed, this register isn't needed until the last step). It also appears before addq %rdi, (%rsi), which is the translation of the first line of code in the procedure (*a1p += a1;).

What am I missing?

Questioner
Richie Thomas
Peter Cordes 2020-12-09 21:52:51

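It's just using a scratch reg to load a stack arg. In the x86-64 System V calling convention, the first six integer/pointer args arrive in RDI, RSI, RDX, RCX, R8, and R9, so a4 and a4p are passed on the stack, at 8(%rsp) and 16(%rsp) on function entry. RAX is the go-to choice for a scratch reg. This function has no return value, so RAX is not special here.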
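To make the "scratch reg for a stack arg" point concrete, here is a rough C model of the generated code, one statement per instruction. The names entry_state and proc_model are made up for illustration, and the stack[] array is only a stand-in for the three qwords at 0, 8, and 16(%rsp) on entry, assuming the x86-64 System V calling convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The function from the question. */
void proc(long a1, long *a1p, int a2, int *a2p,
          short a3, short *a3p, char a4, char *a4p)
{
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}

/* Hypothetical snapshot of the arg-passing state at function entry. */
struct entry_state {
    long   rdi;  long  *rsi;   /* a1, a1p */
    int    edx;  int   *rcx;   /* a2, a2p */
    short  r8w;  short *r9;    /* a3, a3p */
    uint64_t stack[3];         /* [0] return address, [1] a4, [2] a4p */
};

/* Same order as the compiler's output. */
void proc_model(struct entry_state *s)
{
    /* movq 16(%rsp), %rax : a4p goes into a free scratch register */
    char *rax = (char *)(uintptr_t)s->stack[2];
    assert(rax != NULL);

    *s->rsi += s->rdi;                      /* addq %rdi, (%rsi) */
    *s->rcx += s->edx;                      /* addl %edx, (%rcx) */
    *s->r9  += s->r8w;                      /* addw %r8w, (%r9)  */

    /* movl 8(%rsp), %edx : EDX is free again, a2 was already consumed */
    uint32_t edx = (uint32_t)s->stack[1];
    *rax += (char)(edx & 0xFF);             /* addb %dl, (%rax)  */
}
```

Nothing in the model treats RAX as a return-value register. It is simply a register that holds no incoming arg, so it can take the pointer loaded from the stack.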

Scheduling a load early is generally a good idea: it helps hide load-use latency, so out-of-order exec doesn't have to work as hard to hide it. Remember, this is optimized code, so the instructions for each C statement aren't kept together in separate blocks. For something this simple, that's good: un-optimized code would store every arg to the stack and then reload it.
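For contrast, here is a rough C imitation of what un-optimized output does with the args. The function name proc_spill_reload is hypothetical, and volatile is only a trick to force each arg into a memory slot and reload it at every use; it is not literally what the compiler emits at -O0.

```c
#include <assert.h>
#include <stddef.h>

/* Spill every incoming arg to a memory slot, then reload it on use. */
void proc_spill_reload(long a1, long *a1p, int a2, int *a2p,
                       short a3, short *a3p, char a4, char *a4p)
{
    assert(a1p != NULL && a2p != NULL && a3p != NULL && a4p != NULL);

    /* "Store everything to the stack": one slot per arg. */
    volatile long  s_a1 = a1;   long  *volatile s_a1p = a1p;
    volatile int   s_a2 = a2;   int   *volatile s_a2p = a2p;
    volatile short s_a3 = a3;   short *volatile s_a3p = a3p;
    volatile char  s_a4 = a4;   char  *volatile s_a4p = a4p;

    /* "Then reload it": each statement reads its slots back first. */
    *s_a1p += s_a1;
    *s_a2p += s_a2;
    *s_a3p += s_a3;
    *s_a4p += s_a4;
}
```

The result is the same as proc's, just with extra stores and reloads that the -Og output avoids.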

R9 would be a worse choice because it's already occupied (by another arg, a3p) on function entry, which would limit instruction scheduling. More importantly, addb %dl, (%r9) would need a REX prefix while addb %dl, (%rax) doesn't, so it would waste a byte of code size.

The already-in-use downside doesn't apply to R10 or R11 (like RAX they're pure call-clobbered but not used for arg passing), but the code-size downside still does.
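The code-size point can be checked with a tiny encoder sketch. The function name encode_addb_mem and its register numbering are made up for illustration. It only handles the simple (%reg) addressing form with a legacy low-byte source register (AL, CL, DL, or BL), which is all this example needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode `addb %src8, (%base)`.
 *   src8: 0=AL, 1=CL, 2=DL, 3=BL
 *   base: hardware register number 0..15 (0=RAX, 9=R9, 10=R10, 11=R11)
 * Writes 2 or 3 bytes to out and returns the length.
 */
size_t encode_addb_mem(unsigned src8, unsigned base, uint8_t out[3])
{
    assert(src8 < 4);
    assert(base < 16);
    /* Low bits 100 (RSP/R12) need a SIB byte and 101 (RBP/R13) need a
       displacement with this addressing form: not handled here. */
    assert((base & 7) != 4 && (base & 7) != 5);

    size_t n = 0;
    if (base >= 8)
        out[n++] = 0x40 | 0x01;             /* REX prefix with the B bit */
    out[n++] = 0x00;                        /* opcode: ADD r/m8, r8 */
    out[n++] = (uint8_t)((src8 << 3) | (base & 7)); /* ModRM, mod=00 */
    return n;
}
```

With this sketch, addb %dl, (%rax) comes out as the 2 bytes 00 10, while addb %dl, (%r9) comes out as the 3 bytes 41 00 11. R10 and R11 need the same extra prefix byte.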

R9B wouldn't even make sense; the stack arg is a pointer. The only byte register being used is DL (char a4), after loading into EDX.

(A dword load avoids writing a partial register, and movzx / movzbl isn't needed because callers typically write the whole qword, or at least dword, even for narrow args).
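As a small illustration of why no zero- or sign-extension is needed, this sketch models the dword load from the a4 stack slot followed by a use of only the low byte. The helper name low_byte_of_slot is hypothetical, and the slot is modeled as 8 little-endian bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models `movl 8(%rsp), %edx` followed by an instruction that reads %dl. */
unsigned char low_byte_of_slot(const unsigned char slot[8])
{
    assert(slot != NULL);

    /* 4-byte little-endian load from the 8-byte stack slot. */
    uint32_t edx = (uint32_t)slot[0]
                 | ((uint32_t)slot[1] << 8)
                 | ((uint32_t)slot[2] << 16)
                 | ((uint32_t)slot[3] << 24);

    /* Only DL is consumed by addb, so bytes 1..3 of the load (and the
       4 bytes of the slot that were never loaded) cannot affect it. */
    return (unsigned char)(edx & 0xFF);
}
```

Whatever the caller left in the upper bytes of the slot, the byte that reaches addb is the same.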

The compiler could have moved the other stack load (movl 8(%rsp), %edx) earlier as well, but chose not to. That costs little: add %dl, (%rax) is an RMW on (%rax), so the DL data isn't needed until the load from (%rax) has its data ready. Having the RAX address ready early is more valuable than having the DL data ready early, because the address is being used for another load, not just ALU -> store.
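The dependency order inside that one instruction can be written out as separate steps. This is only a conceptual sketch (the function name addb_rmw is made up, and real CPUs split the work into uops in their own way), but it shows which input is needed first.

```c
#include <assert.h>
#include <stddef.h>

/* Conceptual breakdown of `addb %dl, (%rax)` into load, add, store. */
void addb_rmw(char *rax, unsigned char dl)
{
    assert(rax != NULL);

    /* 1. Load: needs only the address in RAX. */
    unsigned char tmp = (unsigned char)*rax;

    /* 2. Add: the first point where the DL data is needed. */
    tmp = (unsigned char)(tmp + dl);

    /* 3. Store: needs the address again, plus the sum. */
    *rax = (char)tmp;
}
```

Step 1 can start as soon as RAX is known, which is why getting the pointer loaded early matters more than getting DL loaded early.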