Undefined behaviour and nasal demons, or, do not meddle in the affairs of optimizers

Discussions on undefined behaviour often degenerate into vague mumblings about “nasal demons”. This comes from a 1992 usenet post on comp.lang.c, where the poster “quotes” the C89 standard (this is oddly hard to find now) as:

“1.6 Definitions of Terms
In this standard, … “shall not” is to be interpreted as a prohibition
* Undefined behavior — behavior, upon use of a nonportable or erroneous program construct, … for which the standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to having demons fly out of your nose.”

John F. Woods

Essentially what he means is that the compiler is allowed to do anything, including things which seem completely nonsensical. Usually you feel you have a handle on what sort of things the compiler might do, but sometimes it can do some very unexpected things. Of course they make sense in the compiler’s internal logic and are not incorrect, but it can take a long time to figure out what line of reasoning led the compiler to its decision.

I have a nice example which was investigated and distilled down to a minimal example by my friend David McCabe who kindly allowed me to post it here (I’ve changed it slightly). The example involves accidental memory scribbling in the context of derived classes (so the vtable pointers get scribbled over) where the compiler can inline enough to “see” the scribbling and then reason about it.

Here is the code snippet (and Godbolt link to peruse):

#include <cstring>

void e();
void f();
void g();

class A {
public:
    virtual ~A() = default;
};

class B : public A {
public:
    virtual void x() {f();};
};


void r(A* a) {
    B c;
    std::memcpy(static_cast<void*>(a), static_cast<const void*>(&c), sizeof(A));
}

int main() {
    A a;
    r(&a);
    e();
    ((B*)&a)->x();
    g();
}

So, what’s going on, or what’s supposed to be going on? And by “supposed”, I mean what would one think a compiler with no optimizations most likely do, given what I know about how C++ code translates to unoptimized assembly.

So, we have a base class with a vtable, A, and in main create an instance a. Note that A and the derived class B, happen to be the same size (the size of the vtable pointer). The function r is the memory scribbler: it creates an instance of B and memcpy()s it over A. In an intuitive way, the instance a has been turned into an instance of b, because it’s had its vtable pointer changed to a B vtable pointer. This is really the only thing at a low level which actually causes these types to be different, so if you were to cast a pointer to a to a B pointer, it should now work as a B and you can make the virtual function call x on it.

And that’s what main does. The only purpose of e, f, and g are to place markers in the generated code so we can see what’s run.

Now, scribbling over vtables is wildly ridiculously undefined, and nasal demons my fly (spoiler alert: they do). And you also add in some stubs for e, f and g in a separate source file (to prevent it optimizing) like this:

#include <iostream>
void e(){ std::cerr << "e\n"; }
void f(){ std::cerr << "f\n"; }
void g(){ std::cerr << "g\n"; }

then compile and run it (first without optimization), and it prints:

e
f
g

which is what you might expect. But what about with optimizations? On my machine (gcc 7.5.0), if I compile the main file with -O3 and the stubs with no optimization, it prints out:

e
e
⋮
<repeats for a total of 37401 e's>
⋮
e
Segmentation fault (core dumped)

These are the nasal demons. How on earth can the optimizer have turned that code into a loop and a segfault? It turns out it makes sense, but only after a lot of investigation. But first I’m going to consider some other cases before getting on to gcc with -O3. The Godbolt link, does nice colourisation so you can see which program lines match to which assembly lines.

First, here’s what unoptimized GCC 7.5.0 does (note I’ve snipped a lot of the setup and teardown and so on):

        lea     rax, [rbp-24]
        mov     rdi, rax
        call    r(A*)
        call    e()
        lea     rax, [rbp-24]
        mov     rax, QWORD PTR [rax]
        add     rax, 16
        mov     rax, QWORD PTR [rax]
        lea     rdx, [rbp-24]
        mov     rdi, rdx
        call    rax
        call    g()
        lea     rax, [rbp-24]
        mov     rdi, rax
        call    A::~A() [complete object destructor]

I’ve highlighted the most relevant lines, which are in order a call to r, e, then an indirect call via the vtable, then g. It then calls the destructor for A not B. The call is non virtual because a is an instance not a pointer, so it ignores the scribbled vtable and calls the “wrong” destructor.

I’m now going to look at what other compilers do first because GCC 5 and newer definitely do the most unexpected thing. First, the last pre-5 version on Godbolt, GCC 4.9.4 with -O3:

main:
        sub     rsp, 40
        mov     QWORD PTR [rsp+16], OFFSET FLAT:vtable for B+16
        mov     QWORD PTR [rsp], OFFSET FLAT:vtable for B+16
        call    e()
        mov     rax, QWORD PTR [rsp]
        mov     rdi, rsp
        call    [QWORD PTR [rax+16]]
        call    g()
        xor     eax, eax
        add     rsp, 40
        ret

I’ve stripped out everything except main(). Note it’s inlined both r() and the call to memcpy. On line 3, it’s creating the concrete b, and copying in the vtable, on the following line, it’s memcpying that over a, but it’s noticed it can take the data from the source. then it calls e, makes a virtual call (which we know will be f) and then calls g. It inlines and removes the trivial ~A(), tears down main() and leaves.

So this program behaves as “expected”, it would print out e, f, g like the unoptimized one. MSVC 19.14 with /Ox does the same:

main    PROC
$LN19:
        sub     rsp, 40                             ; 00000028H
        lea     rax, OFFSET FLAT:const B::`vftable'
        mov     QWORD PTR a$[rsp], rax
        call    void e(void)                         ; e
        mov     rax, QWORD PTR a$[rsp]
        lea     rcx, QWORD PTR a$[rsp]
        call    QWORD PTR [rax+8]
        call    void g(void)                         ; g
        xor     eax, eax
        add     rsp, 40                             ; 00000028H
        ret     0
main    ENDP

Using 19.28 and compiling for x86 not x64 has an extra flourish where it doesn’t make an indirect call, instead it checks if it has a B vtable and makes a direct call. But that’s the same in essence since it’s still deciding based on the vtable:

_main   PROC
        push    ecx
        mov     DWORD PTR _a$[esp+4], OFFSET const B::`vftable'
        call    void e(void)                         ; e
        mov     eax, DWORD PTR _a$[esp+4]
        mov     eax, DWORD PTR [eax+4]
        cmp     eax, OFFSET virtual void B::x(void)      ; B::x
        jne     SHORT $LN6@main
        call    void f(void)                         ; f
        call    void g(void)                         ; g
        xor     eax, eax
        pop     ecx
$LN6@main:
        lea     ecx, DWORD PTR _a$[esp+4]
        call    eax
        call    void g(void)                         ; g
        xor     eax, eax
        pop     ecx
        ret     0        ret     0

Clang on the other hand will apply its powerful optimizer to this case:

main:                                   # @main
        push    rax
        call    e()
        call    f()
        call    g()
        xor     eax, eax
        pop     rcx
        ret

It’s likely using dataflow, so I am guessing it follows the provenance of the pointer, finds it’s been copied from the B vtable and then devirtualises. Note that if you put in destructors it will call ~A() not ~B() because like GCC, it never makes a virtual call in the first place. So far so sort-of sensible. You can nod approvingly at the power of clang’s optimizer but not feel pessimistic about VS2017 just punting on that and doing the “obvious” thing (except on x86, where it makes the same deduction but it much more conservative with its actions).

But what about GCC? This is present in all version from 5 onwards, and I’m picking 7.5 (my machine version, though 10.2, the latest on Godbolt, does the same) with -O3 and it does:

main:
        sub     rsp, 8
        call    e()
WAT

That’s it. The whole of main(). It calls e(), then nothing. No tear down, no return, nothing, it just falls off the end into whatever code happens to be lying there. This is the source of the nasal demons.

Undefined behaviour is not allowed, according to the standard. Therefore, according to GCC, it does not happen. The compiler goes one step further than clang and noticed that the line after e() has the wrong vtable and that is undefined behaviour so it has deduced that the line is never reached. And the only way for that to happen is if e() never returns therefore it marks the code after e() as unreachable and deletes it.

When the standard says “undefined”, the standard means it, and the compiler is allowed to reason backwards in time to make optimizations with the assumption that the program is valid. This is very unintuitive but entirely legal and part of a very powerful optimization pass.

This isn’t the compiler being, as some people feel, perversely pedantic just to mess with the unwary programmer.

It’s really handy: GCC can take some inlined code and figure out that a pointer is non-null based on how it’s used, and can then, say, travel back through the code with that knowledge and remove all the tests for nullness and alternative branches based on such tests making the inlined code both faster and more compact. So you write your code to be generic, and GCC gives you an extra-fast, extra compact version when it finds a special use case. It’s exactly the sort of thing a person might do, but it’s automatic and woe betide the person who violates the preconditions.

So the last question is why this specific behaviour? Why does it print out many lines and then quit? Well, here’s an objdump of the relevant part of the resulting executable:

int main() {
 8d0:	48 83 ec 08          	sub    $0x8,%rsp
    A a;
    r(&a);
    e();
 8d4:	e8 51 01 00 00       	callq  a2a <e()>
 8d9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

00000000000008e0 <_start>:
 8e0:	31 ed                	xor    %ebp,%ebp
 8e2:	49 89 d1             	mov    %rdx,%r9
 8e5:	5e                   	pop    %rsi
 8e6:	48 89 e2             	mov    %rsp,%rdx
 8e9:	48 83 e4 f0          	and    $0xfffffffffffffff0,%rsp
 8ed:	50                   	push   %rax
 8ee:	54                   	push   %rsp
 8ef:	4c 8d 05 8a 02 00 00 	lea    0x28a(%rip),%r8        # b80 <__libc_csu_fini>
 8f6:	48 8d 0d 13 02 00 00 	lea    0x213(%rip),%rcx        # b10 <__libc_csu_init>
 8fd:	48 8d 3d cc ff ff ff 	lea    -0x34(%rip),%rdi        # 8d0 <main>
 904:	ff 15 d6 16 20 00    	callq  *0x2016d6(%rip)        # 201fe0 <__libc_start_main@GLIBC_2.2.5>
 90a:	f4                   	hlt    
 90b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Look what happens to sit right after main(): it’s _start, which is the operating system’s entry point. So after falling off the end of main, it runs into that and so simply restarts the program entirely from scratch. The program then makes a call to main which never returns because it runs off the end… and you get the idea. Eventually the program exhausts the stack and segfaults. If you change the optimization settings of the file with the stubs in, the behaviour changes, because a different piece of code gets randomly wandered into.

I thought this was fascinating: I know that undefined behaviour can in principle do some very strange things but it’s interesting to see an example where it really does. I would never have predicted infinite recursion as an outcome. So don’t mess with undefined behaviour: one day the compiler might really make demons fly from your nose.