> the vast bulk of sanitizer complaints came from invoking undefined or implementation-defined behavior in harmless ways
This is patently false. Any Undefined Behavior is harmful because it allows the optimizer to insert totally random code, and this is not purely theoretical: it has been demonstrated happening repeatedly. So even if your UB code isn't called, the simple fact it exists may make some seemingly-unrelated code behave wrongly.
It may be false in theory, and probably false in some cases, but (at least temporarily) there are cases (not all! but some) where some current C compilers will never produce bad behavior even if the UB actually happens, for now, and those do include some of the cases mentioned in the article. (Not all, though; e.g. the actually-used cases of signed overflow could behave badly. Of course, if one looks at the assembly and makes sure it's what you want, it will be fine, as long as the code is always compiled by that specific version of the checked compiler.)
For example, in clang/llvm, currently, doing arithmetic UB (signed overflow, out-of-range shifts, offsetting a pointer outside its allocation bounds, offsetting a null pointer, converting an out-of-range float to int, etc.) will never result in anything bad, as long as you don't use the result (where "using it" includes branching on it, using it as a load/store address, or returning from a function a value derived from it, but doesn't include keeping it in a variable, doing further arithmetic on it, or even loading/storing it). Of course that's subject to change and not actually guaranteed by any documentation. Not a thing to rely on, but currently you won't ever need to release an emergency fix and get a CVE number for having "void *mem = malloc(10); void *tmp[1]; tmp[0] = mem-((int)two_billion + (int)two_billion); if (two_billion == 0) foo(tmp); free(mem);" in your codebase (..at least if compiling with clang; I don't know about other compilers). (Yes, that's an immense amount of caveats for an "uhh, technically".)
> So even if your UB code isn't called, the simple fact it exists may make some seemingly-unrelated code behave wrongly.
This is fortunately not true. If it were, it would make runtime checks pointless. Consider this code:

free(ptr);
already_freed = true;
if (!already_freed) {
    free(ptr);
}
The second free would be undefined behavior, but since it never runs, the snippet is fine.

> This is patently false. Any Undefined Behavior is harmful because it allows the optimizer to insert totally random code
Undefined to who, though? Specific platforms and toolchains have always attached defined behavior to stuff the standard lists as undefined, and provided ways (e.g. toolchain-specific volatile semantics, memory barriers, intrinsic functions) to exploit that. Even things like inline assembly live in this space of dancing around what the standard allows. And real systems have been written to those tools, successfully. At the bottom of the stack, you basically always have to deal with stuff like this.
Your point is a pedantic proscription, basically. It's (heh) "patently false" to say that "Any Undefined Behavior is harmful".
Yeah, this is especially true if you're writing a libc. E.g., every libc allocator in existence invokes UB with respect to ISO C when it reads metadata placed before a malloc()ed block of memory. Doubly so, since ISO C arguably doesn't even allow a container_of() mechanism at all. At some point, you have to look at what the implementation is actually expecting of your code.
To pick a slightly less obvious example, I doubt (but haven't tried to prove) that it's possible to use the POSIX ancillary data API for Unix domain sockets (i.e., SCM_RIGHTS) without invoking UB.
I think that's technically possible, but you have to use a freshly allocated (and zeroed) buffer every single time, since after it's been written to once, it's 'tainted' with the effective types of the structs stored in it. (Though it looks like the ISO C2y draft finally has language to address this case, with a concept of "byte arrays" that can hold objects of other types, as long as you keep alignments in mind.)
The bigger issue with ISO C and POSIX is everything around 'struct sockaddr': you don't have any way of knowing what types the implementation is internally reading in or writing out. If you give it a casted pointer to a 'struct sockaddr_in' but it reads the sa_family from the 'struct sockaddr *', that's UB; ditto if accept() gives you a 'struct sockaddr_in' and you read the sa_family through a 'struct sockaddr *'. Or if you use 'struct sockaddr_storage' at all, that's also UB. IIRC, the latest POSIX edition just tells implementations to "pretty please allow aliasing between these types in particular!"
Of course, POSIX has nothing on Windows APIs, many of which encourage the caller to cast around pointers with impunity. As far as I'm aware, MSVC doesn't care about strict aliasing at all, and only has a minimal set of optimizations for 'restrict' pointers.
Or use the results of mmap() or futex() system calls, or model the behavior around barrier/serializing instructions. MMIO is likewise right out. It's just asking too much of the poor language standard to rigorously specify every possible thing you might do with C syntax, even though many of those things are very valuable to implement in a high-level language.
So they punted and left it up to the toolchains, and the toolchains admirably picked up the slack and provided good tools. The problem then becomes the pedants who invent rules like "any UB is harmful" above, not realizing the UB-reliant code is plugging the holes keeping their system afloat.
Ubsan docs do mention one case where UB is defined by the implementation: floating point division by zero: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
By contrast, I'd assume any other report by ubsan to be fair game for the optimizer to do its thing and generate code that differs from what the developer likely intended. If not in the current version, then maybe in a future one.
Unless the optimizer has rules around its own intrinsics or whatever, though. Again, this isn't a black/white issue. You need to know what specific toolchains are going to do when writing base level software like C libraries (or RTOS kernels, which is my wheelhouse). The semantics defined by the standard are not sufficient.
Memory barrier builtins, inline assembly, and other builtins may not be defined by the standard itself, but in my understanding they don't in themselves lead to undefined behavior, since they are compiler extensions and defined by the implementation. (Though invalid use can lead to UB, such as passing 0 to __builtin_clz or modifying registers outside the clobber list in inline assembly.)
With that being said, I would definitely expect that the small set of UB that ubsan reports is actually undefined for the compiler that implements the sanitizer (meaning: either problematic now or problematic in some future update).
I agree. The typical ubsan sanitizers (not all sanitizers, though) detect serious issues which should be fixed in all cases, and I would consider it best practice to run testing with sanitizers (I would also recommend running many of them in production).
> optimizer to insert totally random code
What are you even saying? What is your definition of "random code"? FYI, UB is exactly (one of) the places where an optimizer can insert optimized code.
To take an example from the post: in some cases a value was computed that could overflow, but it was not used because of a later overflow check. I think the optimizer would be fully within its rights to delete the code inside the overflow check, because the computation implicitly asserts that it won't overflow (since overflow is undefined). I think this is a more or less useful way of thinking about UB: any operation you put in your program implicitly asserts that the values are such that UB won't happen. For example, dereferencing a pointer implicitly means it cannot be NULL, because dereferencing NULL is UB, so anything downstream of that dereference which checks for NULL can be deleted.
Unfortunately UB is an umbrella term for all sorts of things, and some of those can be very harmful/unexpected, while others are (currently) harmless - but that may change in new compiler versions.
The typical optimization showcase (better code generation for signed integer loop counters) only works when the (undefined-behaviour) signed integer overflow doesn't actually happen (i.e. the compiler is free to assume that the loop counter won't overflow). But when the signed integer overflow does happen, all bets are off as to what actually happens to the control flow, while that same signed integer overflow in another place may simply wrap around.
Another similar example is to specifically 'inject' UB by putting a `std::unreachable` into the default case of a switch statement. This enables an optimization where the compiler omits the range check before accessing the switch-case jump table. But if the switch variable's value isn't handled by any case branch, the jump-table access may be out of bounds and there will be a jump to a random location.
In other situations the compiler might even be able to detect at compile time that the UB is triggered and simply generate broken code (usually by optimizing away some critical part), or, if you're lucky, the compiler inserts a ud2 instruction which crashes the process.
Not OP, but here's an example of "random code" inserted by the compiler[1]: note the assembly instruction "ud2" ("invalid opcode exception" in x86 land) instead of "ret" in not_ok().
You might think this code would be fine if address 0 were mapped to RAM, but both gcc and clang know it's undefined behavior to use the null pointer like that, so they add "random code" that forces a processor exception.