"Fun bug of the month, mesa edition, episode may"
so if you do "uint64_t some_var = 1 << 31;" in C you get "0xffffffff80000000" as the value, because that's super obvious and not confusing at all.
It's pretty funny getting reminded how non-intuitive and broken C is from time to time.
@karolherbst For my understanding: That's default int promotion + sign extension when widening to 64 bits? Would 1L << 31L fix this or are there other pitfalls with that?
@trilader yeah sure, but any competent and modern language would type the constant to what's expected, not make it int32 by default, because that's just broken imho.
Like any new language doing that today would be considered broken on arrival.
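A minimal sketch of the widening step being discussed, using a conversion that is well defined instead of the shift itself (the shift's status is debated further down the thread). The helper name `widen_from_int32` is made up for illustration.

```c
#include <stdint.h>

/* Converts a signed 32-bit value to uint64_t the same way C does for
 * "uint64_t v = <int32 expression>;". Negative values are converted
 * modulo 2^64, which in two's complement looks like sign extension. */
uint64_t widen_from_int32(int32_t value)
{
    return (uint64_t)value;
}
```

Passing `INT32_MIN` (bit pattern 0x80000000) yields 0xffffffff80000000, the value from the first post; non-negative inputs come through unchanged.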
@karolherbst Yeah. Things like this make me think someone needs to invent -fbackwards-compatible-bs=off
@karolherbst I think that's UB? see C99 6.5.7 "Bitwise shift operators" - the LHS is signed and the result of the computation is not representable in the result type
@karolherbst but apparently gcc has decided to not treat it as UB, except when using UBSAN: https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html
@jann yeah technically it's UB, but there is only so much you can optimize with a 1-2 instruction pattern that it doesn't really matter in practice, because most impls will do the same (more or less).
Like there is UB and then there is UB.
@karolherbst yeah, I guess my point is that, for the code you showed, a C compiler would be well within its rights to refuse to build that code or complain about it, so this is not entirely the language's fault
It was not UB in C90. That is why it is not treated as UB without ubsan ...
@jann ohh it's totally the language's fault even if it weren't UB, because that's just the worst way to specify this.
Like it's just a design bug really. And whether this is UB or not won't change that.
It’s UB in the general case because, if the operand is not a constant, you want to lower it to a shift instruction but C works with targets that have different number representations. Ones' complement, two's complement, or explicit sign bits are all permitted, but all of these will give different behaviours if you flip the top bit.
For wider shifts, different ISAs had different semantics for shifts wider than the register, so C made that fully undefined.
This combination lets you lower source-level shifts to a shift instruction.
C also doesn’t mandate that this be constant evaluated unless the result is used as a constant, so there’s no way to force implementations to diagnose the UB at compile time for this case. But, as a QoI issue, it is permitted and compilers should.
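For variable shift counts, one way to stay out of the undefined cases described above is to use an unsigned operand and check the count before shifting. This is only a sketch; `shl32_or_zero` is a hypothetical helper, and returning 0 for oversized counts is a choice made here, not something C specifies.

```c
#include <stdint.h>

/* Left-shifts a 32-bit unsigned value. Counts of 32 or more are undefined
 * for a 32-bit operand in C, so they are handled explicitly and produce 0. */
uint32_t shl32_or_zero(uint32_t value, unsigned count)
{
    return count < 32 ? value << count : 0;
}
```

The extra comparison means this no longer lowers to a bare shift instruction, which is the trade-off the standard avoids by leaving the wide-shift case undefined.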
@david_chisnall @jann at least C23 fixes one part of this by requiring two's complement for integers.
But also, I just wish C would mandate that constants are just assumed to be of the "expected" type, because in 99.999999% of all cases a programmer really meant the obvious thing with "uint64_t x = 1 << 31".
But I guess we'll just keep those horrible semantics C has in a couple of areas, because nobody wants to fix those things, because "it could break things".
@karolherbst taking off my "understands sign extension" badge
wore it with pride, but the pride was misplaced
@trilader @karolherbst 1L << alone would do the trick on real world 64-bit machines, but i think the compiler is still fully allowed to do the wrong thing. msvc perhaps still has 32-bit longs on 64-bit platforms?
i think you need to make sure it's unsigned so that sign extension has no chance of occurring, so my money is on 1U << 31?
@LunaDragofelis yes by default. It's _technically_ UB, but the point is rather that it's super non intuitive.
Like technically 1 << 32 is also 0, but the example here in combination with sign-extension and language UB is really nasty :)
@LunaDragofelis but also... in gcc specifically it's not UB: "As an extension to the C language, GCC does not use the latitude given in C99 and later to treat certain aspects of signed ‘<<’ as undefined."
@karolherbst I've been bitten by this before because #pronelang has a long list of types that can be encoded into a 64 bit integer (8 of the bits being type tag, the rest being data). This means a lot of shift ops at a modern word size, especially in the test suite.
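A rough sketch of what such a tagged 64-bit encoding could look like. The layout (tag in the top 8 bits, payload in the low 56) and all names here are assumptions for illustration, not the actual scheme of the language mentioned above. The point is that the tag is widened to `uint64_t` before it is shifted.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BITS     8u
#define PAYLOAD_BITS (64u - TAG_BITS)
#define PAYLOAD_MASK ((UINT64_C(1) << PAYLOAD_BITS) - 1)

/* Packs an 8-bit tag and a payload of up to 56 bits into one word.
 * The cast happens before the shift; without it, tag would be promoted
 * to int and a shift by 56 would be undefined for a 32-bit int. */
uint64_t tagged_pack(uint8_t tag, uint64_t payload)
{
    assert(payload <= PAYLOAD_MASK);
    return ((uint64_t)tag << PAYLOAD_BITS) | payload;
}

/* Extracts the tag from the top 8 bits. */
uint8_t tagged_tag(uint64_t word)
{
    return (uint8_t)(word >> PAYLOAD_BITS);
}

/* Extracts the low 56 payload bits. */
uint64_t tagged_payload(uint64_t word)
{
    return word & PAYLOAD_MASK;
}
```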
@pavel @karolherbst @trilader and in some 64-bit architectures too, or rather some 64-bit ABIs: ILP32 (x32), LLP64 (Windows).
The standard-compliant version would be either a uint64_t cast, as suggested elsewhere, or UINT64_C(1)
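The two spellings suggested here, plus the `1U` variant from earlier in the thread, might be sketched like this. The function names are hypothetical; `UINT64_C` comes from `<stdint.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a 64-bit value with only bit n set. The left operand is already
 * a 64-bit unsigned type before the shift, so no sign bit is involved and
 * nothing can be sign-extended afterwards. */
uint64_t bit64(unsigned n)
{
    assert(n < 64);           /* shifting by 64 or more is still undefined */
    return UINT64_C(1) << n;  /* (uint64_t)1 << n has the same effect */
}

/* 1U << 31 also avoids the sign bit, but it assumes unsigned int is at
 * least 32 bits wide and cannot reach bits above the width of unsigned int. */
uint64_t bit31_via_unsigned_int(void)
{
    return 1U << 31;
}
```

Unlike `1L`, neither `UINT64_C(1)` nor the cast depends on whether the ABI is LP64, LLP64 or ILP32.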
@david_chisnall @karolherbst @jann yeah, but this is one of those UBs that should be Implementation-defined rather than Undefined.
Implementation defined doesn't help when a target CPU's behaviour is that the destination register contains an unspecified value.
@david_chisnall @oblomov @jann that's all nice and well, but gcc defines this behavior (or rather doesn't treat such shifts as undefined) and that means most people will see the behavior described.
Which, fun having academic discussions aside, is the thing that really matters here anyway.
And if a compiler defines a behavior, then I would sure hope it deals with all the weirdo CPUs out there, because otherwise it would be a compiler bug.
@karolherbst @pavel @trilader ...but it's tricky, because long-time users of C and C++ prioritize backward compatibility over everything else.
a change to the language that silently changes the meaning of pre-existing code would be... upsetting, to say the least.
@karolherbst @pavel @trilader
in theory you could introduce a way to specify semantics policies in a pragma or something, and then compilers could diagnose TUs that don't indicate their semantics policies...?
i dunno; it's hard...
@karolherbst @pavel @trilader
i think if you want better semantics but you need to interface with old code, the remedy is most likely to come from something like the Carbon language project.
@a1ba oh yeah, we also have tons of macros for all those things, but with refactors and such sometimes things fall through the cracks.
@a1ba it shouldn't tho 😭 this is just such an awful part of the language.
@pavel @karolherbst @trilader
> For example 1 << 32 could be defined as 0 for 32 bit integers...
but...first of all, as mentioned elsewhere in this thread, gcc already defined its behavior for 1 << 32, and gcc's defined behavior is the sign extension seen in the first post.
so if the ISO C standard were to change so that (1 << 32) == 0, then for users of gcc (and compilers that emulate gcc, like clang and EDG-based compilers), the meaning of pre-existing code would quietly change.
@pavel @karolherbst @trilader
second, i was replying to karolherbst's suggestion that the language should determine the type of (1 << 32) based on "what's expected", apparently with the assumption that the user expects the C compiler to apply type-inference rules of languages like Haskell or Swift (where 1 and 32 are essentially typeless without some other hint, like the type of some_var above)...
@pavel @karolherbst @trilader
however, if a programmer has been using C many years, they would be accustomed to its particular typing rules. so "what's expected" is really subjective.
like i said: it's hard.
@pavel finally: if you want minimum surprises in the event of UB...
unfortunately the best options are:
(1) enable sanitizers, or
(2) if you can't enable sanitizers, disable optimization.
this has been argued to death about a billion times since 2010 (and i've always been on the "make UB less surprising" side of those arguments.)
it's a shite state of affairs to be in. best we can do is move on to other languages.
@pavel ok, but... how?
what rule would you add or change in ISO C to get the effect that you want?
and how would you write the new rule so that it doesn't break or quietly change existing C code?
@pavel if you don't have access to the ISO C working draft, you may find it easier to work with the corresponding C++ wording, which has essentially the same effect and is a little more accessible online.
the relevant existing rules are in [expr.shift]:
https://eel.is/c++draft/expr.shift#1.sentence-4
...and [expr.pre]:
https://eel.is/c++draft/expr.pre#4
also: definition of "UB":
https://eel.is/c++draft/defns.undefined
(it is unlikely that the definition of "undefined behavior" will change the way you initially think you want it to change.)
@pavel sanitizer options can already turn UB into abort! and it works just fine in -O2 mode. the clang option you want is:
-fno-sanitize-recover
see also:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
it's also accepted by GCC:
https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html
@pavel i have my build environments set up so that they always build with -fno-sanitize-recover by default, and it's great, because it produces the effect of preventing the compiler from treating UB as unreachable (at least for the forms of UB for which sanitizers are enabled).
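A small sketch of what this setup might look like in practice. The file name `demo.c` and the function `shl_int` are made up, and the build lines in the comment use the `=all` spelling of the flag together with `-fsanitize=undefined`; treat the exact diagnostics as compiler- and version-dependent.

```c
/* Build sketch (hypothetical file name demo.c):
 *   clang -O2 -fsanitize=undefined -fno-sanitize-recover=all demo.c
 *   gcc   -O2 -fsanitize=undefined -fno-sanitize-recover=all demo.c
 *
 * With the sanitizer enabled, a call such as shl_int(1, 32) on a target
 * with 32-bit int is expected to be reported at run time and, because
 * recovery is disabled, to terminate the program instead of continuing. */
int shl_int(int value, int count)
{
    /* Well defined only while count is in range and the result fits in int. */
    return value << count;
}
```

In-range calls such as `shl_int(1, 4)` behave the same with or without the sanitizer.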
@pavel fyi, someone pointed out to me that -fno-sanitize-recover raises security concerns because it exposes a new attack surface, e.g. the ability to use a sanitizer-controlling environment variable to set a log file for sanitizer output; however, there is a remedy:
"For security-sensitive applications consider using Minimal Runtime or trap mode for all checks."
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#security-considerations
so that's -fsanitize-minimal-runtime
or
-fsanitize-trap
@pavel
(i haven't actually dug into the proof-of-concept exploit yet but i plan to soon)